
Tuesday, February 10, 2015

Nicobar: Dynamic Scripting Library for Java

By James Kojo, Vasanth Asokan, George Campbell, Aaron Tull


The Netflix API is the front door to the streaming service, handling billions of requests per day from more than 1000 different device types around the world. To provide the best experience to our subscribers, it is critical that our UI teams have the ability to innovate at a rapid pace. As described in our blog post a year ago, we developed a Dynamic Scripting Platform that enables this rapid innovation.

Today, we are happy to announce Nicobar, the open source script execution library that allows our UI teams to inject UI-specific adapter code dynamically into our JVM without the API team’s involvement. Named after a remote archipelago in the eastern Indian Ocean, Nicobar allows each UI team to have its own island of code to optimize the client/server interaction for each device, evolved at its own pace.

Background

As of this post’s writing, a single Netflix API instance hosts hundreds of UI scripts, developed by a dozen teams. Together, they deploy anywhere from a handful to a hundred UI scripts per day. A strong, core scripting library is what allows the API JVM to handle this rate of deployment reliably and efficiently.

Our success with the scripting approach in the API platform led us to identify other applications that could also benefit from the ability to alter their behavior without a full-scale deployment. Nicobar is a library that provides this functionality in a compact and reusable manner, with pluggable support for JVM languages.

Architecture Overview

Early implementations of dynamic scripting at Netflix used basic Java classloader technology to host scripts and sandbox them from one another. While this was a good start, it was not nearly enough. Standard Java classloaders can have only one parent, and thus allow only simple, flattened hierarchies; this is a significant limitation when classloaders need to be shared, and it leads to inefficient use of memory. Also, code loaded within standard classloaders is fully visible to downstream classloaders. Finer-grained visibility controls are helpful in restricting which packages are exported from and imported into classloaders.

Given these experiences, we designed into Nicobar a script module loader that holds a graph of inter-dependent script modules. Under the hood, we use JBoss Modules (which is open source) to create Java modules. JBoss modules are a powerful extension to basic Java classloaders, allowing for arbitrarily complex classloader dependency graphs, including multiple parents. They also support sophisticated package filters that can be applied to incoming and outgoing dependency edges.

A script module provides an interface to retrieve the list of java classes held inside it. These classes can be instantiated and methods exercised on the instances, thereby “executing” the script module.

Script source and resource bundles are represented by script archives. Metadata for the archives is defined in the form of a script module specification, where script authors can describe the content language, inter-module dependencies, import and export filters for packages, as well as user specific metadata.

Script archive contents can be in source form and/or in precompiled form (.class files). At runtime, script archives are converted into script modules by running the archive through compilers and loaders that translate any source found into classes, and then loading up all classes into a module. Script compilers and loaders are pluggable, and out of the box, Nicobar comes with compilation support for Groovy 2, as well as a simple loader for compiled java classes.

Archives can be stored into and queried from archive repositories on demand, or via a continuous repository poller. Out of the box, Nicobar comes with a choice of file-system-based or Cassandra-based archive repositories.

As the usage of a scripting system grows in scale, there is often the need for an administrative interface that supports publishing and modifying script archives, as well as viewing published archives. Towards this end, Nicobar comes with a manager and explorer subproject, based on Karyon and Pytheas.

Putting it all together

The diagram below illustrates how all the pieces work together.


Usage Example - Hello Nicobar!

Here is an example of initializing the Nicobar script module loader to support Groovy scripts.

Create a script archive

Create a simple Groovy script archive with the following Groovy file:
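(HelloWorld.groovy below is an illustrative example; the class simply implements Callable so it can be looked up and executed later.)

import java.util.concurrent.Callable

// A trivial script class; the module loader will compile this source into a class
// that the hosting application can later retrieve and execute.
class HelloWorld implements Callable<String> {
    @Override
    String call() {
        return "Hello, Nicobar!"
    }
}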

Add a module specification file moduleSpec.json, along with the source:
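(The field names in this illustrative specification are assumptions about Nicobar's module spec format rather than the exact schema.)

{
  "moduleId": "helloworld",
  "compilerPluginIds": ["groovy2"],
  "moduleDependencies": [],
  "metadata": {}
}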

Jar the source and module specification together as a jar file. This is your script archive.


Create a script module loader
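The sketch below registers the Groovy 2 compiler plugin with a module loader. The builder methods, plugin class name, and packages are assumptions based on Nicobar's Groovy 2 support, not verbatim API usage.

// Sketch: a module loader that knows how to compile Groovy 2 script archives.
// Builder method names and the plugin class name are assumptions.
ScriptCompilerPluginSpec groovy2Plugin = new ScriptCompilerPluginSpec.Builder("groovy2")
    .withPluginClassName("com.netflix.nicobar.groovy2.plugin.Groovy2CompilerPlugin")
    .build()

ScriptModuleLoader moduleLoader = new ScriptModuleLoader.Builder()
    .addPluginSpec(groovy2Plugin)
    .build()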

Create an archive repository

If you have more than a handful of scripts, you will likely need a repository representing the collection. Let’s create a JarArchiveRepository, which is a repository of script archive jars at some file system path. Copy helloworld.jar into /tmp/archiveRepo to match the code below.

Hooking up the repository poller provides dynamic updates of discovered modules into the script module loader. You can wire up multiple repositories to a poller, which would poll them iteratively.
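Below is a sketch of both steps; the repository and poller builders shown here are assumptions that may differ from the library's actual classes.

import java.nio.file.Paths
import java.util.concurrent.TimeUnit

// Sketch: a repository of script archive jars under /tmp/archiveRepo, wired to a
// poller that pushes newly discovered or updated archives into the module loader.
def repository = new JarArchiveRepository.Builder(Paths.get("/tmp/archiveRepo")).build()

def poller = new ArchiveRepositoryPoller.Builder(moduleLoader).build()
poller.addRepository(repository, 30, TimeUnit.SECONDS, true)   // initial poll, then every 30 seconds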

Execute script
Script modules can be retrieved out of the module loader by name (and an optional version). Classes can be retrieved from script modules by name, or by type. Nicobar itself is agnostic to the type of the classes held in the module, and leaves it to the application’s business logic to decide what to extract out and how to execute classes.

Here is an example of extracting a class implementing Callable and executing it:
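(This sketch continues the hello-world example above; the module accessor methods named here are assumptions.)

import java.util.concurrent.Callable

// Sketch: fetch the module by name, find the class implementing Callable, and run it.
def module = moduleLoader.getScriptModule("helloworld")
def callableClass = module.getLoadedClasses().find { Callable.isAssignableFrom(it) }

def hello = (Callable) callableClass.newInstance()
println hello.call()   // prints "Hello, Nicobar!"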

At this point, any changes to the script archive jar will result in an update of the script module inside the module loader and new classes reflecting the update will be vended seamlessly!

More about the Module Loader

In addition to the ability to dynamically inject code, Nicobar’s module loading system also allows multiple variants of a script module to coexist, providing for runtime selection of a variant. As an example, tracing code execution involves adding instrumentation code, which adds overhead. Using Nicobar, the application could vend classes from an instrumented version of the module when tracing is needed, while vending classes from the uninstrumented, faster version of the module otherwise. This paves the way for on-demand tracing of code without adding constant overhead to all executions.

Module variants can also be leveraged to perform slow rollouts of script modules. When a module deployment is desired, a portion of the control flow can be directed through the new version of the module at runtime. Once confidence is gained in the new version, the update can be “completed”, by flushing out the old version and sending all control flow through the new module.

Static parts of an application may benefit from a modular classloading architecture as well. Large applications loaded into a monolithic classloader can become unwieldy over time, due to an accumulation of unintended dependencies and tight coupling between various parts of the application. In contrast, loading components using Nicobar modules allows for well-defined boundaries and fine-grained isolation between them. This, in turn, facilitates decoupling of components, thereby allowing them to evolve independently.

Conclusion

We are excited by the possibilities around creating dynamic applications using Nicobar. As usage of the library grows, we expect to see various feature requests around access controls, additional persistence and query layers, and support for other JVM languages.

Project Jigsaw, the JDK’s native module loading system, is on the horizon too, and we are interested in seeing how Nicobar can leverage native module support from Jigsaw.

If these kinds of opportunities and challenges interest you, we are hiring and would love to hear from you!

Monday, March 3, 2014

The Netflix Dynamic Scripting Platform

The Engine that powers the Netflix API
Over the past couple of years, we have optimized the Netflix API with a view towards improving performance and increasing agility. In doing so, the API has evolved from a provider of RESTful web services to a platform that distributes development and deployment of new functionality across various teams within Netflix.

At the core of the redesign is a Dynamic Scripting Platform which provides us the ability to inject code into a running Java application at any time. This means we can alter the behavior of the application without a full scale deployment. As you can imagine, this powerful capability is useful in many scenarios. The API Server is one use case, and in this post, we describe how we use this platform to support a distributed development model at Netflix.

As a reminder, devices make HTTP requests to the API to access content from a distributed network of services. Device Engineering teams use the Dynamic Scripting Platform to deploy new or updated functionality to the API Server based on their own release cycles. They do so by uploading adapter code in the form of Groovy scripts to a running server and defining custom endpoints to front those scripts. The scripts are responsible for calling into the Java API layer and constructing responses that the client application expects. Client applications can access the new functionality within minutes of the upload by requesting the new or updated endpoints. The platform can support any JVM-compatible language, although at the moment we primarily use Groovy.
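To make that concrete, a device adapter script might look roughly like the sketch below. The endpoint base class, request type, and Java API facade named here are illustrative stand-ins, not the platform's real classes.

// Hypothetical shape of a device-specific endpoint adapter script.
class TvHomeScreenEndpoint extends UiEndpointScript {

    @Override
    Object handle(EndpointRequest request) {
        // Call into the Java API layer (javaApi is assumed to be provided by the base class)...
        def videos = javaApi.videoService().topPicksFor(request.customerId, 10)

        // ...and shape the payload exactly the way this device's UI expects it.
        return [rows: videos.collect { [id: it.id, title: it.title, boxart: it.boxartUrl] }]
    }
}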

Architecting for scalability is a key goal for any system we build at Netflix and the Scripting Platform is no exception. In addition, as our platform has gained adoption, we are developing supporting infrastructure that spans the entire application development lifecycle for our users. In some ways, the API is now like an internal PaaS system that needs to provide a highly performant, scalable runtime; tools to address development tasks like revision control, testing and deployment; and operational insight into system health. The sections that follow explore these areas further.

Architecture

Here is a view into the API Server under the covers.

[1] Endpoint Controller routes endpoint requests to the corresponding Groovy scripts. It consults a registry to identify the mapping between an endpoint URI and its backing scripts.

[2] Script Manager handles requests for script management operations. The management API is exposed as a RESTful interface.

[3] Script source, compiled bytecode and related metadata are stored in a Cassandra cluster, replicated across all AWS regions.

[4] Script Cache is an in-memory cache that holds compiled bytecode fetched from Cassandra. This eliminates the Cassandra lookup during endpoint request processing. Scripts are compiled by the first few servers running a new API build, and the compiled bytecode is persisted in Cassandra. At startup, a server looks for persisted bytecode for a script before attempting to compile it in real time (a sketch of this logic appears after this list). Because deploying a set of canary instances is a standard step in our delivery workflow, the canary servers are the ones to incur the one-time penalty for script compilation. The cache is refreshed periodically to pick up new scripts.

[5] Admin Console and Deployment Tools are built on top of the script management API.
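A simplified sketch of the compile-or-fetch logic described in [4] follows; every collaborator here (bytecode store, compiler, cache) is an illustrative stand-in rather than an actual platform class.

// Hypothetical sketch of the startup path for a single script.
byte[] loadCompiledScript(String scriptId, String source,
                          def bytecodeStore, def compiler, def cache) {
    byte[] bytecode = bytecodeStore.getBytecode(scriptId)
    if (bytecode == null) {
        // Only the first servers on a new build (the canaries) pay this cost;
        // they persist the result for every other server.
        bytecode = compiler.compile(source)
        bytecodeStore.putBytecode(scriptId, bytecode)
    }
    cache.put(scriptId, bytecode)   // the cache is refreshed periodically to pick up new scripts
    return bytecode
}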

Script Development and Operations

Our experience in building a Delivery Pipeline for the API Server has influenced our thinking around the workflows for script management. Now that a part of the client code resides on the server in the form of scripts, we want to simplify the ways in which Device teams integrate script management activities into their workflows. Because technologies and release processes vary across teams, our goal is to provide a set of offerings from which they can pick and choose the tools that best suit their requirements.

The diagram below illustrates a sample script workflow and calls out the tools that can be used to support it. It is worth noting that such a workflow would represent just a part of a more comprehensive build, test and deploy process used for a client application.



Distributed Development

To get script developers started, we provide them with canned recipes to help with IDE setup and dependency management for the Java API and related libraries.  In order to facilitate testing of the script code, we have built a Script Test Framework based on JUnit and Mockito. The Test Framework looks for test methods within a script that have standard JUnit annotations, executes them and generates a standard JUnit result report. Tests can also be run against a live server to validate functionality in the scripts.
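As a sketch of what that looks like in practice, a script can carry its tests alongside its logic; the classes below are illustrative, but the JUnit and Mockito usage is standard.

import org.junit.Test
import static org.junit.Assert.assertEquals
import static org.mockito.Mockito.mock
import static org.mockito.Mockito.when

// An illustrative dependency the script talks to, and a tiny class under test.
interface GreetingSource {
    String name()
}

class GreetingEndpoint {
    String greet(GreetingSource source) {
        return "Hello, " + source.name() + "!"
    }
}

// The Script Test Framework scans the script for @Test methods, runs them,
// and produces a standard JUnit result report.
class GreetingEndpointTest {

    @Test
    void buildsGreetingFromTheBackendValue() {
        GreetingSource source = mock(GreetingSource)
        when(source.name()).thenReturn("Netflix")
        assertEquals("Hello, Netflix!", new GreetingEndpoint().greet(source))
    }
}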

Additionally, we have built a REPL tool to facilitate ad-hoc testing of small scripts or snippets of Groovy code that can be shared as samples, used for debugging, and so on.


Distributed Deployment

As mentioned earlier, the release cycles of the Device teams are decoupled from those of the API Team. Device teams have the ability to dynamically create new endpoints, update the scripts backing existing endpoints or delete endpoints as part of their releases. We provide command line tools built on top of our Endpoint Management API that can be used for all deployment related operations. Device teams use these tools to integrate script deployment activities with their automated build processes and manage the lifecycle of their endpoints. The tools also integrate with our internal auditing system to track production deployments.

Admin Operations & Insight

Just as the operation of the system is distributed across several teams, so is the responsibility of monitoring and maintaining system health. Our role as the platform provider is to equip our users with the appropriate level of insight and tooling so they can assume full ownership of their endpoints. The Admin Console provides users full visibility into real time health metrics, deployment activity, server resource consumption and traffic levels for their endpoints.








Engineers on the API team can get an aggregate view of the same data, as well as other metrics like script compilation activity that are crucial to track from a server health perspective. Relevant metrics are also integrated into our real time monitoring and alerting systems.


The screenshot to the left is from a top-level dashboard view that tracks script deployment activity.

Experiences and Lessons Learnt

Operating this platform with an increasing number of teams has taught us a lot and we have had several great wins!
  • Client application developers are able to tailor the number of network calls and the size of the payload to their applications. This results in more efficient client-side development and overall, an improved user experience for Netflix customers.
  • Distributed, decoupled API development has allowed us to increase our rate of innovation.
  • Response and recovery times for certain classes of bugs have gone from days or hours down to minutes. This is even more powerful in the case of devices that cannot easily be updated or require us to go through an update process with external partners.
  • Script deployments are inherently less risky than server deployments because the impact of the former is isolated to a certain class or subset of devices. This opens the door for increased nimbleness.
  • We are building an internal developer community around the API. This provides us an opportunity to promote collaboration and sharing of  resources, best practices and code across the various Device teams.

As expected, we have had our share of learnings as well.
  • The flexibility of our platform permits users to utilize the system in ways that might be different from those envisioned at design time. The strategy that several teams employed to manage their endpoints put undue stress on the API server in terms of increased system resources, and in one instance, caused a service outage. We had to react quickly with measures in the form of usage quotas and additional self-protect switches while we identified design changes to allow us to handle such use cases.
  • When we chose Cassandra as the repository for script code for the server, our goal was to have teams use their SCM tool of choice for managing script source code. Over time, we are finding ourselves building SCM-like features into our Management API and tools, as we work to address the needs of various teams. It has become clear to us that we need to offer a unified set of tools that cover deployment and SCM functionality.
  • The distributed development model combined with the dynamic nature of the scripts makes it challenging to understand system behavior and debug issues. RxJava introduces another layer of complexity in this regard because of its asynchronous nature. All of this highlights the need for detailed insights into scripts’ usage of the API.

We are actively working on solutions for the above and will follow up with another post when we are ready to share details.

Conclusion


The evolution of the Netflix API from a web services provider to a dynamic platform has allowed us to increase our rate of innovation while providing a high degree of flexibility and control to client application developers. We are investing in infrastructure to simplify the adoption and operation of this platform. We are also continuing to evolve the platform itself as we find new use cases for it. If this type of work excites you, reach out to us - we are always looking to add talented engineers to our team!

Monday, November 18, 2013

Preparing the Netflix API for Deployment

At Netflix, we are committed to bringing new features and product enhancements to customers rapidly and frequently. Consequently, dozens of engineering teams are constantly innovating on their services, resulting in a rate of change to the overall service that is vast and unending. Because of our appetite for providing such improvements to our members, it is critical for us to maintain a Software Delivery pipeline that allows us to deploy changes to our production environments in an easy, seamless, and quick way, with minimal risk.

Wednesday, August 14, 2013

Deploying the Netflix API

by Ben Schmaus

As described in previous posts (“Embracing the Differences” and “Optimizing the Netflix API”), the Netflix API serves as an integration hub that connects our device UIs to a distributed network of data services.  In supporting this ecosystem, the API needs to integrate an ever-evolving set of features from these services and expose them to devices.  The faster these features can be delivered through the API, the faster they can get in front of customers and improve the user experience.

Along with the number of backend services and device types (we’re now topping 1,000 different device types), the rate of change in the system is increasing, resulting in the need for faster API development cycles. Furthermore, as Netflix has expanded into international markets, the infrastructure and team supporting the API has grown.  To meet product demands, scale the team’s development, and better manage our cloud infrastructure, we've had to adapt our process and tools for testing and deploying the API.


With the context above in mind, this post presents our approach to software delivery and some of the techniques we've developed to help us get features into production faster while minimizing risk to quality of service.

Moving Toward Continuous Delivery

Before moving on it’s useful to draw a quick distinction between continuous deployment and continuous delivery. Per the Continuous Delivery book, if you’re practicing continuous deployment then you’re necessarily also practicing continuous delivery, but the reverse doesn’t hold true.  Continuous deployment extends continuous delivery and results in every build that passes automated test gates being deployed to production.  Continuous delivery requires an automated deployment infrastructure, but the decision to deploy is made based on business need rather than simply deploying every commit to prod.  We may pursue continuous deployment as an optimization to continuous delivery, but our current focus is to enable the latter such that any release candidate can be deployed to prod quickly, safely, and in an automated way.


To meet demand for new features and to make a growing infrastructure easier to manage, we’ve been overhauling our dev, build, test, and deploy pipeline with an eye toward continuous delivery.  Being able to deploy features as they’re developed gets them in front of Netflix subscribers as quickly as possible rather than having them “sit on the shelf.”  And deploying smaller sets of features more frequently reduces the number of changes per deployment, which is an inherent benefit of continuous delivery and helps mitigate risk by making it easier to identify and triage problems if things go south during a deployment.


The foundational concepts underlying our delivery system are simple:  automation and insight.  By applying these ideas to our deployment pipeline we can strike an effective balance between velocity and stability.
Automation - Any process requiring people to execute manual steps repetitively will get you into trouble on a long enough timeline.  Any manual step that can be done by a human can be automated by a computer; automation provides consistency and repeatability.  It’s easy for manual steps to creep into a process over time and so constant evaluation is required to make sure sufficient automation is in place.

Insight - You can't support, understand, and improve what you can't see.  Insight applies both to the tools we use to develop and deploy the API as well as the monitoring systems we use to track the health of our running applications.  For example, being able to trace code as it flows from our SCM systems through various environments (test, stage, prod, etc.) and quality gates (unit tests, regression tests, canary, etc.) on its way to production helps us distribute deployment and ops responsibilities across the team in a scalable way.  Tools that surface feedback about the state of our pipeline and running apps give us the confidence to move fast and help us quickly identify and fix issues when things (inevitably) break.


Development & Deployment Flow

The following diagram illustrates the logical flow of code from feature inception to global deployment to production clusters across all of our AWS regions.  Each phase in the flow provides feedback about the “goodness” of the code, with each successive step providing more insight into and confidence about feature correctness and system stability.




Taking a closer look at our continuous integration and deploy flow, we have the diagram below, which pretty closely outlines the flow we follow today.  Most of the pipeline is automated, and tooling gives us insight into code as it moves from one state to another.






Branches


Currently we maintain 3 long-lived branches (though we’re exploring approaches to cut down the number of branches, with a single master being a likely longer-term goal) that serve different purposes and get deployed to different environments.  The pipeline is fully automated with the exception of weekly pushes from the release branch, which require an engineer to kick off the global prod deployment.


Test branch - used to develop features that may take several dev/deploy/test cycles and require integration testing and coordination of work across several teams for an extended period of time (e.g., more than a week).  The test branch gets auto deployed to a test environment, which varies in stability over time as new features undergo development and early stage integration testing.  When a developer has a feature that’s a candidate for prod they manually merge it to the release branch.


Release branch - serves as the basis for weekly releases.  Commits to the release branch get auto-deployed to an integration environment in our test infrastructure and a staging environment in our prod infrastructure.  The release branch is generally in a deployable state but sometimes goes through a short cycle of instability for a few days at a time while features and libraries go through integration testing.  Prod deployments from the release branch are kicked off by someone on our delivery team and are fully automated after the initial action to start the deployment.


Prod branch - when a global deployment of the release branch (see above) finishes it’s merged into the prod branch, which serves as the basis for patch/daily pushes.  If a developer has a feature that's ready for prod and they don't need it to go through the weekly flow then they can commit it directly to the prod branch, which is kept in a deployable state.  Commits to the prod branch are auto-merged back to release and are auto-deployed to a canary cluster taking a small portion of live traffic.  If the result of the canary analysis phase is a “go” then the code is auto deployed globally.

Confidence in the Canary

The basic idea of a canary is that you run new code on a small subset of your production infrastructure, for example, 1% of prod traffic, and you see how the new code (the canary) compares to the old code (the baseline).

Canary analysis used to be a manual process for us where someone on the team would look at graphs and logs on our baseline and canary servers to see how closely the metrics (HTTP status codes, response times, exception counts, load avg, etc.) matched.

Needless to say, this approach doesn't scale when you're deploying several times a week to clusters in multiple AWS regions.  So we developed an automated process that compares 1000+ metrics between our baseline and canary code and generates a confidence score that gives us a sense for how likely the canary is to be successful in production.  The canary analysis process also includes an automated squeeze test for each canary Amazon Machine Image (AMI) that determines the throughput “sweet spot” for that AMI in requests per second.  The throughput number, along with server start time (instance launch to taking traffic), is used to configure auto scaling policies.
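Conceptually, the scoring boils down to comparing each metric between canary and baseline and rolling the results up into a single number, roughly like the simplified sketch below (the real analysis weights and correlates metrics in far more sophisticated ways):

// Simplified sketch: the fraction of metrics whose canary value stays within a
// tolerance of the baseline value, expressed as a 0-100 confidence score.
double confidenceScore(Map<String, Double> baseline, Map<String, Double> canary, double tolerance = 0.2) {
    def comparable = baseline.keySet().intersect(canary.keySet())
    if (comparable.isEmpty()) {
        return 0.0
    }
    def matching = comparable.count { name ->
        def base = baseline[name]
        def cand = canary[name]
        base == 0 ? cand == 0 : Math.abs(cand - base) / Math.abs(base) <= tolerance
    }
    return 100.0 * matching / comparable.size()
}

// A score below the "go" threshold (generally 95) blocks the automated global deploy.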

The canary analyzer generates a report for each AMI that includes the score and displays the total metric space in a scannable grid.  For commits to the prod branch (described above), canaries that get a high-enough confidence score after 8 hours are automatically deployed globally across all AWS regions.

The screenshots below show excerpts from a canary report.  
If the score is too low (< 95 generally means a "no go", as is the case with the canary below), the report helps guide troubleshooting efforts by providing a starting point for deeper investigation. This is where the metrics grid, shown below, helps out. The grid puts more important metrics in the upper left and less important metrics in the lower right.  Green means the metric correlated between baseline and canary.  Blue means the canary has a lower value for a metric ("cold") and red means the canary has a higher value than the baseline for a metric ("hot").




Along with the canary analysis report we automatically generate a source diff report, cross-linked with Jira (if commit messages contain Jira IDs), of code changes in the AMI, and a report showing library and config changes between the baseline and canary.  These artifacts increase our visibility into what’s changing between deployments.

Multi-region Deployment Automation

Over the past two years we've expanded our deployment footprint from 1 to 3 AWS regions supporting different markets around the world.  Running clusters in geographically disparate regions has driven the need for more comprehensive automation.  We use Asgard to deploy the API (with some traffic routing help from Zuul), and we’ve switched from manual deploys using the Asgard GUI to driving deployments programmatically via Asgard’s API.

The basic technique we use to deploy new code into production is the "red/black push."  Here's a summary of how it works.

1) Go to the cluster - which is a set of auto-scaling groups (ASGs) - running the application you want to update, like the API, for example.

2) Find the AMI you want to deploy, look at the number of instances running in the baseline ASG, and launch a new ASG running the selected AMI with enough instances to handle traffic levels at that time (for the new ASG we typically use the number of instances in the baseline ASG plus 10%).  When the instances in the new ASG are up and taking traffic, the new and baseline code is running side by side (ie, "red/red").

3) Disable traffic to the baseline ASG (ie, make it "black"), but keep the instances online in case a rollback is needed.  At this point you'll have your cluster in a "red/black" state with the baseline code being "black" and the new code being "red." If a rollback is needed, since the "black" ASG still has all its instances online (just not taking traffic) you can easily enable traffic to it and then disable the new ASG to quickly get back to your starting state.  Of course, depending on when the rollback happens you may need to adjust server capacity of the baseline ASG.

4) If the new code looks good, delete the baseline ASG and its instances altogether.  The new AMI is now your baseline.

The following picture illustrates the basic flow.



Going through the steps above manually for many clusters and regions is painful and error prone.  To make deploying new code into production easier and more reliable, we've built additional automation on top of Asgard to push code to all of our regions in a standardized, repeatable way.  Our deployment automation tooling is coded to be aware of peak traffic times in different markets and to execute deployments outside of peak times.  Rollbacks, if needed, are also automated.
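In outline, the tooling drives the same four steps programmatically for each region; the asgard client object and its methods below are hypothetical placeholders, not Asgard's actual API.

// Hypothetical orchestration of a red/black push for one cluster in one region.
void redBlackPush(def asgard, String cluster, String newAmi) {
    def baselineAsg = asgard.activeAsgFor(cluster)
    int capacity = Math.ceil(baselineAsg.instanceCount * 1.1) as int   // baseline + 10%

    def newAsg = asgard.createAsg(cluster, newAmi, capacity)           // "red/red": both taking traffic
    asgard.waitUntilInstancesTakeTraffic(newAsg)

    asgard.disableTraffic(baselineAsg)                                 // "red/black": baseline kept for rollback

    if (asgard.looksHealthy(newAsg)) {
        asgard.deleteAsg(baselineAsg)                                  // the new AMI becomes the baseline
    } else {
        asgard.enableTraffic(baselineAsg)                              // rollback path
        asgard.disableTraffic(newAsg)
    }
}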

Keep the Team Informed

With all this deployment activity going on, it’s important to keep the team informed about what’s happening in production.   We want to make it easy for anyone on the team to know the state of the pipeline and what’s running in prod, but we also don’t want to spam people with messages that are filtered out of sight.  To complement our dashboard app, we run an XMPP bot that sends a message to our team chatroom when new code is pushed to a production cluster.  The bot sends a message when a deployment starts and when it finishes.  The topic of the chatroom has info about the most recent push event and the bot maintains a history of pushes that can be accessed by talking to the bot.



Move Fast, Fail Fast (and Small)

Our goal is to provide the ability for any engineer on the team to easily get new functionality running in production while keeping the larger team informed about what’s happening, and without adversely affecting system stability.  By developing comprehensive deployment automation and exposing feedback about the pipeline and code flowing through it we’ve been able to deploy more easily and address problems earlier in the deploy cycle.  Failing on a few canary machines is far superior to having a systemic failure across an entire fleet of servers.  

Even with the best tools, building software is hard work.  We're constantly looking at what's hard to do and experimenting with ways to make it easier.  Ultimately, great software is built by great engineering teams.  If you're interested in helping us build the Netflix API, take a look at some of our open roles.


Wednesday, June 12, 2013

Announcing Zuul: Edge Service in the Cloud

The Netflix streaming application is a complex array of intertwined systems that work together to seamlessly provide our customers a great experience. The Netflix API is the front door to that system, supporting over 1,000 different device types and handling over 50,000 requests per second during peak hours. We are continually evolving by adding new features every day.  Our user interface teams, meanwhile, continuously push changes to server-side client adapter scripts to support new features and AB tests.  New AWS regions are deployed to and catalogs are added for new countries to support international expansion.  To handle all of these changes, as well as other challenges in supporting a complex and high-scale system, a robust edge service that enables rapid development, great flexibility, expansive insights, and resiliency is needed.

Today, we are pleased to introduce Zuul, our answer to these challenges and the latest addition to our open source suite of software. Although Zuul is an edge service originally designed to front the Netflix API, it is now being used in a variety of ways by a number of systems throughout Netflix.


Zuul in Netflix's Cloud Architecture


How Does Zuul Work?

At the center of Zuul is a series of filters that are capable of performing a range of actions during the routing of HTTP requests and responses.  The following are the key characteristics of a Zuul filter:
  • Type: most often defines the stage during the routing flow when the filter will be applied (although it can be any custom string)
  • Execution Order: applied within the Type, defines the order of execution across multiple filters
  • Criteria: the conditions required in order for the filter to be executed
  • Action: the action to be executed if the Criteria are met
Here is an example of a simple filter that delays requests from a malfunctioning device in order to distribute the load on our origin:


class DeviceDelayFilter extends ZuulFilter {

    def static Random rand = new Random()

    @Override
    String filterType() {
        return 'pre'
    }

    @Override
    int filterOrder() {
        return 5
    }

    @Override
    boolean shouldFilter() {
        return RequestContext.getCurrentContext().getRequest().
            getParameter("deviceType")?.equals("BrokenDevice") ?: false
    }

    @Override
    Object run() {
        sleep(rand.nextInt(20000)) // sleep for a random period between 0 and 20 seconds
    }
}

Zuul provides a framework to dynamically read, compile, and run these filters. Filters do not communicate with each other directly - instead they share state through a RequestContext which is unique to each request.

Filters are currently written in Groovy, although Zuul supports any JVM-based language. The source code for each filter is written to a specified set of directories on the Zuul server that are periodically polled for changes. Updated filters are read from disk, dynamically compiled into the running server, and are invoked by Zuul for each subsequent request.

Zuul Core Architecture

There are several standard filter types that correspond to the typical lifecycle of a request:
  • PRE filters execute before routing to the origin. Examples include request authentication, choosing origin servers, and logging debug info.
  • ROUTING filters handle routing the request to an origin. This is where the origin HTTP request is built and sent using Apache HttpClient or Netflix Ribbon.
  • POST filters execute after the request has been routed to the origin.  Examples include adding standard HTTP headers to the response, gathering statistics and metrics, and streaming the response from the origin to the client.
  • ERROR filters execute when an error occurs during one of the other phases.
Request Lifecycle
Alongside the default filter flow, Zuul allows us to create custom filter types and execute them explicitly.  For example, Zuul has a STATIC type that generates a response within Zuul instead of forwarding the request to an origin.  
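For illustration, a static filter for something like a health check might look roughly like this (the StaticResponseFilter base class and its contract are inferred from the zuul-netflix-webapp filters and should be treated as an approximation):

// Illustrative STATIC filter: answer a health-check URI from Zuul itself
// instead of forwarding the request to an origin.
class HealthCheckFilter extends StaticResponseFilter {

    @Override
    String uri() {
        return "/healthcheck"
    }

    @Override
    String responseBody() {
        return "<health>ok</health>"
    }
}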


How We Use Zuul

There are many ways in which Zuul helps us run the Netflix API and the overall Netflix streaming application.  Here is a short list of some of the more common examples, and for some we will go into more detail below:
  • Authentication
  • Insights
  • Stress Testing
  • Canary Testing
  • Dynamic Routing
  • Load Shedding
  • Security
  • Static Response handling
  • Multi-Region Resiliency 


Insights

Zuul gives us a lot of insight into our systems, in part by making use of other Netflix OSS components.  Hystrix is used to wrap calls to our origins, which allows us to shed and prioritize traffic when issues occur.  Ribbon is our client for all outbound requests from Zuul, which provides detailed information into network performance and errors, as well as handles software load balancing for even load distribution.  Turbine aggregates fine-grained metrics in real-time so that we can quickly observe and react to problems.  Archaius handles configuration and gives the ability to dynamically change properties. 

Because Zuul can add, change, and compile filters at run-time, system behavior can be quickly altered. We add new routes, assign authorization access rules, and categorize routes all by adding or modifying filters. And when unexpected conditions arise, Zuul has the ability to quickly intercept requests so we can explore, workaround, or fix the problem. 

The dynamic filtering capability of Zuul allows us to find and isolate problems that would normally be difficult to locate among our large volume of requests.  A filter can be written to route a specific customer or device to a separate API cluster for debugging.  This technique was used when a new page from the website needed tuning.  Performance problems, as well as unexplained errors, were observed. It was difficult to debug the issues because the problems were only happening for a small set of customers. By isolating the traffic to a single instance, patterns and discrepancies in the requests could be seen in real time. Zuul has what we call a “SurgicalDebugFilter”. This is a special “pre” filter that will route a request to an isolated cluster if the patternMatches() criteria are met.  Adding this filter to match requests for the new page allowed us to quickly identify and analyze the problem.  Prior to using Zuul, Hadoop was being used to query through billions of logged requests to find the several thousand requests for the new page.  With Zuul, we were able to reduce the problem to a search through a relatively small log file on a few servers and observe behavior in real time.

The following is an example of the SurgicalDebugFilter that is used to route matched requests to a debug cluster:
class SharpDebugFilter extends SurgicalDebugFilter {

    private static final Set<String> DEVICE_IDS = ["XXX", "YYY", "ZZZ"]

    @Override
    boolean patternMatches() {
        final String id = HTTPRequestUtils.getInstance().
            getValueFromRequestElements("deviceId")
        return DEVICE_IDS.contains(id)
    }
}
In addition to dynamically re-routing requests that match specified criteria, we have an internal system, built on top of Zuul and Turbine, that allows us to display a real-time streaming log of all matching requests/responses across our entire cluster.  This internal system allows us to quickly find patterns of anomalous behavior, or simply observe that some segment of traffic is behaving as expected (by asking questions such as “how many PS3 API requests are coming from São Paulo?”).


Stress Testing 

Gauging the performance and capacity limits of our systems is important for us to predict our EC2 instance demands, tune our autoscaling policies, and keep track of general performance trends as new features are added.  An automated process that uses dynamic Archaius configurations within a Zuul filter steadily increases the traffic routed to a small cluster of origin servers. As the instances receive more traffic, their performance characteristics and capacity are measured. This informs us of how many EC2 instances will be needed to run at peak, whether our autoscaling policies need to be modified, and whether or not a particular build has the required performance characteristics to be pushed to production.


Multi-Region Resiliency

Zuul is central to our multi-region ELB resiliency project called Isthmus. As part of Isthmus, Zuul is used to bridge requests from the west coast cloud region to the east coast to help us have multi-region redundancy in our ELBs for our critical domains. Stay tuned for a tech blog post about our Isthmus initiative. 


Zuul OSS

Today, we are open sourcing Zuul as a few different components:
  • zuul-core - A library containing a set of core features.
  • zuul-netflix - An extension library using many Netflix OSS components:
    • Servo for insights, metrics, monitoring
    • Hystrix for real time metrics with Turbine
    • Eureka for instance discovery
    • Ribbon for routing
    • Archaius for real-time configuration
    • Astyanax for filter persistence in Cassandra
  • zuul-filters - Filters to work with zuul-core and zuul-netflix libraries 
  • zuul-webapp-simple -  A simple example of a web application built on zuul-core including a few basic filters
  • zuul-netflix-webapp - A web application putting zuul-core, zuul-netflix, and zuul-filters together.
Netflix OSS libraries in Zuul
Putting everything together, we are also providing a web application built on zuul-core and zuul-netflix.  The application also provides many helpful filters for things such as:
  • Weighted load balancing to balance a percentage of load to a certain server or cluster for capacity testing
  • Request debugging
  • Routing filters for Apache HttpClient and Netflix Ribbon
  • Statistics collecting
We hope that this project will be useful for your applications and that it demonstrates the strength of our open source projects when Zuul is used as the glue across them. We encourage you to contribute to Zuul to make it even better. Also, if this type of technology is as exciting to you as it is to us, please see current openings on our team: jobs

Mikey Cohen - API Platform
Matt Hawthorne - API Platform