
Wednesday, June 12, 2013

Announcing Zuul: Edge Service in the Cloud

The Netflix streaming application is a complex array of intertwined systems that work together to seamlessly provide our customers a great experience. The Netflix API is the front door to that system, supporting over 1,000 different device types and handling over 50,000 requests per second during peak hours. We are continually evolving: we add new features every day, our user interface teams continuously push changes to server-side client adapter scripts to support new features and A/B tests, and we deploy to new AWS regions and add catalogs for new countries to support international expansion. To handle all of these changes, as well as the other challenges of supporting a complex, high-scale system, we need a robust edge service that enables rapid development, great flexibility, expansive insights, and resiliency.

Today, we are pleased to introduce Zuul, our answer to these challenges and the latest addition to our open source suite of software. Although Zuul is an edge service originally designed to front the Netflix API, it is now being used in a variety of ways by a number of systems throughout Netflix.


Zuul in Netflix's Cloud Architecture


How Does Zuul Work?

At the center of Zuul is a series of filters that are capable of performing a range of actions during the routing of HTTP requests and responses.  The following are the key characteristics of a Zuul filter:
  • Type: most often defines the stage during the routing flow when the filter will be applied (although it can be any custom string)
  • Execution Order: applied within the Type, defines the order of execution across multiple filters
  • Criteria: the conditions required in order for the filter to be executed
  • Action: the action to be executed if the Criteria are met
Here is an example of a simple filter that delays requests from a malfunctioning device in order to distribute the load on our origin:


class DeviceDelayFilter extends ZuulFilter {

    static Random rand = new Random()

    @Override
    String filterType() {
        return 'pre'
    }

    @Override
    int filterOrder() {
        return 5
    }

    @Override
    boolean shouldFilter() {
        return RequestContext.getCurrentContext().getRequest().
            getParameter("deviceType")?.equals("BrokenDevice") ?: false
    }

    @Override
    Object run() {
        sleep(rand.nextInt(20000)) // Sleep for a random duration
                                   // between 0 and 20 seconds
    }
}

Zuul provides a framework to dynamically read, compile, and run these filters. Filters do not communicate with each other directly; instead, they share state through a RequestContext, which is unique to each request.
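The shared-state idea can be illustrated with a simplified, hypothetical stand-in for RequestContext (plain Java rather than Groovy, and not Zuul's actual class): one mutable map per request thread, visible to every filter handling that request.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for Zuul's RequestContext: one mutable
// map per request thread, visible to every filter handling that request.
class SimpleRequestContext {
    private static final ThreadLocal<Map<String, Object>> ctx =
            ThreadLocal.withInitial(HashMap::new);

    static void set(String key, Object value) { ctx.get().put(key, value); }
    static Object get(String key) { return ctx.get().get(key); }
    static void clear() { ctx.remove(); } // called once the request completes
}
```

A "pre" filter can record a decision (say, which origin was chosen) and a later "post" filter can read it without either filter knowing about the other.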

Filters are currently written in Groovy, although Zuul supports any JVM-based language. The source code for each filter is written to a specified set of directories on the Zuul server that are periodically polled for changes. Updated filters are read from disk, dynamically compiled into the running server, and are invoked by Zuul for each subsequent request.

Zuul Core Architecture

There are several standard filter types that correspond to the typical lifecycle of a request:
  • PRE filters execute before routing to the origin. Examples include request authentication, choosing origin servers, and logging debug info.
  • ROUTING filters handle routing the request to an origin. This is where the origin HTTP request is built and sent using Apache HttpClient or Netflix Ribbon.
  • POST filters execute after the request has been routed to the origin.  Examples include adding standard HTTP headers to the response, gathering statistics and metrics, and streaming the response from the origin to the client.
  • ERROR filters execute when an error occurs during one of the other phases.
Request Lifecycle
Alongside the default filter flow, Zuul allows us to create custom filter types and execute them explicitly.  For example, Zuul has a STATIC type that generates a response within Zuul instead of forwarding the request to an origin.  
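The lifecycle above can be sketched as a small filter runner. This is a hypothetical Java condensation (MiniFilter and FilterRunner are illustrative names, not Zuul's classes) of how typed, ordered filters with criteria and actions could be walked through the pre, route, and post phases.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical minimal filter model mirroring the four characteristics
// described earlier: type, execution order, criteria, and action.
abstract class MiniFilter {
    abstract String filterType();          // e.g. "pre", "route", "post"
    abstract int filterOrder();            // order within a type
    abstract boolean shouldFilter();       // criteria
    abstract void run(List<String> trace); // action: records its name here
}

class FilterRunner {
    // Run all filters of one type, lowest order first, skipping those
    // whose criteria are not met.
    static void runByType(List<MiniFilter> filters, String type, List<String> trace) {
        filters.stream()
                .filter(f -> f.filterType().equals(type) && f.shouldFilter())
                .sorted(Comparator.comparingInt(MiniFilter::filterOrder))
                .forEach(f -> f.run(trace));
    }

    // Walk a request through the standard lifecycle phases in order.
    static List<String> handleRequest(List<MiniFilter> filters) {
        List<String> trace = new ArrayList<>();
        for (String phase : List.of("pre", "route", "post")) {
            runByType(filters, phase, trace);
        }
        return trace;
    }
}
```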


How We Use Zuul

There are many ways in which Zuul helps us run the Netflix API and the overall Netflix streaming application. Here is a short list of some of the more common examples; for some of them, we go into more detail below:
  • Authentication
  • Insights
  • Stress Testing
  • Canary Testing
  • Dynamic Routing
  • Load Shedding
  • Security
  • Static Response handling
  • Multi-Region Resiliency 


Insights

Zuul gives us a lot of insight into our systems, in part by making use of other Netflix OSS components. Hystrix is used to wrap calls to our origins, which allows us to shed and prioritize traffic when issues occur. Ribbon is our client for all outbound requests from Zuul, which provides detailed information into network performance and errors, as well as handling software load balancing for even load distribution. Turbine aggregates fine-grained metrics in real time so that we can quickly observe and react to problems. Archaius handles configuration and gives us the ability to change properties dynamically.

Because Zuul can add, change, and compile filters at run-time, system behavior can be quickly altered. We add new routes, assign authorization access rules, and categorize routes all by adding or modifying filters. And when unexpected conditions arise, Zuul has the ability to quickly intercept requests so we can explore, workaround, or fix the problem. 

The dynamic filtering capability of Zuul allows us to find and isolate problems that would normally be difficult to locate among our large volume of requests. A filter can be written to route a specific customer or device to a separate API cluster for debugging. This technique was used when a new page from the website needed tuning. Performance problems, as well as unexplained errors, were observed, but the issues were difficult to debug because they were only happening for a small set of customers. By isolating the traffic to a single instance, patterns and discrepancies in the requests could be seen in real time. Zuul has what we call a “SurgicalDebugFilter”: a special “pre” filter that routes a request to an isolated cluster if patternMatches() returns true. Adding this filter to match for the new page allowed us to quickly identify and analyze the problem. Prior to using Zuul, Hadoop was being used to query through billions of logged requests to find the several thousand requests for the new page. With Zuul, we were able to reduce the problem to a search through a relatively small log file on a few servers and observe behavior in real time.

The following is an example of the SurgicalDebugFilter that is used to route matched requests to a debug cluster:
class SharpDebugFilter extends SurgicalDebugFilter {
    private static final Set<String> DEVICE_IDS = ["XXX", "YYY", "ZZZ"]

    @Override
    boolean patternMatches() {
        String id = HTTPRequestUtils.getInstance().
            getValueFromRequestElements("deviceId")
        return DEVICE_IDS.contains(id)
    }
}
In addition to dynamically re-routing requests that match specified criteria, we have an internal system, built on top of Zuul and Turbine, that displays a real-time streaming log of all matching requests and responses across our entire cluster. This internal system allows us to quickly find patterns of anomalous behavior, or simply confirm that some segment of traffic is behaving as expected (by asking questions such as “how many PS3 API requests are coming from São Paulo?”).


Stress Testing 

Gauging the performance and capacity limits of our systems is important for us to predict our EC2 instance demands, tune our autoscaling policies, and keep track of general performance trends as new features are added.  An automated process that uses dynamic Archaius configurations within a Zuul filter steadily increases the traffic routed to a small cluster of origin servers. As the instances receive more traffic, their performance characteristics and capacity are measured. This informs us of how many EC2 instances will be needed to run at peak, whether our autoscaling policies need to be modified, and whether or not a particular build has the required performance characteristics to be pushed to production.
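The core routing decision in such a squeeze test can be sketched in a few lines. This is a hypothetical illustration (SqueezeRouter and the cluster names are invented for this sketch): a stable request id is hashed into a bucket in [0, 100), and roughly `percent` percent of traffic lands on the squeeze cluster. In Zuul, the percentage would come from a dynamic Archaius property so it can be raised steadily without a redeploy.

```java
// Hypothetical sketch of a squeeze-test routing decision: send roughly
// `percent` percent of traffic to a small squeeze cluster by hashing a
// stable request id into [0, 100). Raising `percent` gradually increases
// the load on the squeeze cluster while its performance is measured.
class SqueezeRouter {
    static String chooseCluster(String requestId, int percent) {
        int bucket = Math.floorMod(requestId.hashCode(), 100);
        return bucket < percent ? "squeeze-cluster" : "normal-cluster";
    }
}
```

Because the bucket is derived from a stable id rather than a random draw, a given request id is routed consistently for a given percentage.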


Multi-Region Resiliency

Zuul is central to our multi-region ELB resiliency project called Isthmus. As part of Isthmus, Zuul is used to bridge requests from the west coast cloud region to the east coast to help us have multi-region redundancy in our ELBs for our critical domains. Stay tuned for a tech blog post about our Isthmus initiative. 


Zuul OSS

Today, we are open sourcing Zuul as a few different components:
  • zuul-core - A library containing a set of core features.
  • zuul-netflix - An extension library using many Netflix OSS components:
    • Servo for insights, metrics, monitoring
    • Hystrix for real time metrics with Turbine
    • Eureka for instance discovery
    • Ribbon for routing
    • Archaius for real-time configuration
    • Astyanax for filter persistence in Cassandra
  • zuul-filters - Filters to work with zuul-core and zuul-netflix libraries 
  • zuul-webapp-simple -  A simple example of a web application built on zuul-core including a few basic filters
  • zuul-netflix-webapp - A web application putting zuul-core, zuul-netflix, and zuul-filters together.
Netflix OSS libraries in Zuul
Putting everything together, we are also providing a web application built on zuul-core and zuul-netflix.  The application also provides many helpful filters for things such as:
  • Weighted load balancing to balance a percentage of load to a certain server or cluster for capacity testing
  • Request debugging
  • Routing filters for Apache HttpClient and Netflix Ribbon
  • Statistics collecting
We hope that this project will be useful for your applications, will demonstrate the strength of our open source projects when Zuul is used as glue across them, and will encourage you to contribute to Zuul to make it even better. Also, if this type of technology is as exciting to you as it is to us, please see current openings on our team: jobs

Mikey Cohen - API Platform
Matt Hawthorne - API Platform



Monday, March 11, 2013

Introducing the first NetflixOSS Recipe: RSS Reader

by Prasanna Padmanabhan, Shashi Madappa, Kedar Sadekar and Chris Fregly

Over the past year, Netflix has open sourced many of its components such as Hystrix, Eureka, Servo, Astyanax, Ribbon, etc.


While we continue to open source more of our components, it is also useful to open source a set of applications that tie these components together. This is the first in a series of “Netflix OSS recipes” that we intend to open source in the coming year, showcasing the various components and how they work in conjunction with each other. While we expect many users to cherry-pick parts of the Netflix OSS stack, we hope to show how using some of the Netflix OSS components in unison can be very advantageous.

Our goal is to illustrate the power of the Netflix OSS platform by showing real-life examples. We hope to increase adoption by building out the Netflix OSS stack, increase awareness by holding more Netflix OSS meetups, and lower the barriers to entry by working on push-button deployments of our recipes in the coming months.


Netflix OSS Recipe Name: RSS Reader application

This document explains how to build a simple RSS Reader application (also commonly referred to as a news aggregator). As defined in Wikipedia, an RSS reader application does the following:

“In computing, a news aggregator, also termed a feed aggregator, feed reader, news reader, RSS reader or simply aggregator, is client software or a Web application which aggregates syndicated web content such as news headlines, blogs, podcasts, and video blogs (vlogs) in one location for easy viewing.”




The source code for this RSS recipe application can be found on Github. This application uses the following open source components:


  • Archaius: Dynamic configurations properties client.
  • Astyanax: Cassandra client and pattern library.
  • Blitz4j: Non-blocking logging.
  • Eureka: Service registration and discovery.
  • Governator: Guice based dependency injection.
  • Hystrix: Dependency fault tolerance.
  • Karyon: Base server for inbound requests.
  • Ribbon: REST client for outbound requests.
  • Servo: Metrics collection.


RSS Reader Recipes Architecture




Recipes RSS Reader is composed of the following three major components:

Eureka Client/Server

All middle tier instances register with Eureka upon startup using a unique logical name. The edge service later uses that logical name to discover the available middle tier instances.
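The register-then-discover pattern can be illustrated with a toy in-process registry (this is not Eureka's real API; MiniRegistry and its methods are invented for this sketch): instances register under a logical name, and the edge service queries by that same name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy stand-in for Eureka-style registration and discovery by logical
// name: middle tier instances register under a name, and the edge
// service looks up live instances by that same name.
class MiniRegistry {
    private static final Map<String, Set<String>> instances = new ConcurrentHashMap<>();

    static void register(String logicalName, String hostPort) {
        instances.computeIfAbsent(logicalName, k -> ConcurrentHashMap.newKeySet())
                 .add(hostPort);
    }

    static List<String> discover(String logicalName) {
        return new ArrayList<>(instances.getOrDefault(logicalName, Set.of()));
    }
}
```

Real Eureka additionally handles heartbeats, instance expiry, and replication across registry servers; the logical-name lookup is the part the recipe relies on.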

RSS Middle Tier Service

The RSS middle tier is responsible for fetching the contents of RSS feeds from external feed publishers, parsing the feeds, and returning the data via REST entry points. Ribbon’s HTTP client is used to fetch the RSS feeds from external publishers. This tier is also responsible for persisting the user’s RSS subscriptions into Cassandra or into an in-memory hash map; Astyanax is used to persist data into Cassandra. The choice of data store, Cassandra host configurations, retry policies, etc. are specified in a properties file, which is accessed using Archaius. Latencies and success and failure rates of the calls made to Cassandra and external publishers are captured using Servo. The base container for this middle tier is built on top of Karyon, and all log events are captured using Blitz4j.
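The parsing step alone can be sketched with the JDK's built-in XML parser (RssTitles is an illustrative name, not a class from the recipe; the real middle tier fetches feeds over HTTP via Ribbon and persists subscriptions via Astyanax):

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of RSS parsing only: pull the <title> of each <item>
// out of a raw RSS document.
class RssTitles {
    static List<String> titles(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
            List<String> out = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                out.add(item.getElementsByTagName("title").item(0).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException("bad RSS document", e);
        }
    }
}
```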

RSS Edge Service

Eureka is used to discover the middle tier instances that can fetch the RSS feeds the user has subscribed to. Hystrix is used to provide greater tolerance of latency and failures when communicating with the middle tier service. The base container for this edge service is also built on top of Karyon. A simple UI, shown below, is provided by the edge service to add, delete, and display RSS feeds.
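The failure-tolerance idea can be reduced to its essence in plain Java (this is a conceptual sketch, not Hystrix's API; real Hystrix also adds timeouts, circuit breaking, thread isolation, and metrics): run the primary call to the middle tier, and fall back to a safe default when it fails.

```java
import java.util.function.Supplier;

// Plain-Java sketch of the fallback behavior Hystrix provides here:
// run the primary call, return a safe default if it throws.
class FallbackCall {
    static <T> T withFallback(Supplier<T> primary, T fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallback; // e.g. an empty or cached feed list
        }
    }
}
```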


The Hystrix dashboard for the Recipes RSS application during a sample run is shown below:



Getting Started

Our Github page for the Recipes RSS Reader has instructions for building and running the application. Full documentation is available on the wiki pages. The default configurations that come with this application allow all of these services to run on a single machine.


Future Roadmap

In the coming months, we intend to open source other commonly used recipes that share best practices on how to integrate the various bits of the Netflix OSS stack to build highly scalable and reliable cloud-based applications.

Our next NetflixOSS Meetup will be on March 13th, and we are going to announce some exciting news at the event. Follow us on Twitter at @NetflixOSS.