
Thursday, February 14, 2013

Netflix Queue: Data migration for a high volume web application



There comes a time in the life of most systems serving data when the data must be migrated to a more reliable, scalable, and higher-performance data store while maintaining or improving consistency, latency, and efficiency. This post explains the data migration technique we used at Netflix to migrate users’ Queue data between two different distributed NoSQL storage systems.

What is the Netflix Queue?

The Netflix Queue lets you keep and maintain a list of the movies & TV shows you want to watch on your devices and computers. 

Previous Implementation 

Netflix embraces a Service Oriented Architecture (SOA) composed of many small, fine-grained services that do one thing and do it well. In that vein, the Queue service is used to fetch and maintain a user’s Queue. For every Netflix user, a list of ordered videos and other metadata about when and where each video was added to their Queue is persisted in the AWS cloud, with SimpleDB as the source of truth. Data in SimpleDB is sharded across multiple domains (similar to RDBMS tables) for performance and scalability. Queue data is used both for display and to influence personalization ranking.
Queue RPS and Data Size 

The following graph shows the RPS served by the Queue service, peaking at around 40K RPS. There are more than 150 million records in our data store, with a total size of roughly 300 GB.


Goals

Back when the Queue service was originally designed in 2009, SimpleDB was a good solution. Since then, however, it has not kept pace with our subscriber growth, in terms of both SLA and cost effectiveness. Our goal was to migrate data off of SimpleDB with the following requirements:
  • High Data Consistency 
  • High Reliability and Availability 
  • No downtime for reads and writes 
  • No degradation in performance of the existing application 
After careful consideration and various performance benchmarks, we decided to use Cassandra as the new data store for the Queue service: it is well suited to our high volume, low latency write requirements, and our reads are primarily key-value lookups.

Data Migration 

Migrating data to an eventually consistent data store such as Cassandra for a high volume, low latency application, and verifying its consistency, is a multi-step process. It involves a one-time data forklift followed by applying further changes incrementally. There can be error cases where the incremental updates cannot be successfully applied, for reasons such as timeouts, throttling of data stores, temporary node unavailability, etc. Running an end-to-end consistency checker and validating data through shadow reads helped us better evaluate the consistency of the migrated data. The following sections elaborate on the steps taken to end-of-life SimpleDB for the Queue service.

Our migrator code base is configured to run in one of three modes: Forklift, Incremental Replication, or Consistency Checker.

a) Forklift 
The first step in the process of data migration is to forklift the data from the source data store into the target data store. In this mode, the current snapshot of the source data store is copied in its entirety to the target data store. SimpleDB throttles requests when the RPS to a domain exceeds a certain threshold, to impose fairness on all users of the system. Hence, it is imperative not to put too much load on a SimpleDB domain during migration, as that would affect the SLA of the existing Queue service. Depending on the data size, the throttling of the source data store, and the latency requirements for the migration, we can choose the number of parallel instances and the number of worker threads within each instance that perform the forklift.

Each migrator thread worked on a different data set within a domain, to avoid migrating the same data multiple times. Based on the configured number of threads, the migrator automatically chooses a distinct data set for each thread. The migrator is also time aware: it pauses thread execution during peak hours of production traffic and continues forklifting during off-peak hours. The migrator instances periodically persisted the state of all forklift threads, so if an instance or the forklift application terminated, we could resume the migration from where it had stopped.
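Below is a minimal sketch in Java of the forklift worker loop described above. The SourceStore, TargetStore, and Checkpoints interfaces and the peak-hour window are illustrative assumptions, not the actual migrator’s code.

```java
import java.time.LocalTime;
import java.util.List;

public class ForkliftWorker implements Runnable {
    // Placeholder interfaces; the real migrator wraps SimpleDB and Cassandra clients.
    interface Record { String key(); }
    interface SourceStore { List<Record> nextBatch(int shard, int shards, String afterKey); }
    interface TargetStore { void write(Record record); }
    interface Checkpoints { String lastKey(int shard); void save(int shard, String key); }

    private final SourceStore source;
    private final TargetStore target;
    private final Checkpoints checkpoints;
    private final int shard;        // this thread's slice of the domain
    private final int totalShards;  // configured number of worker threads

    ForkliftWorker(SourceStore source, TargetStore target, Checkpoints checkpoints,
                   int shard, int totalShards) {
        this.source = source;
        this.target = target;
        this.checkpoints = checkpoints;
        this.shard = shard;
        this.totalShards = totalShards;
    }

    @Override
    public void run() {
        // Resume from the last persisted checkpoint, if any.
        String afterKey = checkpoints.lastKey(shard);
        List<Record> batch;
        while (!(batch = source.nextBatch(shard, totalShards, afterKey)).isEmpty()) {
            pauseDuringPeakHours();
            for (Record record : batch) {
                target.write(record); // idempotent upsert into the target store
            }
            afterKey = batch.get(batch.size() - 1).key();
            checkpoints.save(shard, afterKey); // persist progress for resumability
        }
    }

    private static void pauseDuringPeakHours() {
        // Illustrative peak window; back off so production SimpleDB traffic is unaffected.
        while (LocalTime.now().getHour() >= 18) {
            try { Thread.sleep(60_000); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```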

Forklift was run just once, as the initial step of the migration process. It took a little over 30 hours to forklift the entire data set.
b) Incremental Replication 
This phase started after the forklift completed. At this stage, updates to users’ Queues were still sent only to SimpleDB. The migration code was run in Incremental Replication mode to keep Cassandra in sync with the updates that happened after forklifting. In this mode, instead of copying all the data from SimpleDB, only the data that had changed since the previous Incremental Replication run was copied to Cassandra.

We had an attribute called Last_Updated_TS in SimpleDB that was updated on every mutation. This attribute was indexed to get better performance when fetching the source records updated since the last run. We only did soft deletes, setting a delete marker in SimpleDB; this mode would not have been able to handle hard deletes. In this mode, the migration code was run continuously.
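As a hedged illustration, the incremental pull can be expressed against the AWS SDK for Java’s SimpleDB client as below. The queue_domain name and attribute layout are assumptions based on the description above.

```java
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.Item;
import com.amazonaws.services.simpledb.model.SelectRequest;
import com.amazonaws.services.simpledb.model.SelectResult;

public class IncrementalReplicator {

    private final AmazonSimpleDB simpleDB = AmazonSimpleDBClientBuilder.defaultClient();

    /** Copies every record mutated since the previous run into Cassandra. */
    public void replicateSince(String lastRunTimestamp) {
        // SimpleDB comparisons are lexicographic, so Last_Updated_TS must be a
        // zero-padded, sortable string for this range query to be correct.
        String query = "select * from `queue_domain` where Last_Updated_TS > '"
                + lastRunTimestamp + "'";
        String nextToken = null;
        do {
            SelectResult result = simpleDB.select(
                    new SelectRequest(query).withNextToken(nextToken));
            for (Item item : result.getItems()) {
                // Upsert into Cassandra; soft deletes arrive here too, since they
                // are ordinary mutations that set a delete marker attribute.
                writeToCassandra(item);
            }
            nextToken = result.getNextToken();
        } while (nextToken != null);
    }

    private void writeToCassandra(Item item) {
        // Omitted: map the SimpleDB attributes onto the Cassandra data model.
    }
}
```

Because SimpleDB compares attribute values as strings, keeping Last_Updated_TS zero-padded and sortable is what makes the range predicate usable for this kind of incremental scan.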
c) Consistency Checker 
At this stage, Incremental Replication was running continuously. However, there could still be error cases where the incremental updates could not be successfully applied to Cassandra, for reasons such as timeouts, throttling by SimpleDB, temporary node unavailability, etc. To catch these cases, we ran an end-to-end Consistency Checker. This mode is similar to Forklift, except that instead of blindly copying the source data, we compared all the data in both the source and target data stores, and updated the target data store only for the records that mismatched. We kept track of the number of such mismatches for each run, along with related metadata about the mismatched records. The migration code was run continuously in this mode as well.
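The heart of this mode reduces to a compare-and-repair loop; a sketch with placeholder types (the real checker compared full Queue records and recorded metadata about each mismatch):

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.atomic.AtomicLong;

public class ConsistencyChecker {

    /** Placeholder for the Cassandra-side accessor. */
    interface TargetStore {
        Map<String, String> read(String key);
        void write(String key, Map<String, String> record);
    }

    private final AtomicLong mismatches = new AtomicLong();

    /** Compares one source record against the target copy and repairs on mismatch. */
    public void check(String key, Map<String, String> sourceRecord, TargetStore target) {
        Map<String, String> targetRecord = target.read(key);
        if (!Objects.equals(sourceRecord, targetRecord)) {
            target.write(key, sourceRecord);  // SimpleDB is still the truth at this stage
            mismatches.incrementAndGet();     // tracked per run, with related metadata
        }
    }

    public long mismatchCount() {
        return mismatches.get();
    }
}
```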
d) Shadow Writes 
The following steps were taken, in chronological order, to update the Queue service to use Cassandra and eventually end-of-life SimpleDB.
  • Reads: Only from SimpleDB (Source of truth) 
  • Writes: SimpleDB and Cassandra 
At this stage, we updated the Queue service to do shadow writes to Cassandra. The source of truth for reads was still SimpleDB. For every user request to update their Queue, which previously updated only SimpleDB, an additional asynchronous request to update Cassandra was submitted. We kept track of the number of successful and unsuccessful updates to Cassandra; any unsuccessful update would eventually be fixed by the Incremental Replicator or the Consistency Checker. Like every other project at Netflix, to make sure our Cassandra cluster could handle the production write traffic, we rolled out this feature incrementally, starting with 1% of our users, then 10%, and eventually 100%. This gave us a good indication of Cassandra write latencies before we made it the source of truth.
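A minimal sketch of this shadow-write pattern, assuming a hypothetical QueueUpdate type and a hash-based percentage gate (the post does not describe the actual rollout mechanism):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

public class QueueWriter {

    static class QueueUpdate { /* video id, position, timestamps, ... */ }

    private final AtomicLong shadowFailures = new AtomicLong();
    private volatile int rolloutPercent = 1; // dialed up: 1% -> 10% -> 100%

    public void updateQueue(String customerId, QueueUpdate update) {
        writeToSimpleDB(customerId, update); // source of truth, synchronous as before

        if (inRollout(customerId)) {
            // Shadow write: asynchronous and best effort. Failures are only counted;
            // Incremental Replication or the Consistency Checker repairs them later.
            CompletableFuture
                    .runAsync(() -> writeToCassandra(customerId, update))
                    .exceptionally(t -> {
                        shadowFailures.incrementAndGet();
                        return null;
                    });
        }
    }

    private boolean inRollout(String customerId) {
        // Hypothetical percentage gate; buckets users deterministically.
        return Math.floorMod(customerId.hashCode(), 100) < rolloutPercent;
    }

    private void writeToSimpleDB(String id, QueueUpdate u) { /* existing path */ }
    private void writeToCassandra(String id, QueueUpdate u) { /* new path */ }
}
```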
e) Shadow Writes and Shadow Reads for Validation 
  • Reads: SimpleDB (Source of truth) and Cassandra 
  • Writes: SimpleDB and Cassandra 
Once shadow writes, Incremental Replication, and the Consistency Checker were up and running, the next step was to do shadow reads. The source of truth for reads was still SimpleDB. At this stage, for every user request to fetch a user’s Queue, an additional asynchronous request to fetch their Queue from Cassandra was submitted. Once the asynchronous request completed, the Queue data returned from SimpleDB and Cassandra was compared. We kept track of the number of requests for which the data in these two stores mismatched; the mismatched records would eventually be fixed by Incremental Replication or the Consistency Checker. Again, to make sure our Cassandra cluster could handle the production read traffic, we rolled out this feature incrementally. This shadow read traffic also helped us evaluate Cassandra read latencies under production traffic patterns.
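The shadow read mirrors the same pattern on the fetch path; again a sketch with placeholder types:

```java
import java.util.Objects;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

public class QueueReader {

    static class QueueData { /* ordered videos plus metadata */ }

    private final AtomicLong readMismatches = new AtomicLong();

    public QueueData getQueue(String customerId) {
        QueueData truth = readFromSimpleDB(customerId); // still the source of truth

        // Shadow read: fetch the Cassandra copy off the request path and compare.
        CompletableFuture.runAsync(() -> {
            QueueData shadow = readFromCassandra(customerId);
            if (!Objects.equals(truth, shadow)) {
                readMismatches.incrementAndGet(); // repaired by the background modes
            }
        });

        return truth; // the user always sees the SimpleDB result at this stage
    }

    private QueueData readFromSimpleDB(String id) { return new QueueData(); }
    private QueueData readFromCassandra(String id) { return new QueueData(); }
}
```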
f) End of Life SimpleDB 
  • Reads: Cassandra (Source of truth) 
  • Writes: Cassandra 
Within a short span of time, the data mismatches found during shadow reads, Incremental Replication, and the Consistency Checker became minimal (<0.01%). At this stage, we flipped a flag to make Cassandra the source of truth. After that, all requests to fetch a user's Queue were synchronously served from Cassandra, and all updates to the Queue were written only to Cassandra. SimpleDB was finally laid to rest in peace.
Life at Netflix

When we started this project, the only requirement given to us was to remove SimpleDB as a dependency. It was up to us to choose the right persistence store; we chose Cassandra and designed the appropriate data models for it. One of the things we loved about this project was the speed at which it was executed, which, by the way, was completely determined by us. We made several code pushes to production every week, but that comes with a huge responsibility to make sure our code is well covered by unit and integration tests. It is amazing to see ideas being formed, implemented, and pushed to production in such a short span of time.

If these kinds of scalability problems coupled with our freedom and responsibility enthuse you, we are looking for Senior Software Engineers on the Product Infrastructure team. At Netflix, you’ll be working with some of the brightest minds in the industry. Visit http://jobs.netflix.com to get started. 

Friday, January 27, 2012

Ephemeral Volatile Caching in the cloud

by Shashi Madappa


In most applications there is some amount of data that will be frequently used. Some of this data is transient and can be recalculated, while other data will need to be fetched from the database or a middle tier service. In the Netflix cloud architecture we use caching extensively to offset some of these operations.  This document details Netflix’s implementation of a highly scalable memcache-based caching solution, internally referred to as EVCache.

Why do we need Caching?

Some of the objectives of the cloud initiative were:
    • Faster response times compared to the Netflix data center based solution
    • Moving from a session-based app in the data center to a stateless app without sessions in the cloud
    • Using NoSQL-based persistence like Cassandra/SimpleDB/S3

To meet these objectives, we needed the ability to store data in a cache that was fast, shared, and scalable. We use the cache to front data that is computed or retrieved from persistence stores like Cassandra, or other AWS services like S3 and SimpleDB; these stores can take several hundred milliseconds at the 99th percentile, causing a widely variable user experience. By fronting this data with a cache, access times become much faster and more consistent, and the load on these datastores is greatly reduced. Caching also enables us to respond to sudden request spikes more effectively. Additionally, an overloaded service can often return a prior cached response, which ensures the user gets a personalized response instead of a generic one. By using caching effectively, we have reduced our total cost of operation.

What is EVCache?

EVCache is a caching solution based on memcached and spymemcached that is well integrated with Netflix and AWS EC2 infrastructure.

EVCache is an abbreviation for:

Ephemeral – The data is stored for a short duration, as specified by its TTL1 (Time To Live).
Volatile – The data can disappear at any time (Evicted2).
Cache – An in-memory key-value store.


How is it used?

We have over 25 different use cases for EVCache within Netflix. One example is a user’s home page: to decide which rows to show a particular user, the algorithm needs to know the user’s taste, movie viewing history, Queue, ratings, etc. This data is fetched from various services in parallel, and those services front it with EVCache.

Features

We will now detail the features, including both Netflix add-ons and those that come with memcached.

  • Overview
    • Distributed key-value store, i.e., the cache is spread across multiple instances.
    • AWS zone-aware; data can be replicated across zones.
    • Registers and works with Netflix’s internal Naming Service for automatic discovery of new nodes/services.
    • Keys must be non-null Strings; values can be non-null byte arrays, primitives, or serializable objects, and should be less than 1 MB.
    • As a generic cache cluster used across various applications, it supports an optional cache name, used as a namespace to avoid key collisions (see the usage sketch after this list).
    • Typical cache hit rates are above 99%.
    • Works well with the Netflix Persister Framework7, e.g., in-memory, backed by EVCache, backed by Cassandra/SimpleDB/S3.
  • Elasticity and deployment ease: EVCache is linearly scalable. We monitor capacity and can add capacity within a minute, then re-balance and warm data in the new node within a few minutes. Note that we have pretty good capacity modeling in place, so capacity changes are not something we do very frequently, but we have good ways of adding capacity while actively managing the cache hit rate. Stay tuned for more on this scalable cache warmer in an upcoming blog post.
  • Latency: Typical response times are in the low milliseconds. Reads from EVCache are typically served from within the same AWS zone. A nice side effect of zone affinity is that we don’t incur any data transfer fees for reads.
  • Inconsistency: This is a best-effort cache, and the data can become inconsistent. We have chosen speed over consistency, and the applications that depend on EVCache are capable of handling any inconsistency. For data stored for a short duration, the TTL ensures that inconsistent data expires; for data stored longer, we have built consistency checkers that repair it.
  • Availability: Clusters rarely go down entirely, as they are spread across multiple Amazon availability zones. When instances do go down occasionally, cache misses are minimal, as we use consistent hashing to shard the data across the cluster.
  • Total cost of operations: Beyond the very low cost of operating the EVCache cluster, one has to remember that cache misses are generally much costlier: the cost of accessing services such as AWS SimpleDB, AWS S3, and (to a lesser degree) Cassandra on EC2 must be factored in as well. We are happy with the overall cost of operations of EVCache clusters, which are highly stable and linearly scalable.
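For a feel of the client API, here is a usage sketch modeled on the EVCache client as it was later open sourced; the exact builder and method signatures may differ from the internal version described in this post.

```java
import com.netflix.evcache.EVCache;

public class MovieMetadataCache {

    // The app name identifies the cluster; the cache prefix is the optional
    // namespace that keeps applications from colliding on keys.
    private final EVCache cache = new EVCache.Builder()
            .setAppName("EVCACHE_APP")
            .setCachePrefix("movies")
            .setDefaultTTL(900) // seconds; the data is ephemeral by design
            .build();

    public void put(String movieId, String metadataJson) throws Exception {
        cache.set(movieId, metadataJson); // replicated to each zone by the client
    }

    public String get(String movieId) throws Exception {
        return cache.<String>get(movieId); // served from the caller's own zone
    }
}
```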

Under the Hood

Server: The server consists of the following:
  • memcached server.
  • Java sidecar – a Java app that communicates with the Discovery Service6 (Name Server) and hosts admin pages.
  • Various apps that monitor the health of the services and report stats.



Client: A Java client discovers EVCache servers and manages all CRUD3 (Create, Read, Update & Delete) operations. The client automatically handles servers being added to or removed from the cluster. The client replicates data across zones5 during Create, Update & Delete operations; for Read operations, it fetches the data from the server in its own zone.

We will be open sourcing this Java client sometime later this year so we can share more of our learnings with the developer community.

Single Zone Deployment

The figure below illustrates a scenario in the AWS US-EAST region4, Zone-A, where a web application performs CRUD operations against an EVCache cluster of 3 instances.

  1. Upon startup, an EVCache server instance registers with the Naming Service6 (Netflix’s internal name service that tracks all the hosts we run).
  2. During startup of the Web App, the EVCache client library is initialized; it looks up all the EVCache server instances registered with the Naming Service and establishes connections to them.
  3. When the Web App needs to perform a CRUD operation for a key, the EVCache client selects the instance on which that operation should be performed. We use consistent hashing to shard the data across the cluster (a standalone illustration follows).
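Consistent hashing is what keeps misses minimal as servers join or leave. The standalone illustration below uses virtual nodes on a hash ring; the production client relies on spymemcached’s Ketama implementation rather than code like this.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private static final int VIRTUAL_NODES = 160; // smooths the key distribution

    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addServer(String server) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    public void removeServer(String server) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    /** Walk clockwise from the key's hash to the first virtual node. */
    public String serverFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            // Fold the first 8 bytes of the MD5 digest into a long.
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xff);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

When a server is removed, only the keys that hashed between it and its ring predecessor move to another server; the rest of the cache is untouched, which is why misses stay minimal.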

Multi-Zone Deployment

The figure below illustrates replication across multiple zones in the AWS US-EAST region, with an EVCache cluster of 3 instances and a Web App in each of Zone-A and Zone-B.

  1. Upon startup, an EVCache server instance in Zone-A registers with the Naming Service in Zone-A and Zone-B.
  2. During startup of the Web App in Zone-A, it initializes the EVCache client library, which looks up all the EVCache server instances registered with the Naming Service and connects to them across all zones.
  3. When the Web App in Zone-A needs to read the data for a key, the EVCache client looks up the EVCache server instance in Zone-A that stores this data and fetches the data from that instance.
  4. When the Web App in Zone-A needs to write or delete the data for a key, the EVCache client looks up the EVCache server instances in both Zone-A and Zone-B and writes to or deletes from both (an illustrative reduction of this routing follows).
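An illustrative reduction of that zone-aware routing, with placeholder types; the real client implements this inside its memcached connection management.

```java
import java.util.Map;

public class ZoneAwareClient {

    /** Placeholder for a per-zone memcached connection pool. */
    interface MemcachedZone {
        void set(String key, byte[] value, int ttlSeconds);
        byte[] get(String key);
    }

    private final String myZone;                    // e.g., "us-east-1a"
    private final Map<String, MemcachedZone> zones; // zone name -> replica

    public ZoneAwareClient(String myZone, Map<String, MemcachedZone> zones) {
        this.myZone = myZone;
        this.zones = zones;
    }

    /** Writes and deletes fan out to every zone's replica. */
    public void set(String key, byte[] value, int ttlSeconds) {
        for (MemcachedZone zone : zones.values()) {
            zone.set(key, value, ttlSeconds);
        }
    }

    /** Reads stay in the caller's zone: lower latency, no cross-zone transfer fees. */
    public byte[] get(String key) {
        return zones.get(myZone).get(key);
    }
}
```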

Case Study: Movie and TV Show Similarity

One of the applications that uses caching heavily is the Similars application, which suggests movies and TV shows that are similar to each other. Once the similarities are calculated, they are persisted in SimpleDB/S3 and fronted with EVCache. When any service, application, or algorithm needs this data, it is retrieved from EVCache and the result is returned (a cache-aside sketch follows the steps below).
  1. A client sends a request to the WebApp for a page, and the algorithm processing this request needs the similars for a movie to compute its response.
  2. The WebApp that needs the list of similars for a movie or TV show looks them up in EVCache. The typical cache hit rate is above 99.9%.
  3. If there is a cache miss, the WebApp calls the Similars App to compute this data.
  4. If the data was previously computed but is missing from the cache, the Similars App reads it from SimpleDB. If it is missing from SimpleDB as well, the app calculates the similars for the given movie or TV show.
  5. The computed data for the movie or TV show is then written to EVCache.
  6. The Similars App then computes the response needed by the client and returns it.
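The six steps above are the classic cache-aside pattern. A compact sketch with placeholder interfaces:

```java
public class SimilarsService {

    interface Cache { String get(String key); void set(String key, String value); }
    interface Store { String read(String key); void write(String key, String value); }
    interface Computer { String compute(String videoId); }

    private final Cache cache;        // the EVCache front
    private final Store persistence;  // SimpleDB/S3 copy of computed similars
    private final Computer computer;  // recomputes similars on a full miss

    SimilarsService(Cache cache, Store persistence, Computer computer) {
        this.cache = cache;
        this.persistence = persistence;
        this.computer = computer;
    }

    public String similarsFor(String videoId) {
        String similars = cache.get(videoId);         // hit rate is above 99.9%
        if (similars == null) {
            similars = persistence.read(videoId);     // previously computed?
            if (similars == null) {
                similars = computer.compute(videoId); // full recompute
                persistence.write(videoId, similars);
            }
            cache.set(videoId, similars);             // repopulate the cache
        }
        return similars;
    }
}
```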

Metrics, Monitoring, and Administration

Administration of the various clusters is centralized, and all admin and monitoring of clusters and instances can be performed via the web tool illustrated below.

The server view below shows the details of each instance in the cluster and also rolls up the stats by zone. Using this tool, the contents of a memcached slab can also be viewed.

The EVCache clusters currently serve over 200K requests/sec at peak load. The chart below shows the number of requests to EVCache every hour.

The average latency is around 1 to 5 milliseconds, and the 99th percentile is around 20 milliseconds.
Typical cache hit rates are above 99%.


Join Us

Like what you see and want to work on bleeding edge performance and scale?  
We’re hiring!

References

  1. TTL: Time To Live for data stored in the cache. After this time the data expires and, when requested, will not be returned.
  2. Evicted: Data associated with a key can be evicted (removed) from the cache even though its TTL has not yet been exceeded. This happens when the cache is running low on memory and needs to make space for new data. Eviction is based on LRU (Least Recently Used).
  3. CRUD: Create, Read, Update, and Delete – the basic functions of storage.
  4. AWS Region: A geographical region; currently US East (Virginia), US West, EU (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo), and South America (São Paulo).
  5. AWS Zone: Each availability zone runs on its own physically distinct, independent infrastructure. You can think of a zone as a data center.
  6. Naming Service: A service developed by Netflix that acts as a registry for all the instances that run Netflix services.
  7. Netflix Persister Framework: A framework developed by Netflix that helps users persist data across various datastores (in-memory/EVCache/Cassandra/SimpleDB/S3) through a simple API.

by Shashi Madappa, Senior Software Engineer, Personalization Infrastructure Team


Friday, January 28, 2011

NoSQL at Netflix

This is Yury Izrailevsky, Director of Cloud and Systems Infrastructure here at Netflix. As Netflix moved into the cloud, we needed to find the appropriate mechanisms to persist and query data within our highly distributed infrastructure. Our goal is to build fast, fault-tolerant systems at Internet scale. We realized that in order to achieve this goal, we needed to move beyond the constraints of the traditional relational model. In the distributed world governed by Eric Brewer’s CAP theorem, high availability (a.k.a. better customer experience) usually trumps strong consistency. There is little room for vertical scalability or single points of failure. And while it is not easy to re-architect your systems to not run join queries, or not to rely on read-after-write consistency (hey, just cache the value in your app!), we have found ourselves braving the new frontier of NoSQL distributed databases.

Our cloud-based infrastructure has many different use cases requiring structured storage access. Netflix is all about using the right tool for the job. In this post, I’d like to touch on the reasons behind our choice of three such NoSQL tools: SimpleDB, Hadoop/HBase and Cassandra.

Amazon SimpleDB was a natural choice for a number of our use cases as we moved into AWS cloud. SimpleDB is highly durable, with writes automatically replicated across availability zones within a region. It also features some really handy query and data format features beyond a simple key/value interface, such as multiple attributes per row key, batch operations, consistent reads, etc. Besides, SimpleDB is a hosted solution, administered by our friends at AWS. We love it when others do undifferentiated heavy lifting for us; after all, this was one of the reasons we moved to the cloud in the first place. If you are accustomed to other AWS products and services, using SimpleDB is… well, simple – same AWS account, familiar interfaces, APIs, integrated support and billing, etc.
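To make those features concrete, here is a hedged sketch of batch writes and consistent reads using the AWS SDK for Java’s SimpleDB client; the domain, item, and attribute names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.Attribute;
import com.amazonaws.services.simpledb.model.BatchPutAttributesRequest;
import com.amazonaws.services.simpledb.model.GetAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.simpledb.model.ReplaceableItem;

public class SimpleDBExample {
    public static void main(String[] args) {
        AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.defaultClient();

        // Batch operation: write two items, each with multiple attributes per row key.
        sdb.batchPutAttributes(new BatchPutAttributesRequest("videos", Arrays.asList(
                new ReplaceableItem("movie-1").withAttributes(
                        new ReplaceableAttribute("title", "Example One", true),
                        new ReplaceableAttribute("genre", "Drama", true)),
                new ReplaceableItem("movie-2").withAttributes(
                        new ReplaceableAttribute("title", "Example Two", true)))));

        // Consistent read: opt out of the default eventually consistent read.
        List<Attribute> attrs = sdb.getAttributes(
                new GetAttributesRequest("videos", "movie-1")
                        .withConsistentRead(true)).getAttributes();
        attrs.forEach(a -> System.out.println(a.getName() + " = " + a.getValue()));
    }
}
```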

For our systems based on Hadoop, Apache HBase is a convenient, high-performance column-oriented distributed database solution. With its dynamic partitioning model, HBase makes it really easy to grow your cluster and re-distribute load across nodes at runtime, which is great for managing our ever-growing data volume needs and avoiding hot spots. Built-in support for data compression, range queries spanning multiple nodes, and even native support for distributed counters make it an attractive alternative for many of our use cases. HBase’s strong consistency model can also be handy, although it comes with some availability trade offs. Perhaps the biggest utility comes from being able to combine real-time HBase queries with batch map-reduce Hadoop jobs, using HDFS as a shared storage platform.

Last but not least, I want to talk about our use of Cassandra. Distributed under the Apache license, Cassandra is an open source NoSQL database that is all about flexibility, scalability and performance. DataStax, a company that professionally supports Cassandra, has been great at helping us quickly learn and operate the system. Unlike a distributed database solution using e.g. MySQL or even SimpleDB, Cassandra (like HBase) can scale horizontally and dynamically by adding more servers, without the need to re-shard – or reboot, for that matter. In fact, Cassandra seeks to avoid vertical scalability limits and bottlenecks of any sort: there are no dedicated name nodes (all cluster nodes can serve as such), no practical architectural limitations on data sizes, row/column counts, etc. Performance is strong, especially for write throughput. Cassandra’s extremely flexible data model deserves a special mention. The sparse two-dimensional “super-column family” architecture allows for rich data model representations (and better performance) beyond a simple key-value lookup. And there are no underlying storage format requirements like HDFS; all you need is a file system. Some of the most attractive features of Cassandra are its uniquely flexible consistency and replication models. Applications can determine at the call level what consistency level to use for reads and writes (single, quorum or all replicas). This, combined with a customizable replication factor and special support for determining which cluster nodes to designate as replicas, makes it particularly well suited for cross-datacenter and cross-regional deployments. In effect, a single global Cassandra cluster can simultaneously service applications and asynchronously replicate data across multiple geographic locations.
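As a concrete, hedged illustration of that per-call consistency choice, here is a sketch using the much later DataStax Java driver 3.x; Netflix in 2011 would have used Thrift-era clients, and the keyspace and table here are made up.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class PerCallConsistency {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {

            // Fast write: a single replica acknowledges; replication converges async.
            session.execute(new SimpleStatement(
                    "INSERT INTO queue (id, video) VALUES ('user1', 'video42')")
                    .setConsistencyLevel(ConsistencyLevel.ONE));

            // Stronger read: a quorum of replicas must agree, at a latency cost.
            session.execute(new SimpleStatement(
                    "SELECT video FROM queue WHERE id = 'user1'")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM));
        }
    }
}
```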

We use multiple NoSQL solutions because each one is best suited for a specific set of use cases. For example, HBase is naturally integrated with the Hadoop platform, whereas Cassandra is best for cross-regional deployments and scaling with no single points of failure. Adopting the non-relational model in general is not easy, and Netflix has been paying a steep pioneer tax while integrating these rapidly evolving and still maturing NoSQL products. There is a learning curve and an operational overhead. Still, the scalability, availability and performance advantages of the NoSQL persistence model are evident, are already paying for themselves, and will be central to our long-term cloud strategy.

Building the leading global content streaming platform is a huge challenge. NoSQL is just one example of an exciting technology area that we aggressively leverage (and in the case of open source projects, contribute back to). Our goal is infinite scale. It takes no less than a superstar team to make it a reality. For those technology superstars out there: Netflix is hiring (http://jobs.netflix.com).