
Friday, July 24, 2015

Java in Flames

Java mixed-mode flame graphs provide a complete visualization of CPU usage and have just been made possible by a new JDK option: -XX:+PreserveFramePointer. We've been developing these at Netflix for everyday Java performance analysis as they can identify all CPU consumers and issues, including those that are hidden from other profilers.

Example
This shows CPU consumption by a Java process, both user- and kernel-level, during a vert.x benchmark:

Click to zoom (SVG, PNG). Showing all CPU usage with Java context is amazing and useful. On the top right you can see a peak of kernel code (colored red) for performing a TCP send (which often leads to a TCP receive while handling the send). Beneath it (colored green) is the Java code responsible. In the middle (colored green) is the Java code that is running on-CPU. And in the bottom left, a small yellow tower shows CPU time spent in GC.

We've already used Java flame graphs to quantify performance improvements between frameworks (Tomcat vs rxNetty), which included identifying time spent in Java code compilation, the Java code cache, other system libraries, and differences in kernel code execution. All of these CPU consumers were invisible to other Java profilers, which only focus on the execution of Java methods.

Flame Graph Interpretation
If you are new to flame graphs: The y axis is stack depth, and the x axis spans the sample population. Each rectangle is a stack frame (a function), where the width shows how often it was present in the profile. The ordering from left to right is unimportant (the stacks are sorted alphabetically).

In the previous example, color hue was used to highlight different code types: green for Java, yellow for C++, and red for system. Color intensity was simply randomized to differentiate frames (other color schemes are possible).

You can read the flame graph from the bottom up, which follows the flow of code from parent to child functions. Another way is top down, as the top edge shows the function running on CPU, and beneath it is its ancestry. Focus on the widest functions, which were present in the profile the most. See the CPU flame graphs page for more about interpretation, and Brendan's USENIX/LISA'13 talk (video).

The Problem with Profilers
In order to generate flame graphs, you need a profiler that can sample stack traces. There have historically been two types of profilers used on Java:
  • System profilers: such as Linux perf_events, which can profile system code paths, including libjvm internals, GC, and the kernel, but not Java methods.
  • JVM profilers: such as hprof, Lightweight Java Profiler (LJP), and commercial profilers. These show Java methods, but not system code paths.
To understand all types of CPU consumers, we previously used both types of profilers, creating a flame graph for each. This worked – sort of. While all CPU consumers could be seen, Java methods were missing from the system profile, which was crucial context we needed.

Ideally, we would have one flame graph that shows it all: system and Java code together.

A system profiler like Linux perf_events should be well suited to this task as it can interrupt any software asynchronously and capture both user- and kernel-level stacks. However, system profilers don't work well with Java. The problem is shown by the flame graph on the right. The Java stacks and method names are missing.

There were two specific problems to solve:
  1. The JVM compiles methods on the fly (just-in-time: JIT), and doesn't expose a symbol table for system profilers.
  2. The JVM also uses the frame pointer register on x86 (RBP on x86-64) as a general-purpose register, breaking traditional stack walking.
Brendan summarized these earlier this year in his Linux Profiling at Netflix talk for SCALE. Fortunately, there was already a fix for the first problem.

Fixing Symbols
In 2009, Linux perf_events added JIT symbol support, so that symbols from language virtual machines like the JVM could be inspected. To use it, your application creates a /tmp/perf-PID.map text file, which lists symbol addresses (in hex), sizes, and symbol names. perf_events looks for this file by default and, if found, uses it for symbol translations.
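For example, each line of such a map file gives a start address, a size, and a symbol name, all in hex (the addresses, sizes, and method names below are made up for illustration):
7f9a1d0704c0 80 java.lang.String::hashCode
7f9a1d070600 1a0 java.lang.String::indexOf
7f9a1d072b00 4c0 org.example.MyHandler::handleRequest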

Java can create this file using perf-map-agent, an open source JVMTI agent written by Johannes Rudolph. The first version needed to be attached on Java startup, but Johannes enhanced it to attach later on demand and take a symbol dump. That way, we only load it if we need it for a profile. Thanks, Johannes!

Since symbols can change slightly during the profile (we’re typically profiling for 30 or 60 seconds), a symbol dump may include stale symbols. We’ve looked at taking two symbol dumps, before and after the profile, to highlight any such differences. Another approach in development involves a timestamped symbol log to ensure that all translations are accurate (although this requires always-on logging of symbols). So far symbol churn hasn’t been a large problem for us: once Java and the JIT have “warmed up” (which can take a few minutes, given sufficient load), churn is minimal. We do bear it in mind when interpreting flame graphs.

Fixing Frame Pointers
For many years the gcc compiler has reused the frame pointer as a compiler optimization, breaking stack traces. Some applications are compiled with the gcc option -fno-omit-frame-pointer to preserve this type of stack walking; the JVM, however, had no equivalent option. Could the JVM be modified to support this?

Brendan was curious to find out, and hacked a working prototype for OpenJDK. It involved dropping RBP from eligible register pools, eg (diff):
--- openjdk8clean/hotspot/src/cpu/x86/vm/x86_64.ad      2014-03-04 02:52:11.000000000 +0000
+++ openjdk8/hotspot/src/cpu/x86/vm/x86_64.ad   2014-11-08 01:10:49.686044933 +0000
@@ -166,10 +166,9 @@
 // 3) reg_class stack_slots( /* one chunk of stack-based "registers" */ )
 //

-// Class for all pointer registers (including RSP)
+// Class for all pointer registers (including RSP, excluding RBP)
 reg_class any_reg(RAX, RAX_H,
                   RDX, RDX_H,
-                  RBP, RBP_H,
                   RDI, RDI_H,
                   RSI, RSI_H,
                   RCX, RCX_H,
... and then fixing the function prologues to store the stack pointer (rsp) into the frame pointer (base pointer) register (rbp):
--- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04 02:52:11.000000000 +0000
+++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp      2014-11-07 23:57:11.589593723 +0000
@@ -5236,6 +5236,7 @@
     // We always push rbp, so that on return to interpreter rbp, will be
     // restored correctly and we can correct the stack.
     push(rbp);
+    mov(rbp, rsp);
     // Remove word for ebp
     framesize -= wordSize;
It worked. Here are the before and after flame graphs. Brendan posted it, with example flame graphs, to the hotspot compiler devs mailing list. This feature request became JDK-8068945 for JDK9 and JDK-8072465 for JDK8.

Fixing this properly involved a lot more work (see discussions in the bugs and mailing list). Zoltán Majó, of Oracle, took this on and rewrote the patch. After testing, it was finally integrated into the early access releases of both JDK9 and JDK8 (JDK8 update 60 build 19), as the new JDK option: -XX:+PreserveFramePointer.

Many thanks to Zoltán, Oracle, and the other engineers who helped get this done!

Since use of this mode disables a compiler optimization, it does decrease performance slightly. We've found in tests that this costs between 0 and 3% extra CPU, depending on the workload. See JDK-8068945 for some additional benchmarking details. There are other techniques for walking stacks, some with zero run-time cost to make available; however, those approaches come with other downsides.

Instructions
The following steps describe how these flame graphs can be created. We’re working on improving and automating these steps using Vector (more on that in a moment).

1. Install software
There are four components to install:

Linux perf_events
This is the standard Linux profiler, aka “perf” after its front end, and is included in the Linux source (tools/perf). Try running perf help to see if it is installed; if not, your distro may suggest how to get it, usually by adding a package such as linux-tools-common.

Java 8 update 60 build 19 (or newer)
This includes the frame pointer patch fix (JDK-8072465), which is necessary for Java stack profiling. It is currently released as early access (built from OpenJDK).

perf-map-agent
This is the JVMTI agent that provides Java symbol translation for perf_events; it is on github. Steps to build it typically involve:
apt-get install cmake
export JAVA_HOME=/path-to-your-new-jdk8
git clone --depth=1 https://github.com/jrudolph/perf-map-agent
cd perf-map-agent
cmake .
make
The current version of perf-map-agent can be loaded on demand, after Java is running.
WARNING: perf-map-agent is experimental code – use at your own risk, and test before use!

FlameGraph
This is some Perl software for generating flame graphs. It can be fetched from github:
git clone --depth=1 https://github.com/brendangregg/FlameGraph
This contains stackcollapse-perf.pl, for processing perf_events profiles, and flamegraph.pl, for generating the SVG flame graph.

2. Configure Java
Java needs to be running with the -XX:+PreserveFramePointer option, so that perf_events can perform frame pointer stack walks. As mentioned earlier, this can cost some performance, between 0 and 3% depending on the workload.
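For example (the application jar here is just a placeholder for your own workload):
java -XX:+PreserveFramePointer -jar yourapp.jar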

3a. Generate System Wide Flame Graphs
With this software and Java running with frame pointers, we can profile and generate flame graphs.

For example, taking a 30-second profile at 99 Hertz (samples per second) of all processes, then caching symbols for Java PID 1690, then generating a flame graph:
sudo perf record -F 99 -a -g -- sleep 30
java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar net.virtualvoid.perf.AttachOnce 1690    # run as same user as java
sudo chown root /tmp/perf-*.map
sudo perf script | stackcollapse-perf.pl | \
    flamegraph.pl --color=java --hash > flamegraph.svg
The attach-main.jar file is from perf-map-agent, and stackcollapse-perf.pl and flamegraph.pl are from FlameGraph. Specify their full paths unless they are in the current directory.

These steps address some quirky behavior involving user permissions: sudo perf script only reads symbol files owned by the current user (root), and perf-map-agent creates files owned by the same user as the Java process, which for us is usually non-root. This means we have to change the ownership of the symbol file to root, and then run perf script.

With jmaps
Dealing with symbol files has become a chore, so we’ve been automating it. Here’s one example: jmaps, which can be used like so:
sudo perf record -F 99 -a -g -- sleep 30; sudo jmaps
sudo perf script | stackcollapse-perf.pl | \
    flamegraph.pl --color=java --hash > flamegraph.svg
jmaps creates symbol files for all Java processes, with root ownership. You may want to write a similar “jmaps” helper for your environment (our jmaps example is unsupported). Remember to clean up the /tmp symbol files when you no longer need them!

3b. Generate By-Process Flame Graphs
The previous procedure grouped Java processes together. If it is important to separate them (and, on some of our instances, it is), you can modify the procedure to generate a by-process flame graph. Eg (with jmaps):
sudo perf record -F 99 -a -g -- sleep 30; sudo jmaps
sudo perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace | \
    stackcollapse-perf.pl --pid | \
    flamegraph.pl --color=java --hash > flamegraph.svg
The output of stackcollapse-perf.pl formats each stack as a single line, and is great food for grep/sed/awk. For the flamegraph at the top of this post, we used the above procedure, and added “| grep java-339” before the “| flamegraph.pl”, to isolate that one process. You could also use a “| grep -v cpu_idle”, to exclude the kernel idle threads.
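Putting it together, the full pipeline for that flame graph looked roughly like this (java-339 was the process name on that particular instance):
sudo perf record -F 99 -a -g -- sleep 30; sudo jmaps
sudo perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace | \
    stackcollapse-perf.pl --pid | grep java-339 | \
    flamegraph.pl --color=java --hash > flamegraph.svg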

Missing Frames
If you start using these flame graphs, you’ll notice that many Java frames (methods) are missing. Compared to the jstack(1) command line tool, the stacks seen in the flame graph look perhaps one third as deep, and are missing many frames. This is because of inlining, combined with this type of profiling (frame pointer based) which only captures the final executed code.

This hasn’t been much of a problem so far: even when many frames are missing, enough remain that we can figure out what’s going on. We’ve also experimented with reducing the amount of inlining, eg, using -XX:InlineSmallCode=500, to increase the number of frames in the profile. In some cases this even improves performance slightly, as the final compiled instruction size is reduced, fitting better into the processor caches (we confirmed this using perf_events separately).
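As a sketch, that tunable can be combined with the frame pointer option on the java command line (the jar name is a placeholder, and 500 is just the example value mentioned above):
java -XX:+PreserveFramePointer -XX:InlineSmallCode=500 -jar yourapp.jar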

Another approach is to use JVMTI information to unfold the inlined symbols. perf-map-agent has a mode to do this; however, Min Zhou from LinkedIn has experienced Java crashes when using this, which he has been fixing in his version. We’ve not seen these crashes (as we rarely use that mode), but be warned.

Vector
The previous steps for generating flame graphs are a little tedious. As we expect these flame graphs will become an everyday tool for Java developers, we’ve looked at making them as easy as possible: a point-and-click interface. We’ve been prototyping this with our open source instance analysis tool: Vector.

Vector was described in more detail in a previous techblog post. It provides a simple way for users to visualize and analyze system and application-level metrics in near real-time, and flame graphs are a great addition to the set of functionality it already provides.

We tried to keep the user interaction as simple as possible. To generate a flame graph, you connect Vector to the target instance, add the flame graph widget to the dashboard, then click the generate button. That's it!

Behind the scenes, Vector requests a flame graph from a custom instance agent that we developed, which also supplies Vector's other metrics. Vector checks the status of this request while fetching and displaying other metrics, and displays the flame graph when it is ready.

Our custom agent is not generic enough to be used by everyone yet (it depends on the Netflix environment), so we have yet to open-source it. If you're interested in testing or extending it, reach out to us.

Future Work
We have some enhancements planned. One is for regression analysis, by automatically collecting flame graphs over different days and generating flame graph differentials for them. This will help us quickly understand changes in CPU usage due to software changes.

Apart from CPU profiling, perf_events can also trace user- and kernel-level events, including disk I/O, networking, scheduling, and memory allocation. When these are synchronously triggered by Java, a mixed-mode flame graph will show the code paths that led to these events. A page fault mixed-mode flame graph, for example, can be used to show which Java code paths led to an increase in main memory usage (RSS).
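As an illustrative sketch (assuming your kernel exposes the standard page-faults software event to perf_events), such a profile could be collected and rendered with the same tools:
sudo perf record -e page-faults -a -g -- sleep 30; sudo jmaps
sudo perf script | stackcollapse-perf.pl | \
    flamegraph.pl --color=java --hash > pagefaults.svg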

We also want to develop enhancements for flame graphs and Vector, including real time updates. For this to work, our agent will collect perf_events directly and return a data structure representing the partial flame graph to Vector with every check. Vector, with this information, will be able to assemble the flame graph in real time, while the profile is still being collected. We are also investigating using D3 for flame graphs, and adding interactivity improvements.

Other Work
Twitter have also explored making perf_events and Java work better together, which Kaushik Srenevasan summarized in his Tracing and Profiling talk from OSCON 2014 (slides). Kaushik showed that perf_events has much lower overhead when compared to some other Java profilers, and included a mixed-mode stack trace from perf_events. David Keenan from Twitter also described this work in his Twitter-Scale Computing talk (video), as well as summarizing other performance enhancements they have been making to the JVM.

At Google, Stephane Eranian has been working on perf_events and Java as well and has posted a patch series that supports a timestamped JIT symbol transaction log from Java for accurate symbol translation, solving the stale symbol problem. It’s impressive work, although a downside with the logging technique may be the performance cost of always logging symbols even if a profiler is never used.

Conclusion
CPU mixed-mode flame graphs help identify and quantify all CPU consumers. They show the CPU time spent in Java methods, system libraries, and the kernel, all in one visualization. They reveal CPU consumers that are invisible to other profilers, and have so far been used to identify issues and explain performance changes between software versions.

These mixed-mode flame graphs have been made possible by a new option in the JVM: -XX:+PreserveFramePointer, available in early access releases. In this post we described how these work, the challenges that were addressed, and provided instructions for their generation. Similar visibility for Node.js was described in our earlier post: Node.js in Flames.

by Brendan Gregg and Martin Spier

Tuesday, November 11, 2014

Genie 2.0: Second Wish Granted!

By Tom Gianos and Amit Sharma @ Big Data Platform Team

A little over a year ago we announced Genie, a distributed job and resource management tool. Since then, Genie has operated in production at Netflix, servicing tens of thousands of ETL and analytics jobs daily. There were two main goals in the original design of Genie:

  • To abstract the execution environment away from Hadoop, Hive and Pig job submissions.
  • To enable horizontal scaling of client resources based on demand.

Since the development of Genie 1.0, much has changed in both the big data ecosystem and here at Netflix. Hadoop 2 was officially released, enabling clusters to use execution engines beyond traditional MapReduce. Newer tools, such as interactive query engines like Presto and Spark, are quickly gaining in popularity. Other emerging technologies like Mesos and Docker are changing how applications are managed and deployed. Some changes to our big data platform in the last year include:

  • Upgrading our Hadoop clusters to Hadoop 2.
  • Moving to Parquet as the primary storage format for our data warehouse.
  • Integrating Presto into our big data platform.
  • Developing, deploying and open sourcing Inviso, to help users and admins gain insights into job and cluster performance.

Amidst all this change, we reevaluated Genie to determine what was needed to meet our evolving needs. Genie 2.0 is the result of this work and it provides a more flexible, extensible and feature rich distributed configuration and job execution engine.

Reevaluating Genie 1.0

Genie 1.0 accomplished its original goals well, but the narrow scope of those goals led to limitations including:

  • It only worked with Hadoop 1.
  • It had a fixed data model designed for a very specific use case. Code changes were required to accomplish minor changes in behavior.
    • As an example, the s3CoreSiteXml, s3HdfsSiteXml fields of the ClusterConfigElement entity stored the paths to the core-site and hdfs-site XML files of a Hadoop cluster rather than storing them as a generic collection field.
  • The execution environment selection criteria were very limited. The only way to select a cluster was by setting one of three types of schedules: SLA, ad hoc or bonus.

Genie 1.0 could not continue to meet our needs as the number of desired use cases increased and we continued to adopt new technologies. Therefore, we decided to take this opportunity to redesign Genie.

Designing and Developing Genie 2.0

The goals for Genie 2.0 were relatively straightforward:

  • Develop a generic data model, which would let jobs run on any multi-tenant distributed processing cluster.
  • Implement a flexible cluster and command selection algorithm for running a job.
  • Provide richer API support.
  • Implement a more flexible, extensible and robust codebase.

Each of these goals is explored below.

The Data Model

The new data model consists of the following entities:

Cluster: It stores all the details of an execution cluster including connection information, properties, etc. Some cluster examples are Hadoop 2, Spark, Presto, etc. Every cluster can be linked to a set of commands that it can run.

Command: It encapsulates the configuration details of an executable that is invoked by Genie to submit jobs to the clusters. This includes the path to the executable, the environment variables, configuration files, etc. Some examples are Hive, Pig, Presto and Sqoop. If the executable is already installed on the Genie node, configuring a command is all that is required. If the executable isn’t installed, a command can be linked to an application in order to install it at runtime.

Application: It provides all the components required to install a command executable on Genie instances at runtime. This includes the location of the jars and binaries, additional configuration files, an environment setup file, etc. Internally we have our Presto client binary configured as an application. A more thorough explanation is provided in the “Our Current Deployment” section below.

Job: It contains all the details of a job request and execution including any command line arguments. Based on the request parameters, a cluster and command combination is selected for execution. Job requests can also supply necessary files to Genie either as attachments or via the file dependencies field, if they already exist in an accessible file system. As a job executes, its details are recorded in the job record.

All the above entities support a set of tags that can provide additional metadata. The tags are used for cluster and command resolution as described in the next section.

Job Execution Environment Selection

Genie now supports a highly flexible method to select the cluster to run a job on and the command to execute, collectively known as the execution environment. A job request specifies two sets of tags to Genie:

  • Command Tags: A set of tags that maps to zero or more commands.
  • Cluster Tags: A priority ordered list of sets of tags that maps to zero or more clusters.

Genie iterates through the cluster tags list, and attempts to use each set of tags in combination with the command tags to find a viable execution environment. The ordered list allows clients to specify fallback options for cluster selection if a given cluster is not available.

At Netflix, nightly ETL jobs leverage this feature. Two sets of cluster tags are specified for these jobs. The first set matches our bonus clusters, which are spun up every night to help with our ETL load. These clusters use some of our excess, pre-reserved capacity available during lower traffic hours for Netflix. The other set of tags match the production cluster and act as the fallback option. If the bonus clusters are out of service when the ETL jobs are submitted, the jobs are routed to the main production cluster by Genie.
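As an illustrative sketch (the field names are hypothetical, not the exact Genie API schema), such a job request would carry something like:
{
  "commandTags": ["type:hive", "version:0.13"],
  "clusterTags": [
    ["type:yarn", "sched:bonus"],
    ["type:yarn", "sched:sla"]
  ]
}
Genie would try the first set of cluster tags (matching the bonus clusters) and, if no such cluster is available, fall back to the second set (matching the main production cluster).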

Richer API Support

Genie 1.0 exposes a limited set of REST APIs. Any updates to the contents of the resources had to be done by sending requests, containing the entire object, to the Genie service. In contrast, Genie 2.0 supports fine grained APIs, including the ability to directly manipulate the collections that are part of the entities. For a complete list of available APIs, please see the Genie API documentation.
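For example, here is a hedged sketch of the kind of fine-grained call this enables (the endpoint path shown is illustrative, not the exact Genie route):
curl -X POST http://geniehost:7001/genie/v2/config/clusters/CLUSTER_ID/tags \
    -H 'Content-Type: application/json' \
    -d '["sched:bonus"]'
This adds a single tag to a cluster's tag collection, rather than re-sending the entire cluster object as Genie 1.0 required.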

Code Enhancements

An examination of the Genie 1.0 codebase revealed aspects that needed to be modified in order to provide the flexibility and standards compliance desired going forward.
Some of the goals to improve the Genie codebase were to:

  • Decouple the layers of the application to follow a more traditional three tiered model.
  • Remove unnecessary boilerplate code.
  • Standardize and extend REST APIs.
  • Improve deployment flexibility.
  • Improve test coverage.

Tools such as Spring, JPA 2.0, Jersey, JUnit, Mockito, Swagger, etc. were leveraged to solve most of the known issues and better position the software to handle new ones in the future.

Genie 2.0 was completely rewritten to take advantage of these frameworks and tools. Spring features such as dependency injection, JPA support, transactions, profiles and more are utilized to produce a more dynamic and robust architecture. In particular, dependency injection for various components allows Genie to be more easily modified and deployed both inside and outside Netflix. Swagger based annotations on top of the REST APIs provide not only improved documentation, but also a mechanism for generating clients in various languages. We used Swagger codegen to generate the core of our Python client, which has been uploaded to Pypi. Almost six hundred tests have also been added to the Genie code base, making the code more reliable and maintainable.

Our Current Deployment
Genie 2.0 has been deployed at Netflix for a couple of months, and all Genie 1.0 jobs have been migrated over. Genie currently provides access to all the Hadoop and Presto clusters in our production, test and ad hoc environments. In production, Genie autoscales between twelve and twenty i2.2xlarge AWS instances, allowing several hundred jobs to run at any given time. This provides horizontal scaling of clients for our clusters with no additional configuration or overhead.

Presto and Sqoop commands are each configured with a corresponding application that points to locations in S3, where all the jars and binaries necessary to execute these commands are located. Every time one of these commands run, the necessary files are downloaded and installed. This allows us to continuously deploy updates to our Presto and Sqoop clients without redeploying Genie. We’re planning to move our other commands, like Pig and Hive, to this pattern as well.

At Netflix launching a new cluster is done via a configuration based launch script. After a cluster is up in AWS, the cluster configuration is registered with Genie. Commands are then linked to the cluster based on predefined configurations. After it is properly configured in Genie, the cluster will be marked as “available”. When we need to take down a cluster, it is marked as “out of service” in Genie so the cluster can no longer accept new jobs. Once all running jobs are complete, the cluster is marked as “terminated” in Genie and instances are shut down in AWS.

Going live with Genie 2.0 has allowed us to bring together all the new tools and services we’ve added to the big data platform over the last year. We have already seen many benefits from Genie 2.0. We were able to add Presto support to Genie in a few days and Sqoop in less than an hour. These changes would have required code modification and redeployment with Genie 1.0, but were merely configuration changes in Genie 2.0.
Below is our new big data platform architecture with Genie at its core.



Future Work


There is always more to be done. Some enhancements that can be made going forward include:

  • Improving the job execution and monitoring components for better fault tolerance, efficiency on hosts and more granular status feedback.
  • Abstracting Genie’s use of Netflix OSS components to allow adopters to implement their own functionality for certain components to ease adoption.
  • Improving the admin UI to expose more data to users, e.g., showing all clusters a given command is registered with.

We’re always looking for feedback and input from the community on how to improve and evolve Genie. If you have questions or want to share your experience with running Genie in your environment, you can join our discussion forum. If you’re interested in helping out, you can visit our Github page to fork the project or request features.

Friday, October 31, 2014

Message Security Layer: A Modern Take on Securing Communication

Netflix serves audio and video to millions of devices and subscribers across the globe. Each device has its own unique hardware and software, and differing security properties and capabilities. The communication between these devices and our servers must be secured to protect both our subscribers and our service.
When we first launched the Netflix streaming service we used a combination of HTTPS and a homegrown security mechanism called NTBA to provide that security. However, over time this combination started exhibiting growing pains. With the advent of HTML5 and the Media Source Extensions and Encrypted Media Extensions we needed something new that would be compatible with that platform. We took this as an opportunity to address many of the shortcomings of the earlier technology. The Message Security Layer (MSL) was born from these dual concerns.

Problems with HTTPS

One of the largest problems with HTTPS is the PKI infrastructure. There were a number of short-lived incidents where a renewed server certificate caused outages. We had no good way of handling revocation: our attempts to leverage CRL and OCSP technologies resulted in a complex set of workarounds to deal with infrastructure downtimes and configuration mistakes, which ultimately led to a worse user experience and brittle security mechanism with little insight into errors. Recent security breaches at certificate authorities and the issuance of intermediate certificate authorities means placing trust in one actor requires placing trust in a whole chain of actors not necessarily deserving of trust.
Another significant issue with HTTPS is the requirement for accurate time. The X.509 certificates used by HTTPS contain two timestamps and if the validating software thinks the current time is outside that time window the connection is rejected. The vast majority of devices do not know the correct time and have no way of securely learning the correct time.
Being tied to SSL and TLS, HTTPS also suffers from fundamental security issues unknown at the time of their design. Examples include padding attacks and the use of MAC-then-Encrypt, which is less secure than Encrypt-then-MAC.
There are other less obvious issues with HTTPS. Establishing a connection requires extra network round trips and depending on the implementation may result in multiple requests to supporting infrastructure such as CRL distribution points and OCSP responders in order to validate a certificate chain. As we continually improved application responsiveness and playback startup time this overhead became significant, particularly in situations with less reliable network connectivity such as Wi-Fi or mobile networks.
Even ignoring these issues, integrating new features and behaviors into HTTPS would have been extremely difficult. The specification is fixed and mandates certain behaviors. Leveraging specific device security features would require hacking the SSL/TLS stack in unintended ways: imagine generating some form of client certificate that used a dynamically generated set of device credentials.

High-level Goals

Before starting to design MSL we had to identify its high-level goals. Other than general best practices when it comes to protocol design, the following objectives are particularly important given the scale of deployment, the fact it must run on multiple platforms, and the knowledge it will be used for future unknown use cases.
  • Cross-language. JavaScript in particular imposes constraints, such as its maximum integer value and the native functions available in web browsers.
  • Automatic error recovery. With millions of devices and subscribers we need devices that enter a bad state to be able to automatically recover without compromising security.
  • Performance. We do not want our application performance and responsiveness to be limited any more than it has to be. The network is by far the most expensive performance cost.
    Figure 1. HTTP vs. HTTPS Performance
  • Flexible and extensible. Whenever possible we want to take advantage of security features provided by devices and their software. Likewise if something no longer provides the security we need then there needs to be a migration path forward.
  • Standards compatible. Although related to being flexible and extensible, we paid particular attention to being standards compatible. Specifically we want to be able to leverage the Web Crypto API now available in the major web browsers.

Security Properties

MSL is a modern cryptographic protocol that takes into account the latest cryptography technologies and knowledge. It supports the following basic security properties.
  • Integrity protection. Messages in transit are protected from tampering.
  • Encryption. Message data is protected from inspection.
  • Authentication. Messages can be trusted to come from a specific device and user.
  • Non-replayable. Messages containing non-idempotent data can be marked non-replayable.
MSL supports two different deployment models, which we refer to as MSL network types. A single device may participate in multiple MSL networks simultaneously.
  • Trusted services network. This deployment consists of a single client device and multiple servers. The client authenticates against the servers. The servers have shared access to the same cryptographic secrets and therefore each server must trust all other servers.
  • Peer-to-peer. This is a typical p2p arrangement where each side of the communication is mutually authenticated.
Figure 2. MSL Networks


Protocol Overview

A typical MSL message consists of a header and one or more application payload chunks. Each chunk is individually protected which allows the sender and recipient to process application data as it is transmitted. A message stream may remain open indefinitely, allowing large time gaps between chunks if desired.
MSL has pluggable authentication and may leverage any number of device and user authentication types for the initial message. The initial message will provide authentication, integrity protection, and encryption if the device authentication type supports it. Future messages will make use of session keys established as a result of the initial communication.
If the recipient encounters an error when receiving a message it will respond with an error message. Error messages consist of a header that indicates the type of error that occurred. Upon receipt of the error message the original sender can attempt to recover and retransmit the original application data. For example, if the message recipient believes one side or the other is using incorrect session keys the error will indicate that new session keys should be negotiated from scratch. Or if the message recipient believes the device or user credentials are incorrect the error will request the sender re-authenticate using new credentials.
To minimize network round-trips MSL attempts to perform authentication, key negotiation, and renewal operations while it is also transmitting application data (Figure 3). As a result MSL does not impose any additional network round trips and only minimal data overhead.
Figure 3. MSL Communication w/Application Data
This may not always be possible, in which case an MSL handshake must first occur, after which sensitive data such as user credentials and application data may be transmitted (Figure 4).
Figure 4. MSL Handshake followed by Application Data
Once session keys have been established they may be reused for future communication. Session keys may also be persisted to allow reuse between application executions. In a trusted services network the session keys resulting from a key negotiation with one server can be used with all other servers.
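Conceptually (an illustrative layout, not the exact MSL wire format), the two message types described above look something like:
message        :=  header | payload chunk 1 | payload chunk 2 | ...
error message  :=  error header (error type, recovery action)
Because each payload chunk is protected individually, the recipient can start processing chunk 1 while later chunks are still in transit.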

Platform Integration

Whenever possible we would like to take advantage of the security features provided by a specific platform. Doing so often provides stronger security than is possible without leveraging those features.
Some devices may already contain cryptographic keys that can be used to authenticate and secure initial communication. Likewise some devices may have already authenticated the user and it is a better user experience if the user is not required to enter their email and password again.
MSL has a plug-in architecture that allows for the easy integration of different device and user authentication schemes, session key negotiation schemes, and cryptographic algorithms. This also means that the security of any MSL deployment heavily depends on the mechanisms and algorithms it is configured with.
The plug-in architecture also means new schemes and algorithms can be incorporated without requiring a protocol redesign.

Other Features

  • Time independence. MSL does not require time to be synchronized between communicating devices. It is possible certain authentication or key negotiation schemes may impose their own time requirements.
  • Service tokens. Service tokens are very similar to HTTP cookies: they allow applications to attach arbitrary data to messages. However service tokens can be cryptographically bound to a specific device and/or user, which prevents data from being migrated without authorization.

The Release

To learn more about MSL and find out how you can use it for your own applications visit the Message Security Layer repository on GitHub.
The protocol is fully documented and guides are provided to help you use MSL securely for your own applications. Java and JavaScript implementations of a MSL stack are available as well as some example applications. Both languages fully support trusted services and peer-to-peer operation as both client and server.

MSL Today and Tomorrow

With MSL we have eliminated many of the problems we faced with HTTPS and platform integration. Its flexible and extensible design means it will be able to adapt as Netflix expands and as the cryptographic landscape changes.
We are already using MSL on many different platforms including our HTML5 player, game consoles, and upcoming CE devices. MSL can be used just as effectively to secure internal communications. In the future we envision using MSL over Web Sockets to create long-lived secure communication channels between our clients and servers.
We take security seriously at Netflix and are always looking for the best to join our team. If you are also interested in attacking the challenges of the fastest-growing online streaming service in the world, check out our job listings.

Wesley Miaw & Mitch Zollinger
Security Engineering