LinkedIn Engineering

Measuring and Optimizing Performance of Single-Page Applications (SPA) Using RUM

Sreedhar Veeravalli — Thu, 02 Feb 2017 10:14:00 -0800

Introduction

Improving site speed is one of the major technology initiatives at LinkedIn because it is highly correlated with the engagement members have on our website. Real User Monitoring (RUM) is an approach where we use data from real users, instead of a synthetic lab environment, to measure performance, and this is the primary way we measure site speed at LinkedIn. In this blog post, we present how we measure performance of single-page applications at LinkedIn using RUM, and how we used RUM data to make our new LinkedIn web application faster by 20%.

Page Load Time (PLT) is the key metric that we measure for every page at LinkedIn in order to capture the user’s perception of when the page is ready. It is not easy to measure this metric uniformly across all pages because it is highly subjective, depending on the content of the page and on the end user’s perception. Speed Index is a good indicator for the user’s perception of when the page is rendered, and we use it for measuring performance in synthetic environments. But it is not possible to calculate this metric accurately using RUM. So, we needed to find a good proxy for PLT that could be easily measured using RUM. For traditional web applications, which are mostly server-side rendered, the window.onload() event is a reasonably good proxy for PLT. Most traditional RUM libraries use the Navigation Timing API to detect when the window.onload() event is fired and thereby measure PLT for the page. However, as we discuss below, we ran into several obstacles when trying to use window.onload() as a proxy for PLT for single-page applications.

Single-page applications

Single-page applications (SPA) are web applications built using a JavaScript MVC framework, like Ember, to deliver a rich, app-like experience. The HTML for these pages is mostly built on the client browser instead of the web server. Another thing to note is that any page/URL in these applications can be visited in two modes:

App launch: This occurs when the app is initially loaded by entering the URL in the browser or by clicking on an email link. “App launch” mode is typically slow, as the application (JS/CSS) needs to be downloaded and booted before doing the work to render the page.
Subsequent: This occurs when the app has already been loaded and the page is visited by clicking on a link within the app. “Subsequent” mode is typically fast because the application is already downloaded and booted, and we just need to fetch new data for the page and render it.

As LinkedIn web applications moved from traditional server-side rendered web pages to modern single-page applications, we faced many challenges in using the window.onload() event for measuring performance of these applications using RUM.

When a page is visited in “app launch” mode, the window.onload() event is fired too early relative to when the user sees meaningful content on the page, as illustrated in the image below. The window.onload() event is geared more towards resource download times than rendering times, and when most of the rendering happens on the client side, it does not represent PLT accurately.

When a page is visited in “subsequent” mode, the window.onload() event is not fired at all because there is no new HTML document being downloaded. We cannot track the performance of these page views if we rely on the window.onload() event.

RUM for single-page applications

We need to automatically detect the start of a “subsequent” visit and detect when the page has been rendered to measure the PLT for both “app launch” and “subsequent” navigations. When considering our options, we looked at the different ways to do so:

Using Resource Timing API to detect when an AJAX call has been made to identify the start of a “subsequent” visit.
Using MutationObserver to detect changes in DOM and detecting end of network activity using Resource Timing API.

These approaches were either unreliable or inaccurate for our use case. For example, we make AJAX requests to pre-fetch some data without the user initiating a “subsequent” visit. Also, there might be network activity happening that does not impact user experience. We want PLT to closely match when the user has seen content on the screen, instead of network activity.

We figured out that the most reliable way we could detect these events is by letting the application tell us when they’ve happened by using a simple API. Below is an example of how the application would use the API to mark these events.
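
A minimal sketch of what such an API could look like, modeled in Python for brevity rather than in browser JavaScript. The class name RumTracker, the methods mark_page_start and mark_page_rendered, and the injectable clock are all hypothetical names chosen for illustration; they are not the actual LinkedIn API.

```python
import time
from typing import Callable, Dict, List, Optional


class RumTracker:
    """Hypothetical API the application calls to mark page-view boundaries."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._page: Optional[str] = None
        self._start: Optional[float] = None
        self.page_views: List[Dict[str, object]] = []

    def mark_page_start(self, page: str) -> None:
        # Called by the application when a navigation begins.
        self._page = page
        self._start = self._clock()

    def mark_page_rendered(self) -> float:
        # Called by the application once meaningful content is on screen.
        if self._start is None:
            raise RuntimeError("mark_page_rendered() called before mark_page_start()")
        plt = self._clock() - self._start
        self.page_views.append({"page": self._page, "plt": plt})
        self._page, self._start = None, None
        return plt
```

A page would call mark_page_start("feed") when the navigation begins and mark_page_rendered() once its content is visible; the recorded difference is the PLT for that page view.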

The main problem with this approach is that each page would have to write the instrumentation code, which is very tedious. Thankfully, some SPA frameworks, like Ember, have rich lifecycle hooks that we can use to add this instrumentation automatically.

Below is an example of how we automate the instrumentation for most cases in our Ember addon.

We know the start of “app launch” navigation based on navigationStart from Navigation Timing API. We listen to the router’s willTransition event to detect the start of a “subsequent” navigation. We also listen to the router’s didTransition event and add work in the afterRender queue to notify that the page has been rendered for both “app launch” and “subsequent” navigations. We now have the start and end times of page renderings for both “app launch” and “subsequent” navigations and can therefore calculate PLT based off of these times.
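
The flow described above can be sketched as follows, again modeled in Python rather than as a real Ember addon. PltInstrumentation, on_will_transition, on_did_transition, and the schedule_after_render callback are hypothetical stand-ins for the router events and the afterRender queue. The sketch also assumes that a willTransition arriving while a page view is still in flight should not restart the timer.

```python
from typing import Callable, Dict, List, Optional


class PltInstrumentation:
    """Derives PLT from router lifecycle events instead of per-page code."""

    def __init__(self, navigation_start: float, clock: Callable[[], float]) -> None:
        self._clock = clock
        # The first page view starts at navigationStart ("app launch").
        self._start: Optional[float] = navigation_start
        self._mode = "app_launch"
        self.page_views: List[Dict[str, object]] = []

    def on_will_transition(self) -> None:
        # With no page view in flight, this marks the start of a "subsequent" navigation.
        if self._start is None:
            self._start = self._clock()
            self._mode = "subsequent"

    def on_did_transition(
        self,
        route: str,
        schedule_after_render: Callable[[Callable[[], None]], None],
    ) -> None:
        # Defer the end mark until the render work for this route has run.
        schedule_after_render(lambda: self._page_rendered(route))

    def _page_rendered(self, route: str) -> None:
        if self._start is None:
            return
        plt = self._clock() - self._start
        self.page_views.append({"route": route, "mode": self._mode, "plt": plt})
        self._start = None
```

Because the hooks live in the router, individual pages need no instrumentation code of their own in this sketch.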

With this approach, we have automated the measurement of PLT for many single-page applications within LinkedIn that are built on top of Ember, AngularJS, and Marionette frameworks.

Granular metrics for finding optimization opportunities

If we want to debug any performance issues, we need to collect more granular metrics to break down the PLT and understand where the bottlenecks are. Traditional RUM libraries would rely on the Navigation Timing API and the Resource Timing API to provide these granular metrics about the HTML and resource download times. But single-page applications spend a significant amount of time in the browser doing JavaScript execution after the main HTML/JS/CSS are downloaded. In order to identify the key milestones or phases during this JavaScript execution, we added granular metrics using the User Timing API. Once we added these granular metrics, we were able to construct the waterfall below, which helps us visualize bottlenecks easily.
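
To illustrate how timing marks turn into a per-phase breakdown, here is a small sketch in Python. The function phase_breakdown and the mark names used in the usage note are hypothetical, and the sketch assumes each mark names the phase that begins at its timestamp.

```python
from typing import Dict, List, Tuple


def phase_breakdown(marks: List[Tuple[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Return {phase: (duration, share_of_total)} for consecutive marks.

    Each mark names the phase that begins at that timestamp; the final mark
    only closes the last phase.
    """
    if len(marks) < 2:
        return {}
    ordered = sorted(marks, key=lambda m: m[1])
    total = ordered[-1][1] - ordered[0][1]
    breakdown = {}
    for (name, start), (_, end) in zip(ordered, ordered[1:]):
        duration = end - start
        breakdown[name] = (duration, duration / total if total else 0.0)
    return breakdown
```

With made-up marks such as [("boot", 0), ("transition", 500), ("render", 700), ("done", 1000)], the render phase comes out at 300 ms, or 30% of the total, which is the kind of figure a waterfall chart makes visible at a glance.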

Waterfall for a typical “app launch” page in a single-page application

When we launched our new LinkedIn web application built on Ember, we analyzed the RUM data to find optimization opportunities. Below are a couple of optimizations that we implemented based on RUM data.

Occlusion culling (lazy rendering)

When we looked at the granular metrics charts, we found that about 30% of the page load time was being spent in the “render” phase. The render phase is where we build the DOM for the components on the page after the data is available. We also noticed that the browser’s main thread is not yielded for a paint until the DOM is created for all the components on the page.

If we yield the browser main thread earlier, after the DOM is created for only components that are in the viewport, we can have a much faster user experience, as we do not need to render components below the viewport before the browser does a paint. We refer to this optimization technique, where we avoid/defer rendering of content outside the viewport, as “occlusion culling.”

At a high level, the major performance issue here is that all the components on the page have the same priority and all of them are rendered/painted at the same time. So we developed a solution where we can give different priorities to components based on when they need to be rendered. This is illustrated in the image below:

The first category is the blue-colored components, which are marked as “non occludable.” They have the highest priority and are always rendered on the page irrespective of the viewport height. We keep all the components that fit in the typical viewport in this category.

The second category is the green-colored components, which are marked as “occludable” but have lower priority and might be rendered incrementally in the next animation frame, depending on these two conditions:

If the viewport is taller than a typical viewport, these components will be rendered;
If the component is configured to pre-render for a smooth scrolling experience, it will be rendered even if it is outside the viewport.

The third category is the red-colored components, which are marked as “occludable” and are not rendered (culled) initially because they are outside the viewport. They will be rendered as we scroll and they enter into the viewport.

Using this approach, we incrementally render/paint components and can have a faster First Meaningful Paint of the content in the viewport. Additionally, since we avoid doing the work to render some components that are outside the viewport, we can also have a faster Time To Interactive. When we implemented this optimization for some pages, we noticed that the time spent in render phase improved by 50% at both the 50th and 90th percentile.
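
The three priority categories can be sketched as a simple classification step. This is an illustrative model in Python, not the Ember implementation: Component, render_plan, and the pre_render flag are hypothetical names, and treating the two conditions for the second category as alternatives is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Component:
    name: str
    top: int                  # vertical offset of the component's top edge, in px
    occludable: bool = True   # False means "non occludable": always rendered
    pre_render: bool = False  # opt in to rendering ahead of scrolling


def render_plan(components: List[Component], viewport_height: int) -> Dict[str, List[str]]:
    """Assign each component to first paint, the next animation frame, or culled."""
    plan: Dict[str, List[str]] = {"first_paint": [], "next_frame": [], "culled": []}
    for c in components:
        if not c.occludable:
            plan["first_paint"].append(c.name)
        elif c.pre_render or c.top < viewport_height:
            # Visible in a taller-than-typical viewport, or configured to pre-render.
            plan["next_frame"].append(c.name)
        else:
            # Outside the viewport: rendered later, when scrolled into view.
            plan["culled"].append(c.name)
    return plan
```

Only the first_paint group has to finish before the browser is given a chance to paint; the culled group costs nothing until the user scrolls.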

Lazy data fetching

After we improved the render phase, we noticed that 20% of the page load time was being spent in the “Transition” phase, according to the granular metrics charts. The transition phase is where we wait for the data to arrive, normalize the data, and then push it to Ember data store. This work is heavily dependent on the amount of data that needs to be processed. If we lazily fetch data that is not needed for First Meaningful Paint, we can reduce the amount of data that needs to be processed and have a faster First Meaningful Paint. So we split the data fetching process into two calls at a high level: one to fetch data needed for First Meaningful Paint and the other to fetch the remaining data needed for the page. We applied this optimization for some pages and noticed that the transition phase improved by up to 40%.
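
A minimal sketch of the two-call split, written synchronously in Python for clarity (a browser implementation would be asynchronous). The function load_page and its fetch and render parameters are hypothetical.

```python
from typing import Callable, Dict, List


def load_page(
    fetch: Callable[[List[str]], Dict[str, object]],
    render: Callable[[Dict[str, object]], None],
    critical: List[str],
    deferred: List[str],
) -> None:
    """Fetch and render in two steps so the first paint processes less data."""
    # First call: only the data needed for First Meaningful Paint.
    render(fetch(critical))
    # Second call: the remaining data needed for the page.
    if deferred:
        render(fetch(deferred))
```

The design choice is which resources go in the critical list; anything not visible in the initial viewport is a candidate for the deferred call.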

Another advantage of the optimizations described above is that they apply to both “app launch” and “subsequent” navigations because these optimizations are in the time spent on the client browser after the application has been downloaded and booted.

Lessons learned

These are just some of the optimizations which were implemented and validated using RUM data as we ramped our new LinkedIn web application. Some best practices and lessons learned as we went through this journey are:

Defer work not needed for First Meaningful Paint. This principle yielded us good results in many cases, including the optimizations described above. There were many other optimizations, like lazily loading application code (JS/CSS) not needed for First Meaningful Paint, that also fall under this category.
Analyze RUM data thoroughly in addition to performance data obtained from synthetic and development environments. There are a wide variety of devices, networks, and users in the real world, and it is not easy to simulate all of them in a synthetic environment. In many cases, the gains from optimizations in the real world were significantly higher than what we observed in a synthetic environment.
Always do an A/B experiment to accurately measure the gains from a given optimization. Since we are running many optimization experiments in parallel, we cannot attribute the gains accurately to a specific experiment without putting the optimization behind an A/B experiment. This also helps us estimate how much gain the same optimization would give on other pages and applications. It is also a good idea to continue to do these experiments periodically to see how they are performing, as bottlenecks shift over time.
To iterate quickly on new ideas, we should validate and refine them in a synthetic performance testing environment. We have a synthetic performance testing framework where we can do many runs of a test in an isolated environment to validate and refine the idea before pushing out to production. This framework has been very useful for iterating quickly on new ideas.
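
As a rough illustration of the A/B lesson above, the gain from an optimization can be expressed as the relative PLT reduction of the treatment group against the control group at the 50th and 90th percentiles. The sketch below uses the nearest-rank percentile definition; the function names are hypothetical and any sample data fed to it here is made up, not LinkedIn data.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def improvement(
    control: Sequence[float],
    treatment: Sequence[float],
    pcts: Sequence[int] = (50, 90),
) -> Dict[int, float]:
    """Relative PLT reduction of treatment versus control at each percentile."""
    result = {}
    for p in pcts:
        base = percentile(control, p)
        result[p] = (base - percentile(treatment, p)) / base
    return result
```

Comparing percentiles rather than means keeps slow outliers from hiding or exaggerating the effect on typical users.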

Acknowledgements

We would like to thank David He, Ruixuan Hou, and Krati Ahuja for helping in the design and implementation of our RUM solution for single-page applications. We’d also like to thank Chris Thoburn, Chad Hietala, and Jeba Singh Emmanuel who helped in implementing the optimizations described in this blog. And finally, thanks to Ritesh Maheshwari for his input and feedback while writing this blog post.

Getting to Know Russ White

Zaid Ali Kahn — Fri, 27 Jan 2017 08:03:00 -0800

LinkedIn wouldn't be the company it is today without the engineers who built it. We have no shortage of talented individuals in technical roles across the company. They are the ones who create, build, and maintain our platform, tools, and features—as well as write posts for this blog. In this series, we feature some of the people and personalities that make LinkedIn great.

Russ White is on LinkedIn’s Infrastructure Engineering team, working on next-generation network design and architecture. He has worked in networking since the late 1980s, and has a long history contributing knowledge back to the networking community. A short list of his contributions to the field includes being a published author, having published several papers in the Internet Protocol Journal, and serving in the past as an organizational council co-chair for the Internet Society. He is also a member of the IETF Routing Area Directorate.

Before joining LinkedIn in August 2015, Russ worked in networking at several major technology companies, including Ericsson, Verisign, and Cisco. He holds a master’s in information technology and network design, a master’s in biblical literature and theology, and is currently completing a doctorate in philosophy (apologetics and culture). He also maintains a personal blog, 'net Work.

What are some of the coolest projects that you and your team have been working on?

Right now, we’re working on building a control plane for Project Falco. It’s exciting to think about how to build a control plane that interacts with LinkedIn’s business specifically, rather than simply buying vendor-driven architectures and trying to shape them to fit our needs. At some point, it would be great to be able to give part of the control plane we’re building to the community so that it could have a positive impact on the way companies build their networks.

A second project I’m working on is improving our edge security. Edge security is always fun because of the community aspect, and the blend of security and routing.

What other projects are you involved in outside of working on LinkedIn’s network?

I’ve been involved with the Internet Engineering Task Force (IETF) for about 20 years now. My involvement in the IETF currently includes serving on the Routing Area Directorate, where I review drafts and act as a general “helping hand” for the Routing Area Directors wherever it is needed. I have, in the past, served as a co-chair for working groups in the area of routing protocols security, and participated in the writing and editing of a number of internet standards. Right now I’m also co-chairing two working groups, one for a routing protocol called BABEL, and another for I2RS, the Interface to the Routing Systems. I also serve as a liaison between LinkedIn and the Internet Society (ISOC), which is the governing body for IETF and the Internet Research Task Force (IRTF), and on the technical advisory board at SDxE, a technical show focused on the software-defined enterprise. I am also active, where possible, in a number of user groups, such as NANOG and LACNOG.

In addition, I’m currently working on a book with Pearson about networking basics. Not another book on network basics, right? The idea here is completely unique in the introductory networking space, and deeply technical. This will be my eleventh book on technology published by Pearson, and will be targeted at the collegiate market.

Compared to other places you've worked, how do you like working at LinkedIn?

I’ve worked on a lot of enterprise networks, even at a large scale, and at a number of large vendors. What’s interesting to me about working at LinkedIn is the focus on hyperscale and moving towards engineering our own network solutions rather than only considering vendor options. We’re not just going out and buying a vendor solution; we’re thinking about what we need specifically and then building something that fits our environment.

What are your favorite things to do when you’re not at the office?

I’m currently pursuing a doctorate in apologetics and culture at Southeastern Baptist Theological Seminary. This degree falls under the philosophy department, so I’m reading a lot of works on ontology, epistemology, and the mind/body problem. My specific area of work is around technology and the human person, particularly in the areas of privacy and various forms of social engineering. I’m right at the intersection of philosophy, culture, technology, and Christian belief. I also teach classes for local homeschoolers, in addition to writing the technical books mentioned above. It can be tough to balance everything, but it helps that I don’t watch television, nor am I a heavy user of social media. I don’t really consider LinkedIn to be social media so much as professional media, so LinkedIn is the one exception.

ODP: An Infrastructure for On-Demand Service Profiling

Tao Feng — Tue, 24 Jan 2017 10:23:00 -0800

Coauthors: Tao Feng, John Nicol, Chen Li, Peinan Chen, Hari Ramachandra

LinkedIn has built hundreds of application services, with thousands of instances running in data centers. Optimizing the performance of these services can dramatically improve user experience and reduce operational costs, and profilers are commonly used to help achieve this. LinkedIn’s On-Demand Profiling infrastructure (“ODP”) is one method we use to identify these optimizations.

Introduction

Profiling is a useful method to improve the performance of services. However, the tooling solutions for profiling don’t have fixed standards, are often decentralized, can be costly, and for a company with a large server footprint such as LinkedIn, are inconvenient to use at best.

For example, unless a profiler is supported internally, users may need to configure the tool, acquire licenses, and request installation on remote hosts before profiling. In addition, viewing the profiled results often requires manual data transfer or setting up a tunnel from the production environment to the development environment. Lastly, comparing historical data is difficult or even impossible, especially when profiling runs are captured by different users or different profiling tools.

ODP is our tooling infrastructure to address these pain points. It allows users to debug service performance issues with little manual effort. It also centralizes profiling data so the data can be shared, archived, and compared with other profiling events; this data sharing also allows known issues to be automatically identified. Moreover, this profiling can be scaled for thousands of services across LinkedIn’s data centers. Additionally, this is a plugin-based infrastructure, which can be extended to include memory allocation, thread status, profilers for other languages, and more.

The generality of this approach is useful, but we’ve found few profiling tools that can effectively be used with it so far. For now, we’ve developed our own JVM CPU sampling profiler and are investigating profilers for other languages. These profilers are secondary to the tooling infrastructure and may be replaced with future industry standards, but for now, they have proven themselves to be quite effective when used with the overall framework.

In this post, we describe the overall architecture of ODP and how ODP helps find performance issues with LinkedIn services.

On-demand profiler architecture

The following diagram shows the overall architecture of ODP.

At a high level, here’s what happens:

A user or a scheduled job requests a service be profiled on a specific host.
This request is passed to a REST-based API server. That server deploys the profiler if necessary (if another service on the host has used the profiler already, we don’t need to re-deploy the profiler but can instead reuse that same profiler) and then signals the profiler to attach to the specified service.
The profiler sends its data through a scalable pipeline. After post-processing, it can be publicly viewed on a web-based GUI.

Profiling requests
Profiling requests can come from both users and approved services (e.g., automated testing).

In addition to profiling a service on demand, users can also schedule profiling during regular events, such as traffic shifts. For flexibility, the framework supports these requests coming in from anywhere—as long as they’re authenticated.

REST-based API server
Our REST-based API server serves the scheduled and unscheduled start/stop profiling requests. The server checks whether the profiler is already deployed, and if not, deploys it. Then it tells the profiler to attach to the specified service via a Kafka message.
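
The deploy-if-needed-then-attach logic can be sketched as follows. ProfilingApiServer, its deploy and publish callbacks, and the message fields are hypothetical; in the post the control message travels over Kafka, which the sketch replaces with a plain callback.

```python
from typing import Callable, Dict, Set


class ProfilingApiServer:
    """Sketch of the start-profiling path: deploy once per host, then signal attach."""

    def __init__(
        self,
        deploy: Callable[[str], None],
        publish: Callable[[Dict[str, str]], None],
    ) -> None:
        self._deploy = deploy      # installs the profiler on a host
        self._publish = publish    # sends a control message (Kafka in the post)
        self._deployed_hosts: Set[str] = set()

    def start_profiling(self, host: str, service: str) -> None:
        # Reuse an already deployed profiler instead of deploying again.
        if host not in self._deployed_hosts:
            self._deploy(host)
            self._deployed_hosts.add(host)
        self._publish({"action": "attach", "host": host, "service": service})
```

Profiling a second service on the same host therefore costs only one more control message.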

JVM profiler
The current profiler we have is a sampling profiler based upon JMX. It can connect to and profile any JVM-based application on the same host with no disruption.

The general workflow of the profiler is to collect stack traces and elapsed CPU time for each Java thread via MXBeans at regular intervals, and then to post-process the data in a separate thread. The data is aggregated and periodically sent to Kafka.
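
The aggregation step can be illustrated with a short sketch in Python (the real profiler runs on the JVM). It assumes each sample is a pair of a root-to-leaf frame list and the CPU time attributed to that sample, and it uses the semicolon-joined "folded" stack form common in flamegraph tooling; whether ODP uses that exact representation is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def aggregate(samples: Iterable[Tuple[List[str], int]]) -> Tuple[Counter, Counter]:
    """Fold identical stacks together, tracking sample counts and CPU time."""
    counts: Counter = Counter()
    cpu_ms: Counter = Counter()
    for frames, sample_cpu_ms in samples:
        key = ";".join(frames)  # e.g. "main;handle;parseJson"
        counts[key] += 1
        cpu_ms[key] += sample_cpu_ms
    return counts, cpu_ms
```

Aggregating before sending keeps the message volume proportional to the number of distinct stacks rather than the number of samples.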

Samza
Samza is the scalable streaming platform used within LinkedIn. Even if multiple profilers send data simultaneously to the Kafka topic, Samza can catch up with the produced messages and push them to our remote data store. Samza also provides the benefit of no data loss. We use Samza to pull the profiling data from Kafka and push it to a MySQL database.

GUI/web application
The profiling data is visualized through a web application built with Ember.js, Flask, and flamegraph technology. Flamegraph originates from Brendan Gregg at Netflix; it’s a visualization technique for CPU stack traces. The flamegraph is rendered as interactive SVG, which allows the user to zoom in and zoom out of stack traces easily. The way to read a flamegraph is to read each cell in a given layer as a method call, and the cells in the layer above it as its child method calls. Thus, the highest cells are the deepest method calls.

Performance debugging portal

For each profiling request, the user gets a unique page with:

Information about the profiled service, plus optional comments;
Different display modes, such as sample counts or CPU time;
Widgets to help the user debug performance issues, including top hot leaf methods, leaf-first view, highlight, filter, known issues, thread status, and others;
A flamegraph for the sampled stack traces.

Highlight or filter

The stack traces can be overwhelming to the user. We need to make them manageable, and allow the user to focus on specific sections. To do this, we introduced highlighting and filtering functions. The user can highlight or filter any searchable string (for example, a method name, package name, line number, or a regex combination of those). The flamegraph will re-render and only list the stack traces containing the string in a method name and highlight the specific matches.
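
A sketch of the filtering step, assuming stacks are stored as semicolon-joined frame strings mapped to sample counts. The function filter_stacks is hypothetical; it keeps only stacks with at least one frame matching the pattern and reports which frames to highlight.

```python
import re
from typing import Dict, List, Tuple


def filter_stacks(folded: Dict[str, int], pattern: str) -> Dict[str, Tuple[int, List[str]]]:
    """Return {stack: (count, matching_frames)} for stacks containing a match."""
    regex = re.compile(pattern)
    result = {}
    for stack, count in folded.items():
        matches = [frame for frame in stack.split(";") if regex.search(frame)]
        if matches:
            result[stack] = (count, matches)
    return result
```

The returned frame list is what a renderer would highlight, while the dictionary keys determine which stacks are drawn at all.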

Leaf-first mode

The above diagram shows the leaf-first mode, which reverses the stack traces (callee listed at bottom, caller at top). This helps developers spot the code paths where the majority of time is spent. For example, this service spends a significant portion of time in the leftmost stack traces; this may indicate a bottleneck.

Top hot leaf methods

The GUI lists the top hot leaf methods for the profiling event shown above.

If the user is interested in a certain leaf method, clicking the method name will automatically filter the flamegraph down to the stack traces pertinent to that method, as shown above.
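
Both the leaf-first view and the hot leaf list can be derived from the same aggregated data. The sketch below assumes semicolon-joined root-to-leaf stacks mapped to sample counts; leaf_first and top_hot_leaves are hypothetical names.

```python
from collections import Counter
from typing import Dict, List, Tuple


def leaf_first(folded: Dict[str, int]) -> Dict[str, int]:
    """Reverse every stack so the callee comes first and the caller last."""
    return {";".join(reversed(stack.split(";"))): count for stack, count in folded.items()}


def top_hot_leaves(folded: Dict[str, int], n: int = 3) -> List[Tuple[str, int]]:
    """Sum samples by the leaf (deepest) frame and return the n hottest."""
    leaves: Counter = Counter()
    for stack, count in folded.items():
        leaves[stack.rsplit(";", 1)[-1]] += count
    return leaves.most_common(n)
```

Grouping by leaf surfaces a method that is expensive in aggregate even when it is reached through many different callers.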

Thread states visualization

We provide a visualization of thread states at each sampling point. Users can easily find if there’s any thread contention issue and see which method is blocking.

Comparison

Users can also compare profiles. In the figure above, the red color shows an increase of CPU samples compared to the baseline, while the blue color shows a decrease.
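
One way to compute such a comparison is to diff each stack's share of the total samples, so that runs of different lengths remain comparable. Normalizing by the total is an assumption of this sketch, as are the names compare_profiles and color and the "neutral" label for unchanged stacks.

```python
from typing import Dict


def compare_profiles(baseline: Dict[str, int], current: Dict[str, int]) -> Dict[str, float]:
    """Change in each stack's share of samples (current minus baseline)."""
    base_total = sum(baseline.values()) or 1
    cur_total = sum(current.values()) or 1
    deltas = {}
    for stack in set(baseline) | set(current):
        deltas[stack] = current.get(stack, 0) / cur_total - baseline.get(stack, 0) / base_total
    return deltas


def color(delta: float) -> str:
    # Colors as described in the post: red for an increase, blue for a decrease.
    if delta > 0:
        return "red"
    if delta < 0:
        return "blue"
    return "neutral"
```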

Automatic integration with performance test frameworks
The API server provides endpoints to allow trusted services to start/stop profiling events automatically. Some services have test frameworks to catch performance regression issues; we’ve integrated with one such internal system already. The internal framework makes profiling requests to ODP during its performance test runs, providing profiling data that helps find performance regressions.

Performance improvements

In the last few months, many performance improvements and fixes have been made across the LinkedIn stack through usage of ODP. Detecting the bad apples, especially in commonly-used library code, has been a huge win through the profiler, leading to reduced latency and/or CPU usage in many services. Although the issues encountered are varied, some common patterns have emerged:

Exception handling:
JVM exception handling can be slow (orders of magnitude slower on occasion)—but in other cases, the effect is negligible.

Reflection:
Using Reflection in the JVM can be slow. This slowness can show up in surprising places; even getting a class name can have an effect.

Logging:
Logging is a very common event in services, and it’s expected to be cheap. However, old logging frameworks, short-lived logger objects, and function evaluations during logging have all been found to sometimes have performance effects.
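
The "function evaluations during logging" pattern is not specific to the JVM. The sketch below shows the same effect with Python's standard logging module, as an analogy rather than the Java code the post refers to: an argument expression is evaluated before the logging call runs, even when that log level is disabled. The logger name and the expensive_summary function are made up for the example.

```python
import logging

logger = logging.getLogger("odp_logging_sketch")
logger.setLevel(logging.INFO)  # DEBUG messages are disabled

calls = {"count": 0}


def expensive_summary() -> str:
    # Stand-in for a costly computation, such as serializing a large object.
    calls["count"] += 1
    return "large state dump"


def log_eagerly() -> None:
    # The argument is evaluated before debug() runs, although DEBUG is disabled.
    logger.debug("state: %s", expensive_summary())


def log_guarded() -> None:
    # Checking the level first skips the expensive call entirely.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("state: %s", expensive_summary())
```

In a hot code path, the eager form pays the full cost of expensive_summary on every call while producing no log output.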

Summary

In this post, we’ve given an overview of ODP, our infrastructure for on-demand service profiling. We’ve described its architecture, its features, and some of the wins that LinkedIn has already experienced through the use of this framework. We hope that you see the benefit of such a framework, and can apply something similar for your own systems.

Acknowledgments

The development and use of ODP at LinkedIn has been a significant cross-team effort. We wish to thank Brandon Duncan, Josh Hartman, Haiying Wang, Kumar Pasumarthy and Jason Johnson (and their respective teams)... and of course, all the users of the framework.

Great Tools for Engineers: Refactoring Across Multiple Code Bases with Gradle and IntelliJ IDEA

Szczepan Faber — Fri, 20 Jan 2017 09:03:00 -0800

LinkedIn engineers require tooling that scales really well, and we never stop improving it. Even at a smaller scale, providing great tools for engineers is key to winning business and retaining top talent. This post is about working with code that lives in many separate code repositories, while still being productive and efficient in the process!

Repository and SCM agnostic development at LinkedIn

At LinkedIn, we want to create a development environment where the underlying SCM (source control management) system becomes an implementation detail. We built an abstraction layer on top of the source code repositories called “multiproduct.” Think of a multiproduct layer as a code base that is a unit of building, testing, and releasing some software, such as a web app, a service, or a set of software libraries. Multiproducts at LinkedIn can be checked out, and a new feature can be implemented and submitted for review, without interacting directly with the SCM system. A software project, wrapped with the multiproduct abstraction, can have the source code stashed in multiple separate code repositories. At the moment, this is a standard scenario for certain kinds of projects at LinkedIn, like services that keep configs in a separate repository or open source wrapper projects that keep public code in GitHub. The multiproduct abstraction does not completely override interactions with the SCM system. On a daily basis, engineers rebase their branches, work with revision history, and interact with the SCM system in other ways. More about the multiproduct concept will be covered in future articles. Jens Pillgram-Larsen’s post “Find the Seams” is a great introduction.

LinkedIn's multiproduct abstraction presents a boundary between code bases. The challenge emerges when a code change needs to span this boundary. For example, when a common library developed in a separate multiproduct changes its API, clients that use it (have a binary dependency on the library) need to be updated. So, there must be a change in the library code (multiproduct A) and in the client code (multiproduct B). This generates more overhead: one change in the library requires the new version to be published, and then another change is needed in the client code so that it uses the new version of the library. The result is more independent changes, manual ordering work, more CI builds, and binary version management.

There are, however, benefits of the boundary between multiproducts. Working with software components integrated at a binary level via dependency management enables teams to move faster by making it possible to ship ambitious, incompatible changes without the need to update all consumers at once. In multiproduct, thoughtful design of the public API is a necessary practice, and this positively affects the overall architecture. Each team needs to continually care about dependency management to avoid having a complicated dependency graph, unnecessary dependencies, and dependency cycles. It is more work at first, but it pushes the teams harder to keep the architecture clean and healthy.

LinkedIn engineers collaborating on a project at one of our regular developer bootcamps

Open source partnerships for a better build experience

At LinkedIn, we want it all: a clean architecture of software components integrated at the binary level and the convenience of cross-multiproduct refactoring. In order to deliver this capability, we sponsored a new feature in the Gradle build system called Composite Builds. In addition, the LinkedIn Development Tools team updated our custom IntelliJ IDEA plugin to take advantage of Gradle Composite Builds and deliver a great developer experience.

Gradle is a robust and extensible open source build platform. At LinkedIn, Gradle serves as the core of the build automation framework for all software stacks we support: JVM, Android, iOS, C++, JavaScript/Web, Python, and more. We recently open sourced support for Python with Gradle—try it out and enjoy Gradle’s amazing dependency management in the Python world! Driving automation of major technology stacks at LinkedIn using the same build tool gives engineers a consistent developer experience. It also reduces the maintenance cost of the build infrastructure by reusing our custom LinkedIn Gradle plugins.

At LinkedIn, we leverage our open source partnership with Gradle Inc. to sponsor new features in Gradle. The partnership gives LinkedIn an opportunity to further enhance our internal build automation experience while adding value to the entire Gradle community. The new features become part of the standard Gradle distribution, free and open sourced for the community to use.

The Gradle Composite Builds feature “connects” separate Gradle builds into one single, cohesive build. This means that we can treat separate multiproducts as a single multiproduct, built with a single Gradle invocation. A critical feature of Composite Builds is the ability to integrate with IntelliJ IDEA—a very popular IDE. We can import separate multiproducts into a single IntelliJ IDEA window. This allows cross-repository IDE experience: very convenient code navigation and cross-multiproduct refactoring.

To maximize the developer experience at LinkedIn, we have added support for cross-multiproduct development into the LinkedIn IntelliJ IDEA plugin. An IDE is the place where engineers spend most of their time, so adding features into the IDE has huge opportunities to boost engineering productivity at a large organization. We recognize this at LinkedIn, so custom IDE integration via plugins is a key component of our engineering tooling roadmap. If you want to see a demo of how cross-multiproduct refactoring is implemented at LinkedIn, check out the first seven minutes of this video from the Gradle meetup LinkedIn hosted in November 2016.

We believe Gradle Composite Builds is an example of a disruptive change LinkedIn can make by partnering with open source companies to improve developer productivity. The added benefit here is not only for those of us at LinkedIn, but also for all organizations that use Gradle. Never before has it been so easy to set up the IDE to work with code from separate code bases!

Open Sourcing Bluepill: Run iOS Tests in Multiple Simulators

Keqiu Hu — Wed, 18 Jan 2017 07:41:00 -0800

Testing is a key component of LinkedIn’s 3x3 strategy. As we continue improving our iOS continuous delivery pipeline, we are faced with two major obstacles—tooling stability and scalability. We needed a tool to run iOS UI tests both reliably and quickly. For this reason, we created a project, called Bluepill, that we are open sourcing today. Bluepill is a reliable iOS testing tool that runs UI tests using multiple simulators on a single machine. Bluepill has saved LinkedIn thousands of developer hours, and we believe it can also provide a great benefit to anyone running iOS UI tests at scale.

Existing Limitations

There are two major limitations with the standard, out-of-the-box iOS tooling: stability and scalability.

Stability
As with other companies doing iOS development and testing at scale, we faced many challenges dealing with iOS simulator stability. In a blog post last year, we elaborated on how we dealt with simulator flakiness by experimenting with different environment configurations in order to find an optimal workaround. However, since iOS Simulator is a black box and keeps evolving with every Xcode update, we were always chasing a moving target in terms of stability. We gradually came to accept that no matter how robust or resilient it appeared to be in any given version, we were still fragile and unprepared for any volatility from future changes in iOS Simulator.

Scalability
Xcode only supports one simulator at a time. Therefore, tests must be run sequentially. In the case of the LinkedIn app, which has around 2,000 UI tests, this would take about 15 hours. In order to achieve a commit-to-publish time under three hours, we must run the tests in parallel.

Existing solutions

As we looked to solve these problems, we found two existing solutions, neither of which ultimately met our needs adequately.

1. Distributed testing
The first stab we took to scale iOS UI testing was to divide the test target into a subset of targets and then distribute them among different machines. As mentioned in the post iOS Build Speed and Stability, we rolled out distributed building and testing support for our products. However, there were two problems with this approach:

Tooling stability: The tooling stability on our Mac machine pool is around 98%. If we distribute the test target between 10 machines, a build can only pass when all 10 child jobs succeed. With this approach, the tooling flakiness is exacerbated, since the probability that every node succeeds decays exponentially with the number of nodes. With 10 nodes running tests, the tooling reliability drops to 0.98^10 ≈ 82%.
Capacity requirement: The second problem with hardware parallelization is capacity. At peak time (e.g., during lunch and dinner rushes), we have around 80 concurrent continuous integration jobs. Running them with hardware parallelization would require 80*10 = 800 machines. When we approached the problem, we only had a fourth of that capacity available. The excess jobs had to be queued, and developers’ commits would have to stay in the queue for hours before being tested.
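The arithmetic behind both problems can be checked with a short sketch. The function names are ours, chosen for illustration, and the model assumes that each child job fails independently with the same probability.

```python
def build_pass_probability(node_reliability: float, nodes: int) -> float:
    """Probability that every child job passes, assuming independent failures."""
    return node_reliability ** nodes


def machines_required(concurrent_jobs: int, nodes_per_job: int) -> int:
    """Machines needed if every CI job fans out to nodes_per_job machines."""
    return concurrent_jobs * nodes_per_job


if __name__ == "__main__":
    # 98% per-node reliability across 10 nodes, as in the text.
    print(f"{build_pass_probability(0.98, 10):.1%}")
    # 80 concurrent jobs, each split across 10 machines.
    print(machines_required(80, 10))
```

Under these assumptions, roughly one build in five fails for tooling reasons alone, before any real test failure is counted.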

2. Project Hydra: a Python wrapper to run tests in multiple simulators
The first stab we took to address the distributed testing problem was an initiative to run tests in multiple simulators. Inspired by Facebook’s xctool and a proof of concept of running iOS tests on multiple simulators from Johannes, we built a Python wrapper on top of xctool to run iOS tests on multiple simulators. This approach helped stabilize our continuous delivery environment. However, we found several problems with this approach:

It was based on xctool: xctool is a great tool to make it easy to test iOS products. However, active development on xctool was stopped, and the project owners no longer maintain it. This left us with only two choices: either fork and refactor xctool, or build our own testing tool. After some investigation, we found it was easier to build a simpler tool to focus on running tests in multiple simulators.
It was a mere Python wrapper of xctool binary: The Python wrapper was built on top of the xctool binary and it didn’t have access to the CoreSimulator APIs. Managing simulators was difficult, since we couldn’t talk to the simulators directly.

Introducing Bluepill

As existing solutions couldn’t satisfy our requirements, we decided to build our own iOS UI test runner to execute tests in multiple simulators. The tool is written in Objective-C, and is built on top of Apple’s CoreSimulator framework.

The project name Bluepill is inspired by The Matrix's “blue pill,” which represents an illusion. Bluepill creates an illusion that tooling just works magically, so that engineers can focus on coding.

Bluepill runs tests in parallel using multiple simulators. The main features supported are:

Running tests in parallel by using multiple simulators.
Automatically packing tests into groups with similar running time.
Running tests in headless mode to reduce memory consumption.
Generating a junit report after each test run.
Reporting test running stats, including test running speed and environment robustness.
Retrying when the Simulator hangs or crashes.
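To give a feel for what packing tests into groups with similar running time can look like, here is a minimal sketch of a greedy longest-first heuristic. This is an assumption made for illustration, not Bluepill's actual packing algorithm, and the test names and durations are hypothetical.

```python
import heapq
from typing import Dict, List


def pack_tests(durations: Dict[str, float], simulators: int) -> List[List[str]]:
    """Greedy longest-first packing: put each test on the least-loaded simulator."""
    if simulators < 1:
        raise ValueError("need at least one simulator")
    # Heap entries are (total_seconds, simulator_index), so the lightest group pops first.
    heap = [(0.0, i) for i in range(simulators)]
    heapq.heapify(heap)
    groups: List[List[str]] = [[] for _ in range(simulators)]
    # Place the longest tests first so short tests can even out the groups at the end.
    for name, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    return groups


if __name__ == "__main__":
    # Hypothetical per-test running times in seconds.
    history = {"testA": 50, "testB": 40, "testC": 30, "testD": 20, "testE": 10}
    for group in pack_tests(history, simulators=2):
        print(group, sum(history[t] for t in group))
```

The heuristic needs an estimate of each test's running time, for example from a previous run. It does not guarantee an optimal split, but it keeps the slowest simulator from dominating the wall-clock time.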

Here’s a demo of Bluepill in action:

Using Bluepill

It is quick and easy to start using Bluepill! In a simplified scenario, you just need to run the following command, and Bluepill will kick off four simulators to run your tests in parallel. By the end of the test run, it will generate a report in ./output.

./bluepill -a ./Sample.app -s ./SampleAppTestScheme.xcscheme -o ./output/

Alternatively, you can have a configuration file like the one below:

And run ./bluepill -c config.json

A full list of supported options is available here.

Open sourcing Bluepill

We’re more than happy to announce that Bluepill has been open sourced under the BSD 2-Clause license and that the code is available on GitHub. Contributions and suggestions are welcome!

Acknowledgements

Bluepill was created by Ashit Gandhi, Jarek Rudzinski, Keqiu Hu, and Oscar Bonilla. It was inspired by parallel iOS test and Facebook’s xctool. The Bluepill icon was created by Maria Iu.

Failure is Not an Option

Benjamin Purgason — Mon, 16 Jan 2017 08:19:00 -0800

This is the final post of the series “Every Day Is Monday in Operations.” Throughout this series we’ve discussed our challenges, shared our war stories, and walked through the learnings we’ve gained as Operations leaders. You can read the introduction and find links to the rest of the series here.

If Operations fails, so does your company—it is, as you might say, mission-critical. The delicate trust a company holds with its customers can be shattered by a single sustained outage. Just look at one outage from 2012 which cost millions of dollars per minute while a bug was in production. When a major incident occurs, no customer cares about who was at fault, the extenuating circumstances, or how you’re going to do better next time. They leave, wondering why they trusted your company in the first place. In Operations, failure is not an option.

As we’ve retold our war stories over the course of this series, we can’t help but be reminded of this heated exchange from the movie “Apollo 13.” It’s a variant on a conversation we found ourselves having during many of our own major incidents: it doesn’t matter if it looks impossible; we must find a way to succeed.

David and I believe that the incredible combination of expectations, difficulty, and risk associated with each outage warrants the guidance we’ve put forward in this series. We’ve learned these lessons the hard way over the years and hope that by sharing them, readers can gain the benefits without the pain.

This, then, is the greatest gift we can give back to our colleagues: our experiences, the lessons learned, and the axioms we’ve developed from our 45 combined years of experience as Operations leaders.

Implementing a culture of reliability

When looking to apply these lessons yourself, there is some good news: you can start anywhere. For the most part, each of the axioms stands on its own, without relying on the others for merit. That said, I highly recommend starting with what gets measured gets fixed. Once you understand the problems, you can begin fixing them.

Whether you are the senior member of your team or new to the industry, these axioms will serve you well. Your tenure, experience, or years of service do not matter. By applying these 10 lessons, teams can create a culture of reliability backed by engineers who can think on their feet, are empowered to make the difficult calls, and can scale their group farther than you can imagine.

Conclusion

Understanding the 10 axioms we’ve presented is simple; implementing them can be a bit more challenging. This isn’t rocket science, but every day is Monday in Operations, and that means you have to overcome constant changes. You are only as good as your lieutenants, and to make site up your highest priority, you’ll need their help. Also, don’t assume that everyone is on the same page when you start to implement these lessons—you need to communicate, communicate, communicate about the goals and processes you want to use to achieve them.

Operations is a team sport, and if you are at a loss for where to start, remember: what gets measured gets fixed. If you can only measure two things, remember that MTTD and MTTR are key. To improve how you handle the site incidents you do discover after measuring, you must attack the problem, not the person. When it’s time to attack the problem and get to the source, look to the code, because the code doesn’t lie.

Finally: do your best and never give up.

To ask us questions about this post or the entire “Every Day is Monday” series, please join the conversation here. We’ll be checking in on the comments throughout business hours, so let us know what you think!

Getting to Know Greg Leffler

Ben Lai — Fri, 13 Jan 2017 09:18:00 -0800

Greg Leffler works on the editorial team in New York City, and is responsible for LinkedIn’s news coverage and curating content for Software Engineers. Before he took on this role, he was a Senior Manager for the Site Reliability Engineering team, and was also in charge of the interviewing process for SREs. Greg started at LinkedIn in 2012 as an SRE, and since then has supported nearly every part of the site, from traffic management to our backend data stores to address book import.

Greg went to the University of Louisville and then earned a master’s degree in Industrial/Organizational Psychology from Old Dominion University. His first engineering job was at eBay in San Francisco, then he moved to LinkedIn and has been here ever since.

I made friends with a robot with a rainbow afro, because that’s what you do in Japan

What are some of the coolest projects that you and your team have been working on?

We have some really exciting changes to the desktop experience and to how we find and get content in front of our members that you’ll be able to see very soon. Those changes include new ways to see extremely tailored content about niche topics. We’ll be the source for professional news, even in deeply technical areas.

What other projects are you involved in outside of your work on the editorial team?

In addition to writing content and to scouring the web for good stuff, I also help edit the Engineering Blog you’re reading now. Since I was previously an engineer and am now an editor, it was a pretty natural fit. I also work to help maintain the R&D culture in the New York office, making sure that engineers and product people feel connected to headquarters and that we have similarly-awesome activities to bring us together.

What are your favorite kinds of engineering stories to read on the blog?

I like stories that focus on solving a problem and explaining how it was done, with examples. I love reading incident postmortems or reading about infrastructure scaling and growth. One of my recent favorites is the “Every Day is Monday in Operations” series, mainly because I felt like I was sharing in the story and learning about the challenges for the first time.

What is the most challenging part of your job?

Figuring out what people want to read now, and—more importantly—will want to read in the future. I try to produce at least one long-form piece a week, and sometimes thinking about what will be the hot topic or what will get engineers talking takes a lot more time than I’d expect!

Compared to other places you've worked, how do you like working at LinkedIn?

I like working at LinkedIn because many of the leaders embrace our value of taking intelligent risks—my move to New York, my promotion to management, and my most recent career change to editorial were all intelligent risks that ultimately made LinkedIn more successful. At other places where I’ve worked, I feel like these things would have been more difficult or impossible to have happen, but here it seemed like we all agreed it was a logical move and it was done.

What are your favorite things to do when you’re not at the office?

Traveling and travel hacking. I optimize points, miles, signup offers, and all that kind of stuff to take trips. I recently took a month off to take an Alaskan cruise and a trip to Japan, and I didn’t spend nearly as much money as you’d think to do those things. I also have written modules for my home automation system so that I get Telegram messages when things happen at home and so that I can control my house via the bot.

What’s something about you not found on your LinkedIn profile?

In keeping with the traveling mention above, I’ve flown enough miles to have flown to the moon (at perigee).

BOSS: Automatically Identifying Performance Bottlenecks through Big Data

Ruixuan Hou — Wed, 11 Jan 2017 15:13:00 -0800

Introduction

As the centralized performance team of LinkedIn, our mission is to make LinkedIn pages load faster. We help each engineering team try to hit their page load time goals through various optimization efforts. One common question we need to answer when trying to decrease page load time is: where is the performance bottleneck? In other words, where should the engineers focus their efforts? Usually, to answer this question, a performance engineer will look into performance metrics, check some samples captured by the Resource Timing API and Call Graph, and locate the hotspots. This approach can be very useful, but it has the drawback of being “trial and error.” Also, many sample waterfalls have to be clicked and analyzed manually to find bottlenecks. We wanted a systematic approach in which a tool could automatically and quickly provide bottleneck details based on the large amounts of data we already had.

In this blog, we’ll discuss BOSS (BOttlenecks for Site Speed), a system we built at LinkedIn that analyzes millions of waterfall samples and automatically identifies bottlenecks for improving performance.

Bottleneck analysis is hard

There are a couple of problems with “manual” bottleneck analysis.

Dealing with multiple performance data sources: multiple systems serve the user’s request, and performance data is tracked separately. We have browser-side measurements using Navigation Timing, Resource Timing API, measurements from native applications (iOS, Android), and server-side tracking data like Call Graph. Each data source has its own unique schema, which makes it difficult to process all of them in one place.
Handling a large volume of performance data: it is important to analyze 100% of our traffic to find the most important bottlenecks. Usually, to find the bottleneck of a page, a performance engineer can look into some samples, and identify some hotspots, but not all. This means that it is easy to miss real bottlenecks and to possibly focus on wrong projects. We wanted to analyze all the data to make sure every LinkedIn member is happy with our site speed. Therefore, we needed to make sure the system could process 100M+ records per day.
Quantifying parallel calls: finding a bottleneck is not as easy as simply finding the longest request in a waterfall. If other calls run in parallel, fixing only the longest request may not reduce the page load time. We needed a model that takes both call duration and parallelization into consideration.
Interpreting the performance metrics: there is a lot of domain-specific terminology in performance data, e.g., DNS connection time, redirect time, client render time, etc. It is not easy for developers to understand at first glance why a page is slow. Instead of showing the raw metrics, we wanted to provide actionable items in the result. Examples could include fixing the high response time of your frontend server, removing HTTP redirects in your pages, parallelizing third-party requests with other calls, etc.

In the following sections, we will explain how we addressed these challenges through BOSS.

Call tree model to unify performance data sources

The toughest part of automating the analysis is putting various data sources together. We have performance tracking data on both the client side and server side. Those datasets are located separately and have different schemas. To resolve this, we built a generic call tree model to glue data together.

One click from the end user will result in multiple requests to multiple systems. As illustrated below, one typical page view contains API requests to our data center, image/JS/CSS requests to CDNs, and some requests to third parties like Ads. Those requests spread out to multiple systems, and we need a way to trace them in one place.

What does this look like? A tree!

At LinkedIn, we’ve already built call trees between different services, which are found inside our data centers. If we apply this concept on other systems like CDN, third party ads, browsers, etc., we get a bigger call tree.

Simplifying client-side waterfall

We use Resource Timing data to build the client-side call tree model. However, the raw waterfall contains many page-level navigation timing metrics, like redirect duration, first byte time, page download time, etc., as well as more than a hundred resource timing entries for all the downloads associated with the page—HTTP calls with varying URLs and resource types. This makes it hard to determine the cause of slowness. Bottlenecks need to be actionable. For example, if profile pictures are generally slow to download, it means our media CDN needs to be investigated, rather than everyone’s profile pictures.

To solve this problem, we came up with bottleneck types that we assigned to each resource/metric in the waterfall in order to get actionable bottlenecks.

Bottleneck Type	Resource Timing/Navigation Timing data source
Server-side	Time to first byte and content download time of request to linkedin.com domain
CDN	Request to our CDN domain
Long native code execution time (JS/parsing, rendering) on client	Gaps in waterfall and customized markers using user timing data
Third-party content	Request not served by LinkedIn-owned domains
Redirect	Redirect time and count of navigation timing data
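As a rough illustration of the URL-based rows in this mapping, the sketch below assigns a bottleneck type to a single request by its host name. The CDN domain is a placeholder and the function names are hypothetical. Gaps and redirects are derived from timing data rather than URLs, so they are not covered here.

```python
from typing import Tuple
from urllib.parse import urlparse

# Placeholder domain lists for illustration; a real deployment would configure its own.
FIRST_PARTY_SUFFIXES: Tuple[str, ...] = ("linkedin.com",)
CDN_SUFFIXES: Tuple[str, ...] = ("cdn.example.net",)


def _matches(host: str, suffixes: Tuple[str, ...]) -> bool:
    """True if host equals a suffix or is a subdomain of it."""
    return any(host == s or host.endswith("." + s) for s in suffixes)


def bottleneck_type(url: str) -> str:
    """Map one resource URL to a bottleneck type from the table above."""
    host = (urlparse(url).hostname or "").lower()
    if _matches(host, CDN_SUFFIXES):
        return "CDN"
    if _matches(host, FIRST_PARTY_SUFFIXES):
        return "Server-side"
    return "Third-party content"


if __name__ == "__main__":
    for url in (
        "https://www.linkedin.com/feed/",
        "https://static.cdn.example.net/app.js",
        "https://ads.example.org/pixel",
    ):
        print(url, "->", bottleneck_type(url))
```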

Note that we saw a lot of gaps in waterfalls where no network activity happened. After debugging locally, we found that there is a lot of heavy native code execution (JS/parsing/rendering) during these gaps. To measure this more precisely, we started using the User Timing API to instrument the key rendering paths so that we get more insight.
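One simple way to surface such gaps is to sweep over the request intervals in start order and record every window in which nothing is in flight. The sketch below assumes each request is reduced to a (start, end) pair in milliseconds. The function name and the threshold parameter are ours.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms) of one network request


def find_gaps(requests: List[Interval], min_gap_ms: float = 0.0) -> List[Interval]:
    """Return idle windows longer than min_gap_ms between the first start and last end."""
    gaps: List[Interval] = []
    busy_until: Optional[float] = None
    for start, end in sorted(requests):
        # A gap exists when the next request starts after all earlier ones have ended.
        if busy_until is not None and start - busy_until > min_gap_ms:
            gaps.append((busy_until, start))
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


if __name__ == "__main__":
    # Hypothetical waterfall: two overlapping requests, an idle period, then two more.
    waterfall = [(0, 100), (50, 120), (300, 400), (380, 450)]
    print(find_gaps(waterfall, min_gap_ms=50))
```

This only shows where the network was idle. Attributing a gap to JS execution, parsing, or rendering still needs markers such as those from the User Timing API.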

Call trees analysis

Once we have the “combined tree,” the next challenge is how to analyze these trees to find the performance bottlenecks of each page. Basically, there are two kinds of calls that hurt performance:

Slow calls
Sequential/blocking calls

Developers are very sensitive to the latency of individual requests, but sometimes neglect the importance of parallelizing calls. Let’s use two hypothetical page views as an example. The first one parallelizes the CSS and JS calls with the HTML request, but the second one does not. The latency of every call is the same, but the page load times differ by 1,100ms! The bottleneck here is that the HTML request is blocking the CSS and JS requests.

BOSS penalizes both slow calls and blocking calls. In the example above, our bottleneck analysis attributes 38.9% to the HTML call in the parallelized page view and 72.2% in the non-parallelized page view. Even though the duration of each call is the same, the blocking HTML call is penalized more heavily in our analysis and is marked as the bottleneck of the page. The algorithm we use is based on existing service contribution tools, and looks into the call tree between services. Given that client-side data also fits well into our generic call tree model, and we now have a “combined call tree,” we can apply heuristics from service contribution to the client side.

To understand the algorithm that calculates the bottleneck contribution, let’s use a simplified version of a page view call tree. The timeline is divided into segments based on the start and end times of each call. For each time segment, if multiple calls are happening in parallel, we attribute the segment evenly to every call. In the example below, the 90ms segment is divided into 3 pieces and each call is assigned a 30ms contribution. For the 100ms segment of Call A, there are no other calls in parallel, so Call A is blamed for the whole segment, which results in a very high contribution for Call A. Calls with high contributions are the top bottlenecks. As a result, Call A gets 195ms, which is 60.9% of the total page load time of 320ms, due to the first 100ms blocking segment plus its long duration. Call B gets 20.3% and Call C gets 18.8%, since they are well parallelized with each other.
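The segment-splitting rule described above can be sketched in a few lines. This is a simplified illustration rather than the production service contribution algorithm. The call intervals are hypothetical, chosen so that the totals match the figures quoted in the text (195ms, 65ms, and 60ms out of 320ms).

```python
from typing import Dict, Tuple


def contributions(calls: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Split each time segment evenly among the calls active during it."""
    # Segment boundaries are every distinct start or end time.
    points = sorted({t for span in calls.values() for t in span})
    blame = {name: 0.0 for name in calls}
    for left, right in zip(points, points[1:]):
        # A call is active in a segment if it covers the whole segment.
        active = [n for n, (s, e) in calls.items() if s <= left and e >= right]
        for name in active:
            blame[name] += (right - left) / len(active)
    return blame


if __name__ == "__main__":
    # Hypothetical (start_ms, end_ms) intervals, not taken from the original figure.
    calls = {"A": (0, 320), "B": (100, 260), "C": (170, 320)}
    blame = contributions(calls)
    total = sum(blame.values())
    for name, ms in blame.items():
        print(f"Call {name}: {ms:.0f}ms ({ms / total:.1%})")
```

With these intervals, Call A is alone for the first 100ms, shares 70ms with Call B, shares 90ms with both other calls, and shares the final 60ms with Call C. Segments in which no call is active are skipped, so idle gaps would need to be modeled as their own entries to receive blame.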

Analyzing performance bottlenecks at scale

Processing the call tree data is non-trivial. We are getting millions of page view records and each record creates a call tree with hundreds of nodes. Here are the requirements of our data processing system and the solutions we picked:

Scalable to massive amounts of data: we picked a Kafka + Hadoop solution to handle the data, which has proven to be very successful at LinkedIn.
Ability to slice and dice data into different dimensions and aggregated metrics: what is the bottleneck for the slowest 10% of members? What is the bottleneck for members using a 4G Network? What is the bottleneck for members who only visit our website once a week? We picked Apache Pig as the language of choice to perform these tasks in Hadoop.
Fast iterations on call tree analysis algorithms: the logic of processing a call tree is complicated and needs to be tuned in multiple iterations. It is impractical to use Pig to handle the logic and test it in Hadoop every time. To solve this, we use UDFs (user defined functions) written in Java/Python to handle the call tree analysis logic and write unit tests for fast iterations.
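To make the slicing and aggregation step concrete, here is a small in-memory sketch in plain Python. It stands in for logic that, per the list above, runs as Pig scripts and UDFs on Hadoop. The record layout (`plt_ms`, `blame_ms`) and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def contribution_ratios(page_views: Iterable[dict]) -> Dict[str, float]:
    """Aggregate per-type blame (ms) over page views into percentages."""
    totals: Dict[str, float] = defaultdict(float)
    for view in page_views:
        for kind, ms in view["blame_ms"].items():
            totals[kind] += ms
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {kind: 100.0 * ms / grand_total for kind, ms in totals.items()}


def slowest_fraction(page_views: List[dict], fraction: float) -> List[dict]:
    """Keep the slowest `fraction` of page views by page load time."""
    ranked = sorted(page_views, key=lambda v: v["plt_ms"], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


if __name__ == "__main__":
    # Hypothetical records: page load time plus blame per bottleneck type.
    views = [
        {"plt_ms": 1000 * (i + 1), "blame_ms": {"Server response": 300.0, "CDN objects": 100.0}}
        for i in range(10)
    ]
    print(contribution_ratios(slowest_fraction(views, 0.10)))
```

Keeping the analysis logic in small pure functions like these is also what makes it practical to unit test outside the cluster.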

With the scalable data analysis system available, the next challenge is building an algorithm to analyze the call tree.

UI for bottleneck analysis: putting it all together

On the UI side, we built the following components to assist the analysis:

Bar and pie charts to highlight top bottlenecks;
Stacked trending chart to show how the bottleneck changes over time;
A table for easy sorting and look up.

On the same UI, users can also view the latency distribution of page views in a scatter plot and click each point to get the full waterfall of the page.

With this powerful tool, users can easily find the bottlenecks with a few clicks.

Real-world example

Here is a bottleneck analysis we ran for our LinkedIn desktop home page in March 2016. Each type of bottleneck has a contribution ratio, which means “by removing this bottleneck, how much site speed improvement we can have.” Here is the table that lists top bottlenecks and the solutions to fix each.

Bottleneck Type	Contribution Ratio (%)	How to Fix
Server response	27.32 %	Server-side bottleneck mainly comes from slower services
Gaps without network activities	22.16 %	This bottleneck indicates some JS execution or browser parsing and rendering are blocking other network requests
CDN objects	20.16 %	This bottleneck indicates slower CDN vendor or big image/JS/CSS size
AJAX calls	9.21 %	This can be fixed by either consolidating the AJAX call with the HTML response or deferring it until after page load
Ads calls	7.57 %	Ads calls should not be blocking the following calls
Redirects	4.50 %	Unnecessary redirects should be avoided; they are always blocking the following request
Network connection	3.65 %	This indicates slow TCP handshakes, a problem that can come from the local ISP or our proxy server

After finding the bottlenecks, the next step was locating the particular requests/code causing the bottlenecks. We examined a couple of waterfalls and found that a long gap happens frequently after downloading the ads. That indicates that there are heavy JavaScript executions after downloading the ads and the next requests are waiting for the JavaScript to finish execution.

The fix was easy: we just made the ads load in a non-blocking way so that images and other resources can be loaded at the same time. After running A/B testing, it turned out that our home page became 21% faster in page load time. At the same time, we’ve seen a boost in engagement metrics; after deferring ads, users are more engaged with the site.

And in our bottleneck analysis data, the contribution of gaps and ads calls dropped significantly.

Conclusion

We have built a bottleneck analysis tool, BOSS, that can process data at scale and produce actionable optimization items. We already have several success stories, but we plan to make additional improvements, including:

More focus on server-side analysis. So far, BOSS works well for client-side analysis. We want to do the same on the server side. For example, bottleneck analysis for different API endpoints, service calls across different data centers, etc.
Auto suggestion for performance optimization. Currently, the tool is used in a passive mode: analysis is only done when a user visits our tool and wants to do some optimization. Instead, we want to automatically run analysis for every page and send optimization suggestions directly to page owners. In this way, we can bubble up performance issues easily and make improvements early.

Acknowledgements

Thanks to David He and Ritesh Maheshwari for the invaluable input and feedback. Thanks to Toon Sripatanaskul for his awesome service contribution algorithm. Thanks to Swapnil Ghike and Joseph Zemek for their pioneering efforts for this project. And thanks to Oliver Tse for being our avid user and for his valuable feedback. And finally, thanks to Steven Pham and Dylan Harris for their help in developing the BOSS UI.

Asynchronous Processing and Multithreading in Apache Samza, Part II: Experiments and Evaluation

Xinyu Liu — Fri, 06 Jan 2017 08:42:00 -0800

This post is the second in a series discussing asynchronous processing and multithreading in Apache Samza. In the previous post, we explored the design and architecture of the new AsyncStreamTask API and the asynchronous event loop. In this post, we will focus on the study of the performance of this feature with benchmark Samza jobs. Some of the interesting questions are:

Can Samza scale well when a job is doing asynchronous remote I/O, or using multithreading?
Does the capability of out-of-order processing increase the parallelism?
By building first-class support for asynchronous processing, does it impact the performance of existing synchronous jobs?

We did the experiments for jobs with remote data access, local data access, and CPU-bound computations. To summarize what we found:

Experiments with remote I/O jobs show superior performance: the benchmark job is able to scale linearly with the increase of parallelism. Async I/O support further enhances the CPU utilization.
Experiments with RocksDB jobs show certain performance benefits when RocksDB access is frequent. But the performance improvement is far from linear.
For existing CPU-bound jobs, results show that the cost of synchronization is negligible when running all tasks on the event loop thread. When the job is running in the thread pool with short CPU-bound process tasks, the performance degrades significantly.

Part I: Remote I/O

Setup
We use a remote I/O job that consumes a PageViewEvent topic of 10 partitions and then requests the member’s inbox information for each event using REST calls. The call duration varies from 1ms to 200ms. The job runs on a single worker-node Yarn cluster with one container (one core) and 1GB of memory.

Baseline
The baseline of the process rate is about 20 messages/sec, which is measured when running all tasks in the legacy sequential execution model of Samza.

Experiment results
We tested both the blocking I/O and asynchronous I/O cases.

Blocking I/O: The following figure shows the performance enhancements from baseline to multithreaded execution. The initial thread pool size is 10, the same as the partition count. The process rate goes up to above 250 messages/sec. To test the concurrency within a task, we further increase the task max concurrency to 3. We also increase the thread pool size to 30, so each task can process 3 messages in parallel. The process rate goes above 1,000 messages/sec.

Async I/O: To evaluate the performance of asynchronous I/O, we use the asynchronous I/O client of Rest.li in an AsyncStreamTask. This client uses efficient non-blocking HTTP connections and does not require threads to wait for I/O completion. We first set task max concurrency to 1, and then increase it to 3. The results are similar to the previous experiment with thread pool sizes of 10 and then 30, but CPU and memory utilization is much lower, since there are no additional threads waiting for I/O.
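The idea of bounding in-flight calls per task without dedicating a thread to each one can be illustrated with Python's asyncio. This is an analogy only, not Samza's AsyncStreamTask API or the Rest.li client. The remote call is simulated with a short sleep, and all names are hypothetical.

```python
import asyncio
from typing import List


async def fake_rest_call(member_id: int) -> str:
    """Stand-in for a remote call; sleeps instead of doing real I/O."""
    await asyncio.sleep(0.01)
    return f"inbox-{member_id}"


async def process_partition(messages: List[int], max_concurrency: int) -> List[str]:
    """Process one task's messages with at most max_concurrency calls in flight."""
    limit = asyncio.Semaphore(max_concurrency)

    async def handle(member_id: int) -> str:
        async with limit:
            return await fake_rest_call(member_id)

    # gather returns results in input order even if calls finish out of order.
    return list(await asyncio.gather(*(handle(m) for m in messages)))


if __name__ == "__main__":
    print(asyncio.run(process_partition(list(range(6)), max_concurrency=3)))
```

A single thread drives all the pending calls here, which mirrors why the asynchronous variant in the experiment needs far fewer resources than a pool of 30 blocked threads.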

Part II: RocksDB

Setup
We use a Samza job that reads/writes to RocksDB. The job keeps count of how many times each input message key has been seen. For each message, it extracts the key, looks it up from the local RocksDB store, and updates the count. The input stream has 8 partitions and is preloaded with 1G messages, with message size around 100 bytes. Each key is a unique number ranging from 1 to 1G, so each process call has a cache miss and accesses RocksDB. The job uses one container and runs for 30 minutes.

Baseline
The baseline of the performance test processes about 55K messages/second in the legacy single-thread execution model.

Experiment results
Since the task is synchronous, we run the job in the built-in thread pool. The test results are the following:

Tests	Process envelopes (avg/sec)	User CPU utilization(percent)	Heap used (mean MB)
Baseline	54,926	17.3	1,578
Multithreading (threadpool: 0)	50,259 (-8.5%)	16.6 (-4%)	1,532 (-3%)
Multithreading (threadpool: 1)	63,601 (+15.7%)	24.2 (+40%)	1,644 (+4%)
Multithreading (threadpool: 8)	89,816 (+63.5%)	40.9 (+136%)	1,665 (+5%)
Multithreading (threadpool: 8, max concurrency: 3)	119,308 (+117.2%)	48.7 (+181%)	1,685 (+7%)

When thread pool size is 0, all tasks run on the event loop thread, and we see a small overhead of synchronization. When we increase the thread count, the multithreading model performs better and better. In the last experiment, we set the max concurrency per task to 3, and we see both processing rate and CPU usage increase more. During all the tests, the memory heap usage is mostly the same.

Part III: CPU-bound

Setup
Here, we conduct an experiment that compares single-node peak performance with the previously published results. We used the same test job, which draws randomly generated keys from a key space of 1M, and set the cache size to 1M objects as well, to avoid cache misses once the cache warms up. The job consumes from a topic with 48 partitions. Lookups and updates happen mostly in the cache, which is periodically flushed to RocksDB. The job has 24 containers and runs in a single worker-node YARN cluster.

Baseline
As mentioned in the published results, Samza can process around 1.1M messages per second on a single node.

Experiment results
When running all tasks in the asynchronous event loop thread (thread pool size is 0), performance is on par with the baseline. After increasing the thread pool size to 1, throughput drops to about one third of the baseline. This is caused by the overhead of thread scheduling (about 5 microseconds on average on this host), which is much longer than the per-message processing time. This study shows that for CPU-bound jobs, single-thread execution is still optimal, and the asynchronous event processing model performs on par with the legacy synchronous processing model in this case.

Execution mode	Process rate (messages/sec)
Baseline	1.14 M
Multithreading (threadpool: 0)	1.12 M
Multithreading (threadpool: 1)	0.37 M

Summary

In practice, we see both performance and efficiency gains in running Samza jobs with the new asynchronous processing API. We expect future improvements in resource utilization in our current Samza jobs.

Acknowledgements

Thanks to the Samza engineering teams at LinkedIn for the invaluable help with the design and implementation of this feature. Thanks to all Samza customers for their feedback on the use cases of remote access in practice.

Asynchronous Processing and Multithreading in Apache Samza, Part I: Design and Architecture

Xinyu Liu — Wed, 04 Jan 2017 09:03:00 -0800

As part of the Apache Samza 0.11 release, we rebuilt Samza’s underlying event processing engine to use an asynchronous and parallel processing model. The new model is unique among current open source stream processors because it not only supports running traditional synchronous processing in parallel on multiple threads, but also provides first-class support for asynchronous processing, which is useful for applications that need to perform non-blocking I/O. With this support, a user job can now perform either in-order processing or out-of-order processing with certain processing semantics guaranteed. This post introduces the new Samza asynchronous API and model, explores the details of the asynchronous event loop, and finally discusses the semantics guaranteed when processing messages.

Introduction

One of the common problems faced in stream processing is data access. The requirements, as discussed here, are to support the following access patterns at scale: read/write data, for maintaining the application state; and read-only data, for looking up adjunct data. To address these problems, modern stream processing frameworks have been advocating local data access, such as an in-memory store (Apache Spark, Apache Storm, Apache Flink) or RocksDB store (Apache Samza), backed by durable storage such as HDFS or Kafka. When the data can be co-located with the processor, the local data access greatly improves the performance. As reported here, a Samza test job with local RocksDB state store is able to process 1.1 million requests per second on a single machine (1.7T SSD, 48G RAM).

The challenges of remote data access, however, remain largely unchanged: performance bottlenecks caused by slow I/O, parallelism limited by the execution unit (processes/threads), and increasing hardware cost at scale. The challenges are further compounded by the complexity of handling the ordering of the event processing and checkpointing. For example, a Samza application named Air Traffic Control (ATC) is used for delivering LinkedIn emails and notifications to members. It aggregates multiple channels, such as member chats, activities, and network updates. These communications require looking up data from multiple remote data sources, including invitations, mailbox, connection graph, network feed, and comments. In the absence of efficient remote data access support, ATC has to manage its own threads to parallelize the remote data access for better performance and resource utilization. ATC also handles the process ordering and checkpointing internally by itself, which increases the development cost and code complexity dramatically.

Modern application/service platforms, such as Node.js and Play framework, address the remote data access problems by supporting asynchronous user applications. These applications make asynchronous I/O (or non-blocking I/O) calls to start I/O communication, and then perform other operations which do not depend on the results of the communication. This allows overlap of processing and I/O, with notification of I/O completion. A process can perform multiple I/O requests and allow the kernel to handle the data transfer details. As discussed here, this allows asynchronous I/O to boost application performance greatly, as well as reducing CPU and memory footprint.

To the best of our knowledge, none of the existing open source stream processors support asynchronous I/O applications. In terms of parallelism, most of them have concurrency on the task level (a task usually processes a single workload after grouping). Flink is the only processor that supports parallel instances of a task, but the ordering and checkpointing semantics are not specified.

Asynchronous processing and multithreading model

For the reasons listed above, Apache Samza now provides first-class support for asynchronous processing. The new Samza asynchronous task API uses a callback-based approach to support asynchronous I/O. This allows easy integration with most async I/O libraries. For applications using synchronous processing, Samza allows parallelism among tasks through simple configuration. Samza further supports parallelism within a task, in case even more parallelism is required. If you are new to Apache Samza, this article introduces the basic concepts of Samza.

Asynchronous API

We had two requirements for the design of the Samza asynchronous API:

Support asynchronous processing with non-blocking I/O calls.
Support various concurrency libraries, such as Akka (actor-based), ParSeq (task-based), and JDeferred (deferral-based).

To meet both requirements, we designed the API using the most primitive construct in concurrent programming: callbacks. Sequential execution can be implemented using synchronous callbacks, and asynchronous execution can be implemented using asynchronous callbacks triggered by the user thread. Furthermore, callbacks can be seamlessly integrated with concurrency libraries by invoking the callback after the concurrent computation is done. For example, multithreaded execution can be achieved by running the tasks in a thread pool and invoking the asynchronous callbacks upon completion.
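As a rough illustration of the last point, the sketch below wraps a synchronous handler so that completion is always reported through callbacks, whether the handler runs on the caller's thread or in a thread pool. The class CallbackAdapter and its method names are hypothetical and are not part of the Samza API; treating a pool size of 0 as "run inline" is an assumption made for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical adapter: runs a synchronous handler and reports the outcome via callbacks.
final class CallbackAdapter {
    private final ExecutorService pool; // null means "run on the caller's thread"

    CallbackAdapter(int threads) {
        this.pool = threads > 0 ? Executors.newFixedThreadPool(threads) : null;
    }

    void process(String message, Consumer<String> handler,
                 Runnable onComplete, Consumer<Throwable> onFailure) {
        Runnable work = () -> {
            try {
                handler.accept(message);
            } catch (RuntimeException e) {
                onFailure.accept(e); // report the error instead of completing
                return;
            }
            onComplete.run(); // signal that the message is fully processed
        };
        if (pool == null) {
            work.run();         // synchronous callback: sequential execution
        } else {
            pool.execute(work); // asynchronous callback from a pool thread
        }
    }

    // Stops accepting work and waits for submitted work to finish.
    boolean shutdownAndWait(long timeoutSeconds) {
        if (pool == null) {
            return true;
        }
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The caller sees the same contract in both modes: it hands over a message and is told later, through exactly one of the two callbacks, how processing ended.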

The full AsyncStreamTask API is defined here. Below we use an example task to illustrate the API:

In this example, the task makes an asynchronous request to some remote service (http://example.com/resource) using Jersey client, and triggers the task callback from the Jersey callback invocation thread. The callback will notify the Samza engine that the message has been processed completely by the task, and the next message will be dispatched for the task to process (assuming task max concurrency is 1).
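The pattern described above can be sketched without the Samza or Jersey dependencies. In this sketch, SketchCallback is a hypothetical stand-in for the task callback, AsyncLookupTask is a hypothetical task, and the remote service is modeled as any function that returns a CompletableFuture; none of these names come from the Samza API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

// Hypothetical stand-in for the task callback handed to the task by the engine.
interface SketchCallback {
    void complete();
    void failure(Throwable error);
}

// Hypothetical task that issues a non-blocking remote lookup for every message.
final class AsyncLookupTask {
    private final Function<String, CompletableFuture<String>> remoteLookup;
    final ConcurrentLinkedQueue<String> responses = new ConcurrentLinkedQueue<>();

    AsyncLookupTask(Function<String, CompletableFuture<String>> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    // Returns immediately; the callback fires on whichever thread completes the future.
    void processAsync(String message, SketchCallback callback) {
        remoteLookup.apply(message).whenComplete((response, error) -> {
            if (error != null) {
                callback.failure(error);
            } else {
                responses.add(response);
                callback.complete(); // tells the engine this message is fully processed
            }
        });
    }
}
```

The important property is that processAsync never blocks on the remote call: the thread that dispatched the message is free to do other work until the callback is triggered.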

Event loop

In Samza, the event loop in a container is responsible for running multiple user tasks for consuming/producing messages, windowing, and checkpointing. (For more information on windows and checkpoints in Samza, see this article.) To support callback-based asynchronous processing, we implemented a brand new event loop. The following flowchart illustrates how the event loop works for one task:

The event loop works as follows:

When a message event (1) comes from the system consumers, Samza will check Cond 1. Internally, Samza keeps a counter of the outstanding callbacks for each task, and invokes 1a task.processAsync() if the counter value is less than the max concurrency allowed. When the user task finishes the asynchronous processing, it will trigger the callback in the user thread. Then in 1b, the callback will notify the event loop and update the counter.
When a window timer event (2) comes, Samza will check Cond 2 to make sure there are no outstanding task actions. Then Samza will invoke 2a task.window(). When the user window function completes, the event loop will be notified as in 2b and continue to process future events.
When a checkpoint event (3) comes, Samza will do the same check as for a window event (2). Then Samza will invoke 3a checkpoint for the task. When the checkpoint completes in 3b, the event loop will be notified and continue to process future events.
Loop 1 to 4 runs until all tasks have outstanding actions (processAsync, window, or checkpoint). Then the loop will block itself until the next task becomes available to handle new events.
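The two conditions checked by the event loop can be sketched as a small per-task gate. This is a simplified model under stated assumptions, not the actual event loop code: the class TaskGate and its method names are hypothetical, and window and checkpoint are folded into a single "exclusive" action because both require that nothing else is outstanding for the task.

```java
// Hypothetical per-task bookkeeping for the conditions the event loop checks.
final class TaskGate {
    private final int maxConcurrency;
    private int outstanding;          // processAsync calls whose callbacks have not fired yet
    private boolean exclusiveRunning; // true while window or checkpoint is in progress

    TaskGate(int maxConcurrency) {
        this.maxConcurrency = maxConcurrency;
    }

    // Cond 1: dispatch a message only while below the max concurrency.
    synchronized boolean tryStartProcess() {
        if (exclusiveRunning || outstanding >= maxConcurrency) {
            return false;
        }
        outstanding++;
        return true;
    }

    // Called from the callback (possibly on a user thread) when processing completes.
    synchronized void processDone() {
        outstanding--;
    }

    // Cond 2: window and checkpoint run only when nothing is outstanding.
    synchronized boolean tryStartExclusive() {
        if (exclusiveRunning || outstanding > 0) {
            return false;
        }
        exclusiveRunning = true;
        return true;
    }

    synchronized void exclusiveDone() {
        exclusiveRunning = false;
    }
}
```

The methods are synchronized because, in this model, completions arrive from user threads while the dispatch decisions are made on the event loop thread.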

Guaranteed semantics

In order to provide useful asynchronous processing in Samza core, we designed the following semantic models and guarantees. With them, the user job is relieved from error-prone synchronization so that it can truly focus on the processing itself.

Parallelism
With the above processing model, Samza provides multiple levels of parallelism. If the callback is invoked synchronously, all tasks will be run in the same event loop thread. In this scenario, the model provides the traditional Samza container-level parallelism. If the callback is invoked asynchronously, Samza supports task-level parallelism. Tasks can process in parallel without blocking each other. This is because the invocation of processAsync in one task doesn’t require waiting for the callback/window/commit from other tasks. Furthermore, Samza supports within-task-level parallelism. This is achieved by allowing multiple processAsync invocations for one task without waiting for the callbacks to complete. The max concurrency, configured by task.max.concurrency, is enforced when scheduling the next processAsync inside the event loop.

It is also straightforward to run the synchronous tasks in parallel with multithreading. Samza ships with a built-in thread pool which makes multithreading work out of the box. Without any code change, the user can simply configure the property job.container.thread.pool.size and then the tasks are executed in parallel. However, asynchronous I/O is the preferred way to access remote data due to superior performance and better scalability for more parallelism.

Ordering
Let’s discuss the ordering guarantees under different scenarios of parallelism. For parallelism on the container or task level, messages are guaranteed to be processed in order. For parallelism within a task, Samza guarantees processAsync will be invoked in order for a task. The processing or completion, however, can go out of order. With this guarantee, users can implement sub-task-level data pipelining with customized ordering and parallelism. For example, users can use a keyed single thread executor pool to have in-order processing per key while processing messages with different keys in parallel.
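A keyed single-thread executor pool of the kind mentioned above can be sketched as follows. The class KeyedExecutor is hypothetical, and routing by hash code to a fixed number of lanes is one possible design, assumed here for illustration: work for the same key always lands on the same single-threaded lane, so it runs in submission order, while different lanes run in parallel.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical keyed executor: in-order per key, parallel across lanes.
final class KeyedExecutor {
    private final ExecutorService[] lanes;

    KeyedExecutor(int laneCount) {
        lanes = new ExecutorService[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    // The same key always maps to the same lane, which preserves per-key order.
    void submit(Object key, Runnable work) {
        int lane = Math.floorMod(key.hashCode(), lanes.length);
        lanes[lane].execute(work);
    }

    // Stops accepting work and waits up to the timeout for each lane to drain.
    boolean shutdownAndWait(long timeout, TimeUnit unit) {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
        try {
            boolean allDone = true;
            for (ExecutorService lane : lanes) {
                allDone &= lane.awaitTermination(timeout, unit);
            }
            return allDone;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Inside a task, each message would be submitted under its key and the task callback triggered at the end of the submitted work. Keys that collide on a lane are serialized with each other, which costs some parallelism but never breaks per-key order.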

Checkpointing
Samza guarantees it will only checkpoint the messages that are completely processed. Samza uses a low watermark model for checkpointing. It maintains a queue of the completed callbacks, sorted by the callback sequence number. When checkpointing happens, the max offset of the contiguous callback sequence (the low watermark) will be committed. The head of the queue will advance to the next callback for the future checkpointing.
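The low watermark idea can be sketched as follows. This is a simplified model, not Samza's implementation: the class LowWatermark is hypothetical, sequence numbers are assumed to start at 0 and increase by one per dispatched message, and -1 is used as an assumed marker for "nothing new to commit".

```java
import java.util.TreeMap;

// Hypothetical low-watermark tracker: commits only offsets whose predecessors are all complete.
final class LowWatermark {
    static final long NOTHING_TO_COMMIT = -1L;

    // Completed callbacks, sorted by callback sequence number (sequence -> message offset).
    private final TreeMap<Long, Long> completed = new TreeMap<>();
    private long nextSequence = 0L; // lowest sequence number not yet covered by a checkpoint

    // Records a completed callback; completions may arrive out of order.
    synchronized void complete(long sequence, long offset) {
        completed.put(sequence, offset);
    }

    // Consumes the contiguous run of completions at the head and returns its max offset.
    synchronized long checkpoint() {
        long offset = NOTHING_TO_COMMIT;
        while (!completed.isEmpty() && completed.firstKey() == nextSequence) {
            offset = Math.max(offset, completed.pollFirstEntry().getValue());
            nextSequence++;
        }
        return offset;
    }
}
```

For example, if the callbacks with sequence numbers 0 and 2 have completed but 1 has not, a checkpoint commits only the offset of message 0. The completion of message 2 stays queued until message 1 finishes, so a restart can never skip an unprocessed message.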

Memory visibility and happens-before semantics
The following semantics are guaranteed in the above processing model:

Event processing within Samza is thread-safe. You can safely access your job’s state in the local RocksDB key-value store, write messages, and checkpoint offsets in the task threads. Any other state or code shared between tasks, e.g., global variables or static fields, is not thread-safe if it can be accessed by multiple threads. Samza guarantees the mutual exclusiveness of process, window, and commit so there will be no concurrent modifications among these operations and any state change from one operation will be fully visible to the others.
Each processAsync is guaranteed to happen before the next invocation of processAsync of the same task. If task max concurrency is 1, the completion of the processing is guaranteed to happen before the next invocation of processAsync of the same task. If task max concurrency is greater than 1, there is no such happens-before constraint.
window is called when no invocations to processAsync are pending and no new processAsync invocations can be scheduled until it completes. Samza guarantees that all previous processAsync invocations happen before an invocation of window. An invocation of window is guaranteed to happen before any subsequent processAsync invocations. The Samza engine is responsible for ensuring that window is invoked in a timely manner.
checkpoint is guaranteed to only cover events that are fully processed. It happens only when there are no pending processAsync or window invocations. All preceding invocations happen before checkpointing, and checkpointing happens before all subsequent invocations.

Summary

Samza has made the first attempt to bridge the gap between stream processing and asynchronous I/O for remote data access, which is commonly found in modern asynchronous application/service platforms. Samza provides support for different granularities of parallelism, from the container level to concurrency within a task. Samza also supports both in-order and out-of-order processing. Finally, Samza provides practical semantic guarantees to reduce the user job complexity.