<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Heroku Engineering Blog</title>
    <description></description>
    <link>http://engineering.heroku.com/</link>
    <atom:link href="http://engineering.heroku.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Thu, 30 Mar 2017 15:48:33 +0000</pubDate>
    <lastBuildDate>Thu, 30 Mar 2017 15:48:33 +0000</lastBuildDate>
    <generator>Jekyll v3.4.0</generator>

    
      <item>
        <title>Sockets in a Bind: Troubleshooting Port Exhaustion in Heroku's Routing Layer</title>
        <description>&lt;p&gt;Back on August 11, 2016, Heroku experienced increased routing latency in the EU region of the common runtime. While the &lt;a href=&quot;https://status.heroku.com/incidents/930&quot;&gt;official follow-up report&lt;/a&gt; describes what happened and what we’ve done to avoid this in the future, we found the root cause to be puzzling enough to require a deep dive into Linux networking.&lt;/p&gt;

&lt;p&gt;The following is a write-up by SRE member Lex Neva (&lt;a href=&quot;https://www.usenix.org/conference/srecon16/program/presentation/neva&quot;&gt;what’s SRE?&lt;/a&gt;) and routing engineer Fred Hebert (now a Heroku alum) of an interesting Linux networking “gotcha” they discovered while working on incident 930.&lt;/p&gt;

&lt;h2 id=&quot;the-incident&quot;&gt;The Incident&lt;/h2&gt;
&lt;p&gt;Our monitoring systems paged us about a rise in latency levels across the board in the EU region of the Common Runtime. We quickly saw that the usual causes didn’t apply: CPU usage was normal, packet rates were entirely fine, memory usage was green as a summer field, request rates were low, and socket usage was well within the acceptable range. In fact, when we compared the EU nodes to their US counterparts, all metrics were at a nicer level than the US ones, except for latency. How to explain this?&lt;/p&gt;

&lt;p&gt;One of our engineers noticed that connections from the routing layer to dynos were getting the POSIX error code &lt;code class=&quot;highlighter-rouge&quot;&gt;EADDRINUSE&lt;/code&gt;, which is odd.&lt;/p&gt;

&lt;p&gt;For a server socket created with &lt;code class=&quot;highlighter-rouge&quot;&gt;listen()&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;EADDRINUSE&lt;/code&gt; indicates that the port specified is already in use. But we weren’t talking about a server socket; this was the routing layer acting as a client, connecting to dynos to forward an HTTP request to them.  Why would we be seeing &lt;code class=&quot;highlighter-rouge&quot;&gt;EADDRINUSE&lt;/code&gt;?&lt;/p&gt;

&lt;h2 id=&quot;tcpip-connections&quot;&gt;TCP/IP Connections&lt;/h2&gt;

&lt;p&gt;Before we get to the answer, we need a little bit of review about how TCP works.&lt;/p&gt;

&lt;p&gt;Let’s say we have a program that wants to connect to some remote host and port over TCP.  It will tell the kernel to open the connection, and the kernel will choose a source port to connect from.  That’s because every IP connection is uniquely specified by a set of 4 pieces of data:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;( &amp;lt;SOURCE-IP&amp;gt; : &amp;lt;SOURCE-PORT&amp;gt; , &amp;lt;DESTINATION-IP&amp;gt; : &amp;lt;DESTINATION-PORT&amp;gt; )
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;No two connections can share this same set of 4 items (called the “4-tuple”).  This means that any given host (&lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;SOURCE-IP&amp;gt;&lt;/code&gt;) can only connect to any given destination (&lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;DESTINATION-IP&amp;gt;&lt;/code&gt;:&lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;DESTINATION-PORT&amp;gt;&lt;/code&gt;) at most 65536 times concurrently, which is the total number of possible values for &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;SOURCE-PORT&amp;gt;&lt;/code&gt;.  Importantly, it’s okay for two connections to use the same source port, provided that they are connecting to a different destination IP and/or port.&lt;/p&gt;

&lt;p&gt;Usually a program will ask Linux (or any other OS) to automatically choose an available source port to satisfy the rules.  If no port is available (because 65536 connections to the given destination (&lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;DESTINATION-IP&amp;gt;&lt;/code&gt;:&lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;DESTINATION-PORT&amp;gt;&lt;/code&gt;) are already open), then the OS will respond with &lt;code class=&quot;highlighter-rouge&quot;&gt;EADDRINUSE&lt;/code&gt;.&lt;/p&gt;
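
&lt;p&gt;As a minimal sketch (our illustration, not Heroku’s code), the following Python snippet shows that the OS picks the ephemeral source port only once &lt;code class=&quot;highlighter-rouge&quot;&gt;connect()&lt;/code&gt; is called; it uses a loopback listener so it is self-contained:&lt;/p&gt;

```python
# A minimal sketch (not Heroku's code): the kernel chooses the ephemeral
# source port only when connect() runs, once it knows the destination.
import socket

# Self-contained: a loopback listener stands in for the remote host.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))      # port 0: let the kernel pick a port
server.listen(1)
dest = server.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(dest)               # no explicit bind(): source port chosen here

src_ip, src_port = client.getsockname()
print("kernel chose source port", src_port)

client.close()
server.close()
```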

&lt;p&gt;This is a little complicated by a feature of TCP called “TIME_WAIT”.  When a given connection is closed, the TCP specification declares that both ends should wait a certain amount of time before opening a new connection with the same 4-tuple.  This is to avoid the possibility that delayed packets from the first connection might be misconstrued as belonging to the second connection.&lt;/p&gt;

&lt;p&gt;Generally this TIME_WAIT waiting period lasts for only a minute or two.  In practice, this means that even if fewer than 65536 connections are currently open to a given destination IP and port, enough &lt;em&gt;recent&lt;/em&gt; connections may leave no source port available for a new one. The effective limit can be lower still, since Linux tries to select source ports randomly until it finds an available one, and with enough source ports used up, it may give up before finding a free one.&lt;/p&gt;

&lt;h2 id=&quot;port-exhaustion-in-herokus-routing-layer&quot;&gt;Port exhaustion in Heroku’s routing layer&lt;/h2&gt;

&lt;p&gt;So why would we see &lt;code class=&quot;highlighter-rouge&quot;&gt;EADDRINUSE&lt;/code&gt; in connections from the routing layer to dynos?  According to our understanding, such an error should not happen: it would mean that a single routing node had 65536 concurrent connections open to a single dyno.  That theoretical limit is far more traffic than a single dyno could ever hope to handle.&lt;/p&gt;

&lt;p&gt;We could easily see from our application traffic graphs that no dyno was coming close to this theoretical limit.  So we were left with a concerning mystery: how was it possible that we were seeing &lt;code class=&quot;highlighter-rouge&quot;&gt;EADDRINUSE&lt;/code&gt; errors?&lt;/p&gt;

&lt;p&gt;We wanted to prevent the incident from ever happening again, and so we continued to dig, taking a dive into the internals of our systems.&lt;/p&gt;

&lt;p&gt;Our routing layer is written in Erlang, and the most likely candidate was its virtual machine’s TCP calls. Digging through the VM’s network layer, we got down to the &lt;a href=&quot;https://github.com/erlang/otp/blob/b478147f81ccab9194973d40806ec12426f306a1/erts/emulator/drivers/common/inet_drv.c#L9211&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;sock_connect&lt;/code&gt; call&lt;/a&gt;, which is mostly a portable wrapper around the Linux &lt;code class=&quot;highlighter-rouge&quot;&gt;connect()&lt;/code&gt; syscall.&lt;/p&gt;

&lt;p&gt;Seeing this, nothing seemed out of place enough to cause the issue. We’d have to go deeper, into the OS itself.&lt;/p&gt;

&lt;p&gt;After digging and reading many documents, one of us noticed this bit in the now well-known blog post &lt;a href=&quot;https://idea.popcount.org/2014-04-03-bind-before-connect/&quot;&gt;Bind before connect&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Bind is usually called for listening sockets so the kernel needs to make sure that the source address is not shared with anyone else. It’s a problem. When using this techique [sic] in this form it’s impossible to establish more than 64k (ephemeral port range) outgoing connections in total. After that the attempt to call bind() will fail with an EADDRINUSE error - all the source ports will be busy.&lt;/p&gt;

  &lt;p&gt;[…]&lt;/p&gt;

  &lt;p&gt;When we call bind() the kernel knows only the source address we’re asking for. We’ll inform the kernel of a destination address only when we call connect() later.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This passage seems to be describing a special case where a client wants to make an outgoing connection with a specific source IP address.  We weren’t doing that in our Erlang code, so this still didn’t seem to fit our situation well.  But the symptoms matched so well that we decided to check for sure whether the Erlang VM was doing a &lt;code class=&quot;highlighter-rouge&quot;&gt;bind()&lt;/code&gt; call without our knowledge.&lt;/p&gt;

&lt;p&gt;We used &lt;code class=&quot;highlighter-rouge&quot;&gt;strace&lt;/code&gt; to determine the actual system call sequence being performed.  Here’s a snippet of &lt;code class=&quot;highlighter-rouge&quot;&gt;strace&lt;/code&gt; output for a connection to &lt;code class=&quot;highlighter-rouge&quot;&gt;10.11.12.13:80&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
*bind*(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr(&quot;0.0.0.0&quot;)}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr(&quot;10.11.12.13&quot;)}, 16) = 0
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;To our surprise, &lt;code class=&quot;highlighter-rouge&quot;&gt;bind()&lt;/code&gt; &lt;em&gt;was&lt;/em&gt; being called!  The socket was being bound to a &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;SOURCE-IP&amp;gt;&lt;/code&gt;:&lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;SOURCE-PORT&amp;gt;&lt;/code&gt; of &lt;code class=&quot;highlighter-rouge&quot;&gt;0.0.0.0:0&lt;/code&gt;.  Why?&lt;/p&gt;

&lt;p&gt;This instructs the kernel to bind the socket to any IP and any port.  This seemed a bit useless to us, as the kernel would already select an appropriate &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;SOURCE-IP&amp;gt;&lt;/code&gt; when &lt;code class=&quot;highlighter-rouge&quot;&gt;connect()&lt;/code&gt; was called, based on the destination IP address and the routing table.&lt;/p&gt;

&lt;p&gt;This &lt;code class=&quot;highlighter-rouge&quot;&gt;bind()&lt;/code&gt; call seemed like a no-op.  But critically, this call required the kernel to select the &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;SOURCE-PORT&amp;gt;&lt;/code&gt; &lt;em&gt;right then and there&lt;/em&gt;, without having any knowledge of the other 3 parts of the 4-tuple: &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;SOURCE-IP&amp;gt;&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;DESTINATION-IP&amp;gt;&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;DESTINATION-PORT&amp;gt;&lt;/code&gt;.  The kernel would therefore have only 65536 possible choices and might return &lt;code class=&quot;highlighter-rouge&quot;&gt;EADDRINUSE&lt;/code&gt;, as per the &lt;code class=&quot;highlighter-rouge&quot;&gt;bind()&lt;/code&gt; manpage:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;EADDRINUSE&lt;/em&gt;
(Internet  domain  sockets) The port number was specified as zero in the socket address structure, but, upon attempting to bind to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use.  See the discussion of /proc/sys/net/ipv4/ip_local_port_range in ip(7).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Unbeknownst to us, we had been operating for a very long time with a far lower tolerance threshold than expected: the ephemeral port range was effectively a cap on how much traffic each routing layer instance could handle, while we thought no such limitation existed.&lt;/p&gt;

&lt;h2 id=&quot;the-fix&quot;&gt;The Fix&lt;/h2&gt;

&lt;p&gt;Reading further in &lt;a href=&quot;https://idea.popcount.org/2014-04-03-bind-before-connect/&quot;&gt;Bind before connect&lt;/a&gt; yields the fix: just set the &lt;code class=&quot;highlighter-rouge&quot;&gt;SO_REUSEADDR&lt;/code&gt; socket option before the &lt;code class=&quot;highlighter-rouge&quot;&gt;bind()&lt;/code&gt; call.  In Erlang this is done by simply passing &lt;code class=&quot;highlighter-rouge&quot;&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;reuseaddr,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;.&lt;/p&gt;
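
&lt;p&gt;The effect of the option can be seen outside of Erlang as well. Here’s a minimal Python sketch (our illustration, not the routing layer’s code; the behavior shown is Linux’s): with &lt;code class=&quot;highlighter-rouge&quot;&gt;SO_REUSEADDR&lt;/code&gt; set on both sockets, two sockets may &lt;code class=&quot;highlighter-rouge&quot;&gt;bind()&lt;/code&gt; to the same local port, deferring the conflict check until &lt;code class=&quot;highlighter-rouge&quot;&gt;connect()&lt;/code&gt; knows the full 4-tuple:&lt;/p&gt;

```python
# Illustration of the fix (ours, not the routing layer's Erlang code):
# with SO_REUSEADDR, bind() no longer reserves the port exclusively,
# so the conflict check is deferred until connect() knows the 4-tuple.
# The behavior shown here is Linux's.
import errno
import socket

s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s1.bind(("127.0.0.1", 0))          # kernel assigns some free port
port = s1.getsockname()[1]

# Without SO_REUSEADDR, a second bind() to the same port fails...
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("127.0.0.1", port))
    second_bind_failed = False
except OSError as exc:
    second_bind_failed = exc.errno == errno.EADDRINUSE

# ...but with SO_REUSEADDR set on both sockets, it succeeds.
s3 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s3.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s3.bind(("127.0.0.1", port))

print("second bind without SO_REUSEADDR failed:", second_bind_failed)
print("second bind with SO_REUSEADDR bound port", s3.getsockname()[1])
```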

&lt;p&gt;At this point we thought we had our answer, but we had to be sure.  We decided to test it.&lt;/p&gt;

&lt;p&gt;We first wrote a small C program that exercised the current limit:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-clike&quot;&gt;#include &amp;lt;sys/types.h&amp;gt;
#include &amp;lt;sys/socket.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;arpa/inet.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;

int main(int argc, char **argv) {
  /* usage: ./connect_with_bind &amp;lt;num&amp;gt; &amp;lt;dest1&amp;gt; &amp;lt;dest2&amp;gt; ... &amp;lt;destN&amp;gt;
   *
   * Opens &amp;lt;num&amp;gt; connections to port 80, round-robining between the specified
   * destination IPs.  Then it opens the same number of connections to port
   * 443.
   */

  int i;
  int fds[131072];
  struct sockaddr_in sin;
  struct sockaddr_in dest;

  memset(&amp;amp;sin, 0, sizeof(struct sockaddr_in));

  sin.sin_family = AF_INET;
  sin.sin_port = htons(0);  // source port 0 (kernel picks one)
  sin.sin_addr.s_addr = htonl(INADDR_ANY);  // source IP 0.0.0.0

  for (i = 0; i &amp;lt; atoi(argv[1]); i++) {
    memset(&amp;amp;dest, 0, sizeof(struct sockaddr_in));
    dest.sin_family = AF_INET;
    dest.sin_port = htons(80);

    // round-robin between the destination IPs specified
    dest.sin_addr.s_addr = inet_addr(argv[2 + i % (argc - 2)]);

    fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    bind(fds[i], (struct sockaddr *)&amp;amp;sin, sizeof(struct sockaddr_in));
    connect(fds[i], (struct sockaddr *)&amp;amp;dest, sizeof(struct sockaddr_in));
  }

  sleep(5);

  fprintf(stderr, &quot;GOING TO START CONNECTING TO PORT 443\n&quot;);

  for (i = 0; i &amp;lt; atoi(argv[1]); i++) {
    memset(&amp;amp;dest, 0, sizeof(struct sockaddr_in));
    dest.sin_family = AF_INET;
    dest.sin_port = htons(443);
    dest.sin_addr.s_addr = inet_addr(argv[2 + i % (argc - 2)]);

    fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    bind(fds[i], (struct sockaddr *)&amp;amp;sin, sizeof(struct sockaddr_in));
    connect(fds[i], (struct sockaddr *)&amp;amp;dest, sizeof(struct sockaddr_in));
  }

  sleep(5);
  return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We increased our file descriptor limit and ran this program as follows:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;./connect_with_bind 65536 10.11.12.13 10.11.12.14 10.11.12.15&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This program attempted to open 65536 connections to port 80 on the three IPs specified.  Then it attempted to open another 65536 connections to port 443 on the same IPs.  If only the 4-tuple were in play, we should be able to open all of these connections without any problem.&lt;/p&gt;

&lt;p&gt;We ran the program under &lt;code class=&quot;highlighter-rouge&quot;&gt;strace&lt;/code&gt; while monitoring &lt;code class=&quot;highlighter-rouge&quot;&gt;ss -s&lt;/code&gt; for connection counts.  As expected, we began seeing &lt;code class=&quot;highlighter-rouge&quot;&gt;EADDRINUSE&lt;/code&gt; errors from &lt;code class=&quot;highlighter-rouge&quot;&gt;bind()&lt;/code&gt;.  In fact, we saw these errors even before we’d opened 65536 connections.  The Linux kernel does source port allocation by randomly selecting a candidate port and then checking the N following ports until it finds an available port.  This is an optimization to prevent it from having to scan all 65536 possible ports for each connection.&lt;/p&gt;
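
&lt;p&gt;To build intuition for why failures appear before the range is fully exhausted, here’s a toy model in Python (our own simplification, not the kernel’s actual code) of the “random candidate plus limited scan” strategy described above:&lt;/p&gt;

```python
# A toy model (our simplification, not the kernel's actual code) of the
# strategy described above: pick a random candidate port, scan only a few
# ports after it, and give up if they are all taken.
import random

PORTS = 65536          # size of the port space in this model
SCAN_WINDOW = 32       # hypothetical limit on how far the scan looks

def allocate(used, rng):
    start = rng.randrange(PORTS)
    for i in range(SCAN_WINDOW):
        candidate = (start + i) % PORTS
        if candidate not in used:
            return candidate
    return None        # gave up: looks like port exhaustion

rng = random.Random(42)
used = set(rng.sample(range(PORTS), 60000))   # heavily, but not fully, used
failures = sum(allocate(used, rng) is None for _ in range(1000))
print("failed allocations out of 1000:", failures)
```

Even though thousands of ports remain free, a limited scan from a random starting point occasionally sees only occupied ports and reports exhaustion.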

&lt;p&gt;Once that baseline was established, we added the &lt;code class=&quot;highlighter-rouge&quot;&gt;SO_REUSEADDR&lt;/code&gt; socket option.  Here are the changes we made:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gd&quot;&gt;--- connect_with_bind.c	2016-12-22 10:29:45.916723406 -0500
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+++ connect_with_bind_and_reuse.c	2016-12-22 10:31:54.452322757 -0500
&lt;/span&gt;&lt;span class=&quot;gu&quot;&gt;@@ -17,6 +17,7 @@
&lt;/span&gt;   int fds[131072];
   struct sockaddr_in sin;
   struct sockaddr_in dest;
&lt;span class=&quot;gi&quot;&gt;+  int one = 1;
&lt;/span&gt;
   memset(&amp;amp;sin, 0, sizeof(struct sockaddr_in));

&lt;span class=&quot;gu&quot;&gt;@@ -33,6 +34,7 @@
&lt;/span&gt;     dest.sin_addr.s_addr = inet_addr(argv[2 + i % (argc - 2)]);

     fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
&lt;span class=&quot;gi&quot;&gt;+    setsockopt(fds[i], SOL_SOCKET, SO_REUSEADDR, &amp;amp;one, sizeof(int));
&lt;/span&gt;     bind(fds[i], (struct sockaddr *)&amp;amp;sin, sizeof(struct sockaddr_in));
     connect(fds[i], (struct sockaddr *)&amp;amp;dest, sizeof(struct sockaddr_in));
   }
&lt;span class=&quot;gu&quot;&gt;@@ -48,6 +50,7 @@
&lt;/span&gt;     dest.sin_addr.s_addr = inet_addr(argv[2 + i % (argc - 2)]);

     fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
&lt;span class=&quot;gi&quot;&gt;+    setsockopt(fds[i], SOL_SOCKET, SO_REUSEADDR, &amp;amp;one, sizeof(int));
&lt;/span&gt;     bind(fds[i], (struct sockaddr *)&amp;amp;sin, sizeof(struct sockaddr_in));
     connect(fds[i], (struct sockaddr *)&amp;amp;dest, sizeof(struct sockaddr_in));
   }
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;We ran it like this:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;./connect_with_bind_and_reuse 65536 10.11.12.13 10.11.12.14 10.11.12.15&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Our expectation was that &lt;code class=&quot;highlighter-rouge&quot;&gt;bind()&lt;/code&gt; would stop returning &lt;code class=&quot;highlighter-rouge&quot;&gt;EADDRINUSE&lt;/code&gt;.  The new program confirmed this fairly rapidly, and showed us once more that there can be quite a gap between theory and practice.&lt;/p&gt;

&lt;p&gt;Knowing this, all we had to do was confirm that the &lt;code class=&quot;highlighter-rouge&quot;&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;reuseaddr,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt; option would work on the Erlang side, and a quick strace of a node performing the call confirmed that the appropriate &lt;code class=&quot;highlighter-rouge&quot;&gt;setsockopt()&lt;/code&gt; call was being made.&lt;/p&gt;

&lt;h2 id=&quot;giving-back&quot;&gt;Giving Back&lt;/h2&gt;

&lt;p&gt;It was quite an eye-opening experience to discover this unexpected connection limitation in our routing layer.  &lt;a href=&quot;https://github.com/heroku/vegur/commit/74c35003609ad48712f54c90c2b7336270135817&quot;&gt;The patch to Vegur&lt;/a&gt;, our open-sourced HTTP proxy library, was deployed a couple of days later, preventing this issue from ever biting us again.&lt;/p&gt;

&lt;p&gt;We hope that by sharing our experience here, we might save you from similar bugs in your systems.&lt;/p&gt;
</description>
        <pubDate>Thu, 30 Mar 2017 15:00:00 +0000</pubDate>
        <link>http://engineering.heroku.com/blogs/2017-03-30-sockets-in-a-bind/</link>
        <guid isPermaLink="true">http://engineering.heroku.com/blogs/2017-03-30-sockets-in-a-bind/</guid>

        

        

      </item>
    
      <item>
        <title>How We Found and Fixed a Filesystem Corruption Bug</title>
        <description>&lt;p&gt;As part of our commitment to security and support, we periodically upgrade the stack image, so that we can install updated package versions, address security vulnerabilities, and add new packages to the stack. Recently we had an incident during which some applications running on the &lt;a href=&quot;https://devcenter.heroku.com/articles/stack#cedar&quot;&gt;Cedar-14 stack image&lt;/a&gt; experienced higher than normal rates of segmentation faults and other “hard” crashes &lt;a href=&quot;https://status.heroku.com/incidents/1004&quot;&gt;for about five hours&lt;/a&gt;. Our engineers tracked down the cause of the error to corrupted dyno filesystems caused by a failed stack upgrade. The sequence of events leading up to this failure, and the technical details of the failure, are unique, and worth exploring.&lt;/p&gt;

&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;Heroku runs application processes in &lt;a href=&quot;https://devcenter.heroku.com/articles/dynos#dynos&quot;&gt;dynos&lt;/a&gt;, which are lightweight Linux containers, each with its own, isolated filesystem. Our runtime system composes the container’s filesystem from a number of mount points. Two of these mount points are particularly critical: the &lt;code class=&quot;highlighter-rouge&quot;&gt;/app&lt;/code&gt; mount point, which contains &lt;a href=&quot;https://devcenter.heroku.com/articles/platform-api-deploying-slugs&quot;&gt;a read-write copy of the application&lt;/a&gt;, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;/&lt;/code&gt; mount point, which contains the container’s stack image, a prepared filesystem with a complete Ubuntu installation. The stack image provides applications running on Heroku dynos with a familiar Linux environment and a &lt;a href=&quot;https://devcenter.heroku.com/articles/stack-packages&quot;&gt;predictable list of native packages&lt;/a&gt;. Critically, the stack image is mounted read-only, so that we can safely reuse it for every dyno running on the same stack on the same host.&lt;/p&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;&lt;img src=&quot;/assets/images/filesystem-corruption/runtime-host.png&quot; alt=&quot;Heroku Filesystem Diagram&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Given the large number of customer dynos we host, the stack upgrade process is almost entirely automated, and it’s designed so that a new stack image can be deployed without interfering with running dynos, so that our users aren’t exposed to downtime on our behalf. We perform this live upgrade by downloading a disk image of the stack to each dyno host and then reconfiguring each host so that newly-started dynos will use the new image. We write the newly-downloaded image directly to the data directory our runtime tools use to find images to mount, so the deployment process includes safety checks, based on checksum files, that automatically and safely skip the download if the image is already present on the host.&lt;/p&gt;

&lt;h2 id=&quot;root-causes&quot;&gt;Root Causes&lt;/h2&gt;

&lt;p&gt;Near the start of December, we upgraded our container tools. This included changing the digest algorithms and filenames used by these safety checks. We also introduced a latent bug: the new version of our container tools didn’t consider the checksum files produced by previous versions. They would happily install any disk image, even one that was already present, as long as the image had not yet been installed under the new tools.&lt;/p&gt;

&lt;p&gt;We don’t often re-deploy an existing version of a stack image, so this defect might have gone unnoticed and would eventually have become irrelevant. We rotate hosts out of our runtime fleet and replace them with fresh hosts constantly, and the initial setup of a fresh host downloads the stack image using the same tools we use to roll out upgrades, which would have protected those hosts from the defect. Unfortunately, this defect coincided with a second, unrelated problem. Several days after the container tools upgrade, one of our engineers attempted to roll out &lt;a href=&quot;https://devcenter.heroku.com/changelog-items/1068&quot;&gt;an upgrade to the stack image&lt;/a&gt;. Issues during this upgrade meant that we had to abort the upgrade, and our standard procedure to ensure that all container hosts are running the same version when we abort an upgrade involves redeploying the original version of the container.&lt;/p&gt;

&lt;p&gt;During redeployment, the safety check preventing our tools from overwriting existing images failed, and our container tools truncated and overwrote the disk image file while it was still mounted in running dynos as the &lt;code class=&quot;highlighter-rouge&quot;&gt;/&lt;/code&gt; filesystem.&lt;/p&gt;

&lt;h2 id=&quot;technical-impact&quot;&gt;Technical Impact&lt;/h2&gt;

&lt;p&gt;The Linux kernel expects that a given volume, whether it’s backed by a disk or a file, will go through the filesystem abstraction whenever the volume is mounted. Reads and writes that go through the filesystem are cached for future accesses, and the kernel enforces consistency guarantees like “creating a file is an atomic operation” through those APIs. Writing directly to the volume bypasses all of these mechanisms, completely, and (in true Unix fashion) the kernel is more than happy to let you do it.&lt;/p&gt;

&lt;p&gt;During the incident, the most relevant consequence for Heroku apps involved the filesystem cache: by truncating the disk image, we’d accidentally ensured that reads from the image would return no data, while reads through the filesystem cache would return data from the previously-present filesystem image. There’s very little predictability to which pages will be in the filesystem cache, so the most common effect on applications was that newly-loaded programs would partially load from the cache and partially load from the underlying disk image, which was still mid-download. The resulting corrupted programs crashed, often with a segmentation fault, the first time they executed an instruction that attempted to read any of the missing data, or the first time they executed an instruction that had, itself, been damaged.&lt;/p&gt;

&lt;p&gt;During the incident, our response lead put together a small example to verify the effects we were seeing. If you have a virtual machine handy, you can reproduce the problem yourself, without all of our container infrastructure. (Unfortunately, a Docker container won’t cut it: you need something that can create new mount points.)&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Create a disk image with a simple program on it. We used &lt;code class=&quot;highlighter-rouge&quot;&gt;sleep&lt;/code&gt;.&lt;/p&gt;

    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; dd if=/dev/zero of=demo.img bs=1024 count=10240
 mkfs -F -t ext4 demo.img
 sudo mkdir -p /mnt/demo
 sudo mount -o loop demo.img /mnt/demo
 sudo cp -a /bin/sleep /mnt/demo/sleep
 sudo umount /mnt/demo
&lt;/code&gt;&lt;/pre&gt;
    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Make a copy of the image, which we’ll use later to simulate downloading the image:&lt;/p&gt;

    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; cp -a demo.img backup.img
&lt;/code&gt;&lt;/pre&gt;
    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Mount the original image, as a read-only filesystem:&lt;/p&gt;

    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; sudo mount -o loop,ro demo.img /mnt/demo
&lt;/code&gt;&lt;/pre&gt;
    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In one terminal, start running the test program in a loop:&lt;/p&gt;

    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; while /mnt/demo/sleep 1; do
     :
 done
&lt;/code&gt;&lt;/pre&gt;
    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In a second terminal, replace the disk image out from underneath the program by truncating and rewriting it from the backup copy:&lt;/p&gt;

    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; while cat backup.img &amp;gt; demo.img; do
     # flush filesystem caches so that pages are re-read
     echo 3 | sudo tee /proc/sys/vm/drop_caches &amp;gt; /dev/null
 done
&lt;/code&gt;&lt;/pre&gt;
    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Reliably, &lt;code class=&quot;highlighter-rouge&quot;&gt;sleep&lt;/code&gt; will crash, with &lt;code class=&quot;highlighter-rouge&quot;&gt;Segmentation fault (core dumped)&lt;/code&gt;. This is exactly the error that affected customer applications.&lt;/p&gt;

&lt;p&gt;This problem caught us completely by surprise. While we had taken into account that overwriting a mounted image would cause problems, none of us fully understood what those problems would be. While both our monitoring systems and our internal userbase alerted us to the problem quickly, neither was able to offer much insight into the root cause. Application crashes are part of the normal state of our platform, and while an increase in crashes is a warning sign we take seriously, it doesn’t correlate with any specific causes. We were also hampered by our belief that our deployment process for stack image upgrades was designed not to modify existing filesystem images.&lt;/p&gt;

&lt;h2 id=&quot;the-fix&quot;&gt;The Fix&lt;/h2&gt;

&lt;p&gt;Once we identified the problem, we migrated all affected dynos to fresh hosts, with non-corrupted filesystems and with coherent filesystem caches. This work took the majority of the five hours during which the incident was open.&lt;/p&gt;

&lt;p&gt;In response to this incident, we now mark filesystem images as read-only on the &lt;em&gt;host&lt;/em&gt; filesystem once they’re installed. We’ve re-tested this, under the conditions that led to the original incident, and we’re confident that this will prevent this and any overwriting-related problems in the future.&lt;/p&gt;

&lt;p&gt;We care deeply about managing security, platform maintenance, and other container orchestration tasks so that your apps “just work,” and we’re confident that these changes make our stack management even more robust.&lt;/p&gt;
</description>
        <pubDate>Wed, 15 Feb 2017 16:50:00 +0000</pubDate>
        <link>http://engineering.heroku.com/blogs/2017-02-15-filesystem-corruption-on-heroku-dynos/</link>
        <guid isPermaLink="true">http://engineering.heroku.com/blogs/2017-02-15-filesystem-corruption-on-heroku-dynos/</guid>

        

        

      </item>
    
      <item>
        <title>Pulling the Thread on Kafka's Compacted Topics</title>
        <description>&lt;p&gt;At Heroku, we’re always working towards improving operational stability with the services we offer. As we recently launched &lt;a href=&quot;https://www.heroku.com/kafka&quot;&gt;Apache Kafka on Heroku&lt;/a&gt;, we’ve been increasingly focused on hardening Apache Kafka, as well as our automation around it. This particular improvement in stability concerns Kafka’s compacted topics, which we haven’t talked about before. Compacted topics are a powerful and important feature of Kafka, and as of 0.9, provide the capabilities supporting a number of important features.&lt;/p&gt;

&lt;h2 id=&quot;meet-the-bug&quot;&gt;Meet the Bug&lt;/h2&gt;

&lt;p&gt;The bug we had been seeing was that an internal thread used by Kafka to implement compacted topics (which we’ll explain more of shortly) could die in certain use cases, without any notification. This leads to long-term failures and instability of the Kafka cluster, as old data isn’t cleared up as expected.&lt;/p&gt;

&lt;p&gt;To set the stage for the changes we made and the deeper explanation of the bug, we’ll briefly cover what log compaction is and how it works.&lt;/p&gt;

&lt;h2 id=&quot;just-what-are-compacted-topics-anyway&quot;&gt;Just What Are Compacted Topics Anyway?&lt;/h2&gt;

&lt;p&gt;In the default case, Kafka topics are a stream of messages:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[ a:1, b:2, c:3 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;There is no way to change or delete previous messages, except that messages that are too old get deleted after a specified “retention time.”&lt;/p&gt;

&lt;p&gt;Compacted topics provide a very different type of stream, maintaining only the most recent message for a given key. This produces something like a materialized or table-like view of a stream, with up-to-date values for everything in the key space. These compacted topics work by assigning each message a “key” (a simple Java &lt;code class=&quot;highlighter-rouge&quot;&gt;byte[]&lt;/code&gt;); Kafka periodically tombstones or deletes messages in the topic whose keys have been superseded, or applies a time-based retention window. This tombstoning of repeated keys gives you a sort of eventual consistency, with the implication that duplicate messages or keys may be present before the cleaning or compaction process has completed.&lt;/p&gt;

&lt;p&gt;While this doesn’t give you real infinite storage – you now have to care about what keys you assign and how big the space of keys grows – it’s a useful primitive for many systems.&lt;/p&gt;
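&lt;p&gt;To make the “most recent message per key” idea concrete, here is a tiny illustrative sketch in Python (hypothetical, and nothing like Kafka’s actual implementation, which works on on-disk segments):&lt;/p&gt;

```python
def compact(messages):
    """Keep only the most recent value for each key.

    Keys appear in the order they were first written, since a
    Python dict preserves insertion order.
    """
    latest = {}  # key -> most recent value; later writes win
    for key, value in messages:
        latest[key] = value
    return list(latest.items())

# A stream with a repeated key...
stream = [("a", 1), ("b", 2), ("a", 3)]
# ...compacts down to the last write per key.
print(compact(stream))  # [('a', 3), ('b', 2)]
```

&lt;p&gt;Real compaction also has to cope with on-disk segments, crash safety, and memory limits, which the rest of this post walks through.&lt;/p&gt;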

&lt;p&gt;At Heroku we use compacted topics pretty sparingly. They’re a much more special-purpose tool than regular Kafka topics. The largest user is the team that works on Heroku’s Metrics feature, where they power &lt;a href=&quot;https://devcenter.heroku.com/articles/metrics#threshold-alerting&quot;&gt;Threshold Alerts&lt;/a&gt;. &lt;a href=&quot;https://www.heroku.com/connect&quot;&gt;Heroku Connect&lt;/a&gt; is also starting to use them.&lt;/p&gt;

&lt;p&gt;Even when end users aren’t taking advantage of compacted topics, Kafka makes extensive use of them internally: they persist and track which offsets consumers and consumer groups have processed. This makes them an essential part of the codebase, so the reliability of compacted topics matters a lot.&lt;/p&gt;

&lt;h2 id=&quot;how-do-compacted-topics-really-work&quot;&gt;How Do Compacted Topics Really Work?&lt;/h2&gt;

&lt;p&gt;Given the goal of “removing duplicate keys”, how does Kafka go about implementing this? There are a few important elements. First is that, on disk, Kafka breaks messages up into “segments”, which are plain files:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;my_topic_my_partition_1: [ a:1, b:2, c:3]
my_topic_my_partition_4: [ a:4, b:5 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This notation uses &lt;code class=&quot;highlighter-rouge&quot;&gt;key:offset&lt;/code&gt; to represent a message, as these are the primary attributes being manipulated for this task. Compaction doesn’t care about message values, except that the most recent value for each key is preserved.&lt;/p&gt;

&lt;p&gt;Secondly, a periodic process –  the log cleaner thread – comes along and removes messages with duplicate keys. It does this by deleting duplicates only for &lt;em&gt;new&lt;/em&gt; messages that have arrived since the last compaction. This leads to a nice tradeoff where Kafka only requires a relatively small amount of memory to remove duplicates from a large amount of data.&lt;/p&gt;

&lt;p&gt;The cleaner runs in two phases. In phase 1, it builds an “offset map”, from keys to the latest offset for that key. This offset map is only built for “new” messages - the log cleaner marks where it got to when it finished. In phase 2, it starts from the beginning of the log, and rewrites one segment at a time, removing any message which has a lower offset than the message with that key in the offset map.&lt;/p&gt;
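&lt;p&gt;A simplified sketch of the two phases (in Python for illustration; this assumes segments are lists of &lt;code class=&quot;highlighter-rouge&quot;&gt;key:offset&lt;/code&gt; pairs, and ignores values, crash safety, and file handling):&lt;/p&gt;

```python
def build_offset_map(segments, start_offset):
    """Phase 1: map each key to its latest offset, scanning only
    messages at or after start_offset (where the last clean ended)."""
    offset_map = {}
    for segment in segments:
        for key, offset in segment:
            if offset >= start_offset:
                offset_map[key] = offset
    return offset_map

def clean(segments, offset_map):
    """Phase 2: rewrite each segment from the start of the log,
    dropping messages superseded by a newer offset for their key."""
    cleaned = []
    for segment in segments:
        kept = [(key, offset) for key, offset in segment
                if offset_map.get(key, -1) <= offset]
        if kept:  # real Kafka also merges small segments together
            cleaned.append(kept)
    return cleaned

segments = [[("a", 1), ("b", 2), ("c", 3)],
            [("a", 4), ("b", 5)]]
offset_map = build_offset_map(segments, start_offset=0)
print(offset_map)                   # {'a': 4, 'b': 5, 'c': 3}
print(clean(segments, offset_map))  # [[('c', 3)], [('a', 4), ('b', 5)]]
```

&lt;p&gt;The worked example below traces these same two phases through the segments by hand.&lt;/p&gt;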

&lt;h3 id=&quot;phase-1&quot;&gt;Phase 1&lt;/h3&gt;

&lt;p&gt;Data:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;my_topic_my_partition_1: [ a:1, b:2, c:3]
my_topic_my_partition_4: [ a:4, b:5 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Offset map produced:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
    a:4,
    b:5,
    c:3
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h3 id=&quot;phase-2&quot;&gt;Phase 2&lt;/h3&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;my_topic_my_partition_1: [ a:1, b:2, c:3]
my_topic_my_partition_4: [ a:4, b:5 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Breaking this cleaning down message-by-message:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;For the first message, &lt;code class=&quot;highlighter-rouge&quot;&gt;a:1&lt;/code&gt;, Kafka looks in the offset map, finds &lt;code class=&quot;highlighter-rouge&quot;&gt;a:4&lt;/code&gt;, so it doesn’t keep this message.&lt;/li&gt;
  &lt;li&gt;For the second message, &lt;code class=&quot;highlighter-rouge&quot;&gt;b:2&lt;/code&gt;, Kafka looks in the offset map, finds &lt;code class=&quot;highlighter-rouge&quot;&gt;b:5&lt;/code&gt;, so it doesn’t keep this message.&lt;/li&gt;
  &lt;li&gt;For the third message, &lt;code class=&quot;highlighter-rouge&quot;&gt;c:3&lt;/code&gt;, Kafka looks in the offset map, and finds no newer message, so it keeps this message.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s the end of the first segment, so the output is:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;my_topic_my_partition_1: [ c:3 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Then we clean the second segment:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;For the first message in the second segment, &lt;code class=&quot;highlighter-rouge&quot;&gt;a:4&lt;/code&gt;, Kafka looks in the offset map, finds &lt;code class=&quot;highlighter-rouge&quot;&gt;a:4&lt;/code&gt;, and so it keeps this message.&lt;/li&gt;
  &lt;li&gt;For the second message in the second segment, &lt;code class=&quot;highlighter-rouge&quot;&gt;b:5&lt;/code&gt;, Kafka looks in the offset map, finds &lt;code class=&quot;highlighter-rouge&quot;&gt;b:5&lt;/code&gt;, and so it keeps this message.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s the end of this segment, so the output is:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;my_topic_my_partition_4: [ a:4, b:5 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;So now, we’ve cleaned up to the end of the topic. We’ve elided a few details here. For example, Kafka has a relatively complex protocol that enables rewriting whole topics in a crash-safe way. Kafka also never builds an offset map for the latest segment in the log; that segment sees a lot of new messages, so there’s no sense in continually recompacting using it. There are also optimizations that merge small log segments into larger files, which avoids littering the filesystem with lots of small files. The last part of the puzzle is that Kafka records the highest offset, for any key, up to which it last built the offset map. In this case, offset 5.&lt;/p&gt;

&lt;p&gt;Let’s see what happens when we add some more messages (again ignoring the fact that Kafka never compacts using the last segment). In this case &lt;code class=&quot;highlighter-rouge&quot;&gt;c:6&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;a:7&lt;/code&gt; are the new messages:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;my_topic_my_partition_1: [ c:3 ]
my_topic_my_partition_4: [ a:4, b:5 ]
my_topic_my_partition_6: [ c:6, a:7 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h3 id=&quot;phase-1-1&quot;&gt;Phase 1&lt;/h3&gt;

&lt;p&gt;Build the offset map:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
    a:7,
    c:6
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Note well that the offset map doesn’t include &lt;code class=&quot;highlighter-rouge&quot;&gt;b:5&lt;/code&gt;! We already built the offset map up to that message in the previous clean, and our new offset map doesn’t include a message with the key &lt;code class=&quot;highlighter-rouge&quot;&gt;b&lt;/code&gt; at all. This means the compaction process can use much less memory than you’d expect to remove duplicates.&lt;/p&gt;

&lt;h3 id=&quot;phase-2-1&quot;&gt;Phase 2&lt;/h3&gt;

&lt;p&gt;Clean the log:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;my_topic_my_partition_4: [ b:5 ]
my_topic_my_partition_6: [ c:6, a:7 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h2 id=&quot;what-is-the-bug-again&quot;&gt;What is the bug again?&lt;/h2&gt;

&lt;p&gt;Prior to the most recent version of Kafka, the offset map &lt;em&gt;had&lt;/em&gt; to keep a whole segment in memory. This simplified some internal accounting, but caused pretty gnarly problems. The default settings allow log segments to grow up to 1GB of data, which at a very small message size can overwhelm the offset map with the sheer number of keys. When the cleaner ran out of space in the offset map without fitting in a full segment, an assertion fired and the thread crashed.&lt;/p&gt;

&lt;p&gt;What makes this especially bad is Kafka’s handling of the thread crash: there’s no notification to an operator, and the process itself carries on running. This violates the good fundamental principle that if you’re going to fail, you should fail loudly and publicly.&lt;/p&gt;

&lt;p&gt;With a broker running &lt;em&gt;without&lt;/em&gt; this thread over the long term, data that is meant to be compacted grows and grows. This threatens the stability of the node and, if the problem spreads to other nodes, of the whole cluster.&lt;/p&gt;

&lt;h2 id=&quot;what-is-the-fix&quot;&gt;What is the fix?&lt;/h2&gt;

&lt;p&gt;The fix was relatively simple, and a common theme in software: “stop doing that bad thing”. After spending quite some time understanding the compaction behavior (as explained above), the code change was a simple &lt;a href=&quot;https://github.com/apache/kafka/pull/1725&quot;&gt;100 line patch&lt;/a&gt;. With the fix, Kafka &lt;em&gt;doesn’t&lt;/em&gt; try to fit a whole segment in the offset map; instead, it can record “I got partway through a log segment when building the map” and carry on from there.&lt;/p&gt;

&lt;p&gt;The first step was to &lt;a href=&quot;https://github.com/apache/kafka/pull/1725/files#diff-d7330411812d23e8a34889bee42fedfeL614&quot;&gt;remove the assertion that caused the log cleaner thread to die&lt;/a&gt;. Then, we reworked the internal tracking such that we can record a partial segment load and recover from that point.&lt;/p&gt;
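&lt;p&gt;Conceptually, the reworked tracking looks something like this (a hypothetical Python sketch, not the actual patch): instead of asserting that a whole segment fits, the map-building step stops when the map is full and reports the offset to resume from.&lt;/p&gt;

```python
def load_segment_into_map(segment, offset_map, capacity):
    """Fill offset_map from a segment of (key, offset) pairs.

    The old behavior asserted the whole segment fit in the map and
    crashed otherwise; here we stop when the map is full and return
    the offset at which the next cleaning pass should resume.
    """
    for key, offset in segment:
        if key not in offset_map and len(offset_map) >= capacity:
            return offset  # partial load: resume here next time
        offset_map[key] = offset
    return None  # the whole segment fit

offset_map = {}
resume_at = load_segment_into_map(
    [("a", 1), ("b", 2), ("c", 3)], offset_map, capacity=2)
print(offset_map, resume_at)  # {'a': 1, 'b': 2} 3
```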

&lt;p&gt;The outcome now is that the log cleaner thread doesn’t die silently. This was a huge stress reliever for us - we’ve seen this happen in production multiple times, and recovering from it is quite tricky.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Working with the Kafka community on this bug was a great experience. We filed a &lt;a href=&quot;https://issues.apache.org/jira/browse/KAFKA-3894&quot;&gt;Jira ticket&lt;/a&gt; and talked through potential solutions. After a short while, &lt;a href=&quot;https://issues.apache.org/jira/browse/KAFKA-3894?focusedCommentId=15406970&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15406970&quot;&gt;Jun Rao and Jay Kreps&lt;/a&gt; had a suggested solution, which was what we implemented. After some back and forth with code review, the patch was committed and made it into the latest release of Kafka.&lt;/p&gt;

&lt;p&gt;This fix is in Kafka 0.10.1.1, which is now available and the default version on Heroku. You can provision a new cluster like so:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ heroku addons:create heroku-kafka
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;For existing customers, you can upgrade to this release of Kafka like so:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ heroku kafka:upgrade heroku-kafka --version 0.10
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
        <pubDate>Wed, 11 Jan 2017 15:00:00 +0000</pubDate>
        <link>http://engineering.heroku.com/blogs/2017-01-11-debugging-kafka-compacted-topics/</link>
        <guid isPermaLink="true">http://engineering.heroku.com/blogs/2017-01-11-debugging-kafka-compacted-topics/</guid>

        

        

      </item>
    
      <item>
        <title>How We Sped up SNI TLS Handshakes by 5x</title>
        <description>&lt;p&gt;During the development of the recently released &lt;a href=&quot;https://devcenter.heroku.com/articles/ssl&quot;&gt;Heroku SSL&lt;/a&gt; feature, a lot of work was carried out to stabilize the system and improve its speed. In this post, I will explain how we managed to improve the speed of our TLS handshakes by 4-5x.&lt;/p&gt;

&lt;p&gt;The initial reports of speed issues were sent our way by beta customers who were unhappy about the low level of performance. This was understandable: after all, we were not greenfielding a solution for which nothing existed, but actively trying to provide an alternative to the SSL Endpoint add-on, which is provided by a dedicated team working on elastic load balancers at AWS. At the same time, we also had to figure out how many more routing instances we would need to absorb the CPU load once a major part of our traffic was no longer simple HTTP, but HTTP + TLS.&lt;/p&gt;

&lt;h2 id=&quot;detecting-the-problem&quot;&gt;Detecting the Problem&lt;/h2&gt;

&lt;p&gt;The simplest way to approach both of these questions is through benchmarking. So we set up some simple benchmarks that used TLS with no session resumption or caching, on HTTP requests that were extremely small with no follow-up requests coming over the connection. The objective was to specifically exercise handshakes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;TLS handshakes are more costly than just encrypting data over a well-established connection.&lt;/li&gt;
  &lt;li&gt;HTTP keep-alive requests reuse the connection and the first one is technically more expensive (since it incurs the cost of the handshake), so we disabled them to always have the most expensive thing happening.&lt;/li&gt;
  &lt;li&gt;Small amounts of data to encrypt mean that the weight of the handshake dominates the overall amount of time taken for each query. We wanted more handshakes and less of everything else.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under such benchmarks, we found that our nodes became slow and unresponsive at a depressingly rapid rate, almost 7-8 times earlier than with the same test over HTTP. This is a far cry from reported overheads of &lt;a href=&quot;https://istlsfastyet.com/&quot;&gt;1%-2% in places like Google&lt;/a&gt;, although we purposefully went with a very pessimistic pattern. A more realistic pattern would likely have had lower overhead.&lt;/p&gt;

&lt;p&gt;This could easily have been a call for panic. We had used the Erlang/OTP SSL libraries to serve our traffic. While there’s some added safety in having a major part of your stack not written in C and not dependent on OpenSSL, which has recently experienced several &lt;a href=&quot;https://en.wikipedia.org/wiki/OpenSSL#Notable_vulnerabilities&quot;&gt;notable vulnerabilities&lt;/a&gt;, we did run the risk of much worse performance. To be clear, the Erlang SSL library does use OpenSSL bindings for all of the cryptographic functionality, but uses Erlang for the protocol-level parts of the implementation, such as running state machines. The library has gone through independent audit and testing.&lt;/p&gt;

&lt;p&gt;We have a team of people who know Erlang pretty well and are able to profile and debug it, so we decided to see if we could resolve the performance issue by just tweaking standard configurations. During initial benchmarking, we found that bandwidth and packet counts were very low, and memory usage was as expected. CPU usage was fairly low (~30%) but did tend to jump around quite a bit.&lt;/p&gt;

&lt;p&gt;For us, this pointed to a standard kind of bottleneck issue where you see flapping as some work gets done in a batch, then the process waits, then does more work. However, from our internal tools, we could see very little that was not just SSL doing work.&lt;/p&gt;

&lt;p&gt;Eventually, we used &lt;code class=&quot;highlighter-rouge&quot;&gt;perf top&lt;/code&gt; to look at things, which is far less invasive than most tooling out there (if you’re interested, you should take a look at &lt;a href=&quot;http://www.erlang-factory.com/euc2016/julian-squires&quot;&gt;Julian Squires’ talk at the Erlang User Conference 2016 on this specific topic&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The thing that immediately jumped out at us was that a bunch of functions were taking far more time than we’d expect. Specifically, the following results would show up:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/heroku-blog/pem-cache/dying-node-2-fs8.png&quot; alt=&quot;perf top output for the node&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;copy_shallow&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;do_minor&lt;/code&gt; functions are related to garbage collection operations within the Erlang VM. They can be high in regular circumstances, but here they were much, much higher than expected. In fact, GC was taking more time than actual crypto work! The other thing that took more time than crypto was the &lt;code class=&quot;highlighter-rouge&quot;&gt;db_next_hash&lt;/code&gt; function, which was a bit funny.&lt;/p&gt;

&lt;p&gt;We looked some more, and as more samples came out, the pattern emerged:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/heroku-blog/pem-cache/dying-node-1-fs8.png&quot; alt=&quot;perf top output for the node&quot; /&gt;&lt;/p&gt;

&lt;p&gt;CPU time would flap a lot between a lot of garbage collection operations and a lot of &lt;code class=&quot;highlighter-rouge&quot;&gt;db_*_hash&lt;/code&gt; operations, whereas given the benchmark, we would have expected &lt;code class=&quot;highlighter-rouge&quot;&gt;libcrypto.so&lt;/code&gt; to do the most work.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;db_*_hash&lt;/code&gt; operations are specifically related to something called ETS (Erlang Term Storage) tables. ETS tables are an efficient in-memory database included with the Erlang virtual machine. They sit in a part of the virtual machine where destructive updates are allowed and where garbage collection dares not approach. They’re generally fast, and a pretty easy way for Erlang programmers to optimize some of their code when parts of it get too slow.&lt;/p&gt;

&lt;p&gt;In this case, though, they appeared to be our problem. Specifically, the &lt;code class=&quot;highlighter-rouge&quot;&gt;next&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;get&lt;/code&gt; operations are expected to be cheap, but &lt;code class=&quot;highlighter-rouge&quot;&gt;select&lt;/code&gt; tends to be very expensive and a sign of full-table scans.&lt;/p&gt;

&lt;p&gt;By logging onto the node during the benchmark, we could make use of Erlang’s tracing facilities. The built-in tracing functions basically let an operator look at all messages, function calls and function returns, garbage collections, scheduling activities, data transiting in and out of ports, and so on, at the language level.  This tracing is higher level than tracing provided by tools such as strace or dtrace.&lt;/p&gt;

&lt;p&gt;We simply ran calls to &lt;a href=&quot;http://ferd.github.io/recon/recon_trace.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;recon_trace:calls({ets, select_delete, '_'}, 100)&lt;/code&gt;&lt;/a&gt; and saw that all of the calls came from a single process named &lt;code class=&quot;highlighter-rouge&quot;&gt;ssl_manager&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;understanding-and-fixing-the-problem&quot;&gt;Understanding and Fixing the Problem&lt;/h2&gt;

&lt;p&gt;The SSL manager in the Erlang library has been a potential bottleneck for a long period of time. Prior to this run, we had already disabled all kinds of cache implementations that turned out to be slower for our use cases. We had also identified this lone, central process as a point of contention by design – we have a few of them and tend to know about them as specific load scenarios exercise them more than others.&lt;/p&gt;

&lt;p&gt;The tracing above had also shown that the trace calls were made from a module called &lt;a href=&quot;https://github.com/erlang/otp/blob/OTP-19.0/lib/ssl/src/ssl_pkix_db.erl&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;ssl_pkix_db&lt;/code&gt;&lt;/a&gt;, which is in turn called by &lt;a href=&quot;https://github.com/erlang/otp/blob/OTP-19.0/lib/ssl/src/ssl_manager.erl&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;ssl_manager&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These modules were used by the SSL implementation as a cache for intermediary certificates for each connection. The initial intent was to cache certificates read from disk by Erlang, say from &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/ssl/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For us, this would have been costly when the server is fetching the files from disk hundreds or thousands of times a second, decoding them from &lt;a href=&quot;https://en.wikipedia.org/wiki/Privacy-enhanced_Electronic_Mail&quot;&gt;PEM&lt;/a&gt; to &lt;a href=&quot;https://en.wikipedia.org/wiki/X.690#DER_encoding&quot;&gt;DER&lt;/a&gt; format, and subsequently to their &lt;a href=&quot;http://erlang.org/doc/apps/public_key/using_public_key.html&quot;&gt;internal Erlang format&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The cache was set in place such that as soon as one connection requested a given file, it would get decoded once, and then cached in memory through ETS tables with a reference count. Each new session would increment the count, and each terminated session would decrement the count. When the count reaches 0, the cache is dropped. Additional table scans would take place to provide upper time limits for caches. The cache was then used for other operations, such as scanning through CA certs during revocation checks.&lt;/p&gt;
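&lt;p&gt;The reference-counting scheme can be sketched as follows (a hypothetical Python illustration of the mechanism; the real implementation lives in ETS tables managed by &lt;code class=&quot;highlighter-rouge&quot;&gt;ssl_pkix_db&lt;/code&gt;, and also applies time limits we omit here):&lt;/p&gt;

```python
class PemCache:
    """Cache decoded certificates per file, dropping an entry when
    the last session that uses it terminates (refcount hits 0)."""

    def __init__(self, decode):
        self._decode = decode   # e.g. PEM file -> internal format
        self._entries = {}      # path -> [decoded, refcount]

    def checkout(self, path):
        entry = self._entries.get(path)
        if entry is None:       # first session: decode and cache
            entry = self._entries[path] = [self._decode(path), 0]
        entry[1] += 1           # one more session uses this cert
        return entry[0]

    def checkin(self, path):
        entry = self._entries[path]
        entry[1] -= 1
        if entry[1] == 0:       # last session gone: drop the cache
            del self._entries[path]

decodes = []
cache = PemCache(lambda p: decodes.append(p) or "decoded:" + p)
cache.checkout("ca.pem")
cache.checkout("ca.pem")   # second session: served from cache
cache.checkin("ca.pem")
cache.checkin("ca.pem")    # refcount 0: entry dropped
print(len(decodes))  # 1
```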

&lt;p&gt;The funny bit is that the SSL library supports an entirely different way to handle certificates: the user decodes the PEM data to the DER format themselves and then submits the DER data directly in memory through the connection configuration. For dynamic routers such as Heroku’s, that’s the method we’ve taken to avoid storing multiple certificates on disk unencrypted. We store them ourselves, using appropriate cryptographic means, and decode them only once they are requested through SNI, at which point they are passed to the SSL library.&lt;/p&gt;

&lt;p&gt;For this use case, the SSL library has certificates passed straight in memory and does not require file access. Still, the same caching code path was used. The certificates are cached, but neither reference-counted nor shared across connections. For heavy usage of DER-encoded certificates, the PEM cache therefore becomes a central bottleneck for the entire server, forcing the decoding of every certificate individually through that single critical process.&lt;/p&gt;

&lt;h2 id=&quot;patching-results&quot;&gt;Patching Results&lt;/h2&gt;

&lt;p&gt;We decided to write a very simple patch whose purpose is just to bypass the cache wholesale, nothing more. Once the patch was written, we needed to test it. We ran it through our benchmark sequence and the capacity of each instance instantly tripled, with response times now much lower than before.&lt;/p&gt;

&lt;p&gt;The next step was to send our first canary nodes into production to see how they’d react with real-world data rather than benchmark cases. It didn’t take too long to see the improvements:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/heroku-blog/pem-cache/479-deploy-us-fs8.png&quot; alt=&quot;deploy results for a canary node&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The latencies on the node for all SSL/TLS queries over SNI instantly dropped by 4-5x – from 100-120 milliseconds roundtrip down to a range of 20-25 milliseconds for the particular subset of apps we were looking at. That was a lot of overhead.&lt;/p&gt;

&lt;p&gt;In fact, the canary node ran for a while and beat every other node we had on the platform:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/heroku-blog/pem-cache/canary-sni-perf-fs8.png&quot; alt=&quot;deploy results for a canary node vs. all the others&quot; /&gt;&lt;/p&gt;

&lt;p&gt;It seemed a bit surprising that so much time was spent only through the bottleneck, especially since our benchmark case was more pessimistic than our normal production load. Then we looked at garbage collection metrics:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/heroku-blog/pem-cache/erlang-gc-fs8.png&quot; alt=&quot;erlang GC diagram&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The big dip in the middle is the node restarting to run the new software. What’s interesting is that as soon as the new version was used, the garbage collection count went from about 3 million GCs per minute down to roughly 2.5 million GCs per minute. The reclaimed words are impacted more significantly: from around 7 billion words reclaimed down to 4 billion words reclaimed per minute.&lt;/p&gt;

&lt;p&gt;Since Erlang garbage collections are per-process, non-blocking, and generational, it is expected to see a lot of small garbage collections and a few larger ones. The overall count includes both values. Each garbage collection tends to take a very short time.&lt;/p&gt;

&lt;p&gt;What we know for a fact, though, is that we had this one bottleneck of a process and that a lot of time was saved. The supposition is that because a lot of requests (in fact, all of the SNI termination requests) had to touch this one Erlang process, it would have time to accumulate significant amounts of garbage for short periods of time. This garbage collection impacted the latency of this process and the scheduler it was on, but not the others; processes on other cores could keep running fine.&lt;/p&gt;

&lt;p&gt;This yielded a pattern where a lot of data coming from the other 15 cores could pile up into the message queue of the SSL manager, up to the point that memory pressure forced it to GC more and more often. At regular intervals, a more major GC would take place, and stall an even larger number of requests.&lt;/p&gt;

&lt;p&gt;By removing the one small bottleneck, the level of garbage generated was more equally shared by all callers and also tended to be much more short-lived. In fact, if the caller was a short-lived process, its memory would be reclaimed on termination without any garbage collection taking place.&lt;/p&gt;

&lt;p&gt;The results were so good that we rapidly deployed them to the rest of the platform:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/heroku-blog/pem-cache/479-deploy-us-canaries-fs8.png&quot; alt=&quot;overall deploy latency graph&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Those values represented the median (orange), the 95th percentile (blue), and the 99th percentile (green). You can see the progression of the deploy as the median rapidly shrinks, and the point where the deploy finishes when the 99th percentile responses became faster than our previous medians.&lt;/p&gt;

&lt;p&gt;The patch has been written, tested, cleaned up, and sent upstream to the OTP team at Ericsson (&lt;a href=&quot;https://github.com/erlang/otp/pull/1143&quot;&gt;see the pull request&lt;/a&gt;). It has recently been shipped along with Erlang OTP-19.1, and everyone can now use it in the releases starting then.&lt;/p&gt;
</description>
        <pubDate>Thu, 22 Dec 2016 16:00:00 +0000</pubDate>
        <link>http://engineering.heroku.com/blogs/2016-12-22-how-we-sped-up-sni-tls-handshakes-by-5x/</link>
        <guid isPermaLink="true">http://engineering.heroku.com/blogs/2016-12-22-how-we-sped-up-sni-tls-handshakes-by-5x/</guid>

        

        

      </item>
    
      <item>
        <title>Handling Very Large Tables in Postgres Using Partitioning</title>
        <description>&lt;p&gt;One of the interesting patterns that we’ve seen, as a result of managing one of the largest fleets of &lt;a href=&quot;https://www.heroku.com/postgres&quot;&gt;Postgres databases&lt;/a&gt;, is one or two tables growing at a rate that’s much faster than the rest of the tables in the database. In terms of absolute numbers, a sufficiently large table is on the order of hundreds of gigabytes to terabytes in size. Typically, the data in this table tracks events in an application or is analogous to an application log. Having a table of this size isn’t a problem in and of itself, but it can lead to other issues: query performance can start to degrade, and indexes can take much longer to update. Maintenance tasks, such as vacuum, can also become inordinately long. Depending on how you need to work with the information being stored, Postgres table partitioning can be a great way to restore query performance and deal with large volumes of data over time without having to resort to changing to a different data store.&lt;/p&gt;

&lt;p&gt;We use &lt;a href=&quot;https://github.com/keithf4/pg_partman&quot;&gt;pg_partman&lt;/a&gt; ourselves in the Postgres database that backs the control plane maintaining the fleet of Heroku Postgres, Heroku Redis, and Heroku Kafka stores. In our control plane, we have a table that tracks all of the state transitions for any individual data store. Since we don’t need that information to stick around after a couple of weeks, we use table partitioning. This allows us to drop tables after the two-week window while keeping queries blazing fast. To understand how to get better performance with a large dataset in Postgres, we need to understand how Postgres does inheritance, how to set up table partitions manually, and then how to use the Postgres extension, pg_partman, to ease the partitioning setup and maintenance process.&lt;/p&gt;

&lt;h2 id=&quot;lets-talk-about-inheritance-first&quot;&gt;Let’s Talk About Inheritance First&lt;/h2&gt;

&lt;p&gt;Postgres has basic support for table partitioning via table inheritance. Inheritance for tables in Postgres is much like inheritance in object-oriented programming. A table is said to inherit from another one when it maintains the same data definition and interface. Table inheritance for Postgres has been around for quite some time, which means the functionality has had time to mature.  Let’s walk through a contrived example to illustrate how inheritance works:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;products&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BIGSERIAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;price&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INTEGER&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;updated_at&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TIMESTAMPTZ&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;books&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;isbn&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TEXT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;author&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TEXT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;title&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TEXT&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INHERITS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;products&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;albums&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;artist&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TEXT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;length&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INTEGER&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;number_of_songs&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INTEGER&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INHERITS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;products&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;In this example, both books and albums inherit from products. A record inserted into the books table has all of the columns of the products table plus those of the books table. A query issued against the products table references the products table plus all of its descendants; for this example, that means products, books, and albums. That’s the default behavior in Postgres, but you can also issue queries against any of the child tables individually.&lt;/p&gt;
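
&lt;p&gt;To make that concrete, here’s a quick sketch of the ways the tables above can be queried; the &lt;code class=&quot;highlighter-rouge&quot;&gt;ONLY&lt;/code&gt; keyword restricts a query to the parent table itself:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- References products plus all of its descendants (books and albums):
SELECT count(*) FROM products;

-- ONLY restricts the query to the parent table itself:
SELECT count(*) FROM ONLY products;

-- Child tables can also be queried individually:
SELECT author, title, price FROM books;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;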

&lt;h2 id=&quot;setting-up-partitioning-manually&quot;&gt;Setting up Partitioning Manually&lt;/h2&gt;

&lt;p&gt;Now that we have a grasp on inheritance in Postgres, we’ll set up partitioning manually. The basic premise of partitioning is that a master table exists that all of the children inherit from; we’ll use the terms ‘child table’ and ‘partition’ interchangeably throughout the rest of the setup process. Data should not live on the master table at all. Instead, when data gets inserted into the master table, a trigger in Postgres redirects it to the appropriate child partition table. On top of that, a CHECK constraint is put on each child table so that data inserted directly into a child table is rejected unless it actually belongs in that partition. That way data that doesn’t belong in the partition won’t end up in there.&lt;/p&gt;

&lt;p&gt;When doing table partitioning, you need to figure out what key will dictate how information is partitioned across the child tables. Let’s go through the process of partitioning a very large events table in our Postgres database. For an events table, time is the key that determines how to split out information. Let’s also assume that our events table gets 10 million INSERTs done in any given day and this is our original events table schema:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;uuid&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;user_id&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bigint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;account_id&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bigint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;timestamptz&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Let’s make a few more assumptions to round out the example. The aggregate queries that run against the events table only have a time frame of a single day. This means our aggregations are split up by hour for any given day. Our usage of the data in the events table only spans a couple of days. After that time, we don’t query the data any more. On top of that, we have 10 million events generated a day. Given these extra assumptions, it makes sense to create daily partitions. The key that we’ll use to partition the data will be the time at which the event was created (e.g. created_at).&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;uuid&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;user_id&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bigint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;account_id&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bigint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;timestamptz&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events_20160801&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; 
    &lt;span class=&quot;k&quot;&gt;CHECK&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2016-08-01 00:00:00'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2016-08-02 00:00:00'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INHERITS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;events&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events_20160802&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; 
    &lt;span class=&quot;k&quot;&gt;CHECK&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2016-08-02 00:00:00'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2016-08-03 00:00:00'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INHERITS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;events&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Our master table has been defined as &lt;code class=&quot;highlighter-rouge&quot;&gt;events&lt;/code&gt; and we have two tables out in the future that are ready to accept data, events_20160801 and events_20160802. We’ve also put CHECK constraints on them to make sure that only data for that day ends up on that partition. Now we need to create a trigger to make sure that any data entered on the master table gets directed to the correct partition:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OR&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;REPLACE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FUNCTION&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_insert_trigger&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;RETURNS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TRIGGER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;$$&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;BEGIN&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;IF&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NEW&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2016-08-01 00:00:00'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt;
         &lt;span class=&quot;k&quot;&gt;NEW&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2016-08-02 00:00:00'&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;THEN&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events_20160801&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VALUES&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;NEW&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ELSIF&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NEW&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2016-08-02 00:00:00'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt;
         &lt;span class=&quot;k&quot;&gt;NEW&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'2016-08-03 00:00:00'&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;THEN&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events_20160802&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VALUES&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;NEW&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;ELSE&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;RAISE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXCEPTION&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Date out of range.  Fix the event_insert_trigger() function!'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;END&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;RETURN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;END&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;$$&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;LANGUAGE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plpgsql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TRIGGER&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;insert_event_trigger&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;BEFORE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;FOR&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EACH&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ROW&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;EXECUTE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;PROCEDURE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_insert_trigger&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Great! The partitions have been created, the trigger function defined, and the trigger added to the events table. At this point, our application can insert data into the events table and the rows will be routed to the appropriate partition.&lt;/p&gt;
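
&lt;p&gt;As a quick sanity check, here’s a sketch (assuming the tables and trigger above; the sample values are made up) of an insert landing in its daily partition, and of the planner using the CHECK constraints to skip partitions that can’t match a query:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Routed by the trigger into events_20160801:
INSERT INTO events (uuid, name, user_id, account_id, created_at)
VALUES ('abc-123', 'signup', 42, 7, '2016-08-01 08:30:00');

SELECT count(*) FROM ONLY events;      -- 0: nothing lives on the master table
SELECT count(*) FROM events_20160801;  -- 1

-- Constraint exclusion lets the planner skip partitions whose
-- CHECK constraints can't match; this plan should only scan
-- events and events_20160801:
EXPLAIN SELECT count(*) FROM events
WHERE created_at &amp;gt;= '2016-08-01 00:00:00'
  AND created_at &amp;lt; '2016-08-02 00:00:00';
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;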

&lt;p&gt;Unfortunately, setting up table partitioning by hand is a very manual process that’s fraught with chances for failure. It requires us to go into the database periodically to update the partitions and the trigger, and we haven’t even talked about removing old data from the database yet. This is where pg_partman comes in.&lt;/p&gt;

&lt;h2 id=&quot;implementing-pg_partman&quot;&gt;Implementing pg_partman&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/keithf4/pg_partman&quot;&gt;pg_partman&lt;/a&gt; is a partition management extension for Postgres that makes the process of creating and managing table partitions easier for both time and serial-based table partition sets. Compared to partitioning a table manually, pg_partman makes it much easier to partition a table and reduce the code necessary to run partitioning outside of the database. Let’s run through an example of doing this from scratch:&lt;/p&gt;

&lt;p&gt;First, let’s load the extension and create our events table. If you already have a big table defined, the pg_partman documentation has guidance for how to convert that table into one that’s using table partitioning.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gp&quot;&gt;$ &lt;/span&gt;heroku pg:psql -a sushi
&lt;span class=&quot;gp&quot;&gt;sushi::DATABASE=&amp;gt; &lt;/span&gt;CREATE EXTENSION pg_partman;
&lt;span class=&quot;gp&quot;&gt;sushi::DATABASE=&amp;gt; &lt;/span&gt;CREATE TABLE events &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
  id bigint,
  name text,
  properties jsonb,
  created_at timestamptz
&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Let’s reuse the assumptions we made about our event data earlier: we’ve got 10 million events created a day, and our queries aggregate on a daily basis. Because of this, we’re going to create daily partitions.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gp&quot;&gt;sushi::DATABASE=&amp;gt; &lt;/span&gt;SELECT create_parent&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'public.events'&lt;/span&gt;, &lt;span class=&quot;s1&quot;&gt;'created_at'&lt;/span&gt;, &lt;span class=&quot;s1&quot;&gt;'time'&lt;/span&gt;, &lt;span class=&quot;s1&quot;&gt;'daily'&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;This command tells pg_partman that we’re going to use time-series based partitioning, that created_at is the column we’ll partition on, and that we want to partition our master events table on a daily basis. Amazingly, everything that we did to set up partitioning manually is completed in this one command. But we’re not finished: we need to make sure that maintenance is run at regular intervals so that new partitions get created and old ones get removed.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gp&quot;&gt;sushi::DATABASE=&amp;gt; &lt;/span&gt;SELECT run_maintenance&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
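
&lt;p&gt;At this point the daily partitions should exist. One way to see them, sketched here with plain Postgres catalogs rather than anything pg_partman-specific (the partition names you get will depend on the date you run this), is to list the children of events:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- List every child table that inherits from events:
SELECT inhrelid::regclass AS partition
FROM pg_inherits
WHERE inhparent = 'public.events'::regclass
ORDER BY 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;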

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;run_maintenance()&lt;/code&gt; command will instruct pg_partman to look through all of the tables that were partitioned and identify if new partitions should be created and old partitions destroyed. Whether or not a partition should be destroyed is determined by the retention configuration options. While this command can be run via a terminal session, we need to set this up to run on a regular basis. This is a great opportunity to use Heroku Scheduler to accomplish the task.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/pg_partman_scheduler.png&quot; alt=&quot;pg_partman Scheduler&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This command will run on an hourly basis to double check the partitions in the database. Checking the partitioning hourly might be a bit of overkill in this scenario, but because Heroku Scheduler is a best-effort service, running it more often than necessary guards against missed runs, and the extra checks won’t cause any performance impact on the database.&lt;/p&gt;
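
&lt;p&gt;One more knob worth noting: whether old partitions get dropped at all is governed by pg_partman’s part_config table. As a sketch (the column names here follow the pg_partman documentation; check them against the version you have installed), a two-week retention for our events table might look like:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Drop partitions once their data is more than two weeks old:
UPDATE part_config
SET retention = '2 weeks',
    retention_keep_table = false
WHERE parent_table = 'public.events';
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;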

&lt;p&gt;That’s it! We’ve set up table partitioning in Heroku Postgres and it will be running on its own with very little maintenance on our part. This setup only scratches the surface of what’s possible with pg_partman. Check out the extension’s documentation for the details of what’s possible.&lt;/p&gt;

&lt;h2 id=&quot;should-i-use-table-partitioning&quot;&gt;Should I Use Table Partitioning?&lt;/h2&gt;

&lt;p&gt;Table partitioning allows you to break out one very large table into many smaller tables, which can dramatically increase performance. As pointed out in the ‘Setting up Partitioning Manually’ section, many challenges exist when trying to create and use table partitioning on your own, but pg_partman can ease that operational burden. Even so, table partitioning shouldn’t be the first solution you reach for when you run into problems. A number of questions should be asked to determine if table partitioning is the right fit:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Do you have a sufficiently large data set stored in one table, and do you expect it to grow significantly over time?&lt;/li&gt;
  &lt;li&gt;Is the data immutable, that is, will it never be updated after being initially inserted?&lt;/li&gt;
  &lt;li&gt;Have you done as much optimization as possible on the big table with indexes?&lt;/li&gt;
  &lt;li&gt;Do you have data that has little value after a period of time?&lt;/li&gt;
  &lt;li&gt;Is there a small range of data that has to be queried to get the results needed?&lt;/li&gt;
  &lt;li&gt;Can data that has little value be archived to a slower, cheaper storage medium, or can the older data be stored in aggregate or “rolled up”?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you answered yes to all of these questions, table partitioning could make sense for you. The big caveat is that table partitioning requires you to evaluate &lt;em&gt;how&lt;/em&gt; you’re querying your data. This is a big departure from designing a schema and optimizing it as you go; table partitioning requires you to plan ahead and consider your usage patterns. So long as you take these factors into account, table partitioning can create very real performance gains for your queries and your application.&lt;/p&gt;

&lt;h2 id=&quot;extending-your-postgres-installation&quot;&gt;Extending Your Postgres Installation&lt;/h2&gt;

&lt;p&gt;In situations where you have high volumes of data with a very short shelf life (days, weeks, or even months), table partitioning can make lots of sense. As always, make sure you ask how you’re going to query your data now and in the future. Table partitioning won’t handle everything for you, but it will at least allow you to extend the life of your Heroku Postgres installation. If you’re using pg_partman, we’d love to hear about it. Email us at &lt;a href=&quot;mailto:postgres@heroku.com&quot;&gt;postgres@heroku.com&lt;/a&gt; or reach out on Twitter &lt;a href=&quot;https://twitter.com/heroku&quot;&gt;@heroku&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Tue, 13 Sep 2016 15:00:00 +0000</pubDate>
        <link>http://engineering.heroku.com/blogs/2016-09-13-handling-very-large-tables-in-postgres-using-partitioning/</link>
        <guid isPermaLink="true">http://engineering.heroku.com/blogs/2016-09-13-handling-very-large-tables-in-postgres-using-partitioning/</guid>

        

        

      </item>
    
      <item>
        <title>Apache Kafka 0.10: Evaluating Performance in Distributed Systems</title>
        <description>&lt;p&gt;At Heroku, we’re always striving to provide the best operational experience with the services we offer. As we’ve recently launched Heroku Kafka, we were excited to help out with testing of the latest release of Apache Kafka, version 0.10, which landed earlier this week. While testing Kafka 0.10, we uncovered what seemed like a 33% throughput drop relative to the prior release. As &lt;a href=&quot;https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/&quot;&gt;others have noted&lt;/a&gt;, “it’s slow” is the hardest problem you’ll ever debug, and debugging this turned out to be very tricky indeed. We had to dig deep into Kafka’s configuration and operation to uncover what was going on.&lt;/p&gt;

&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;We’ve been benchmarking Heroku Kafka for some time, as we prepared for the &lt;a href=&quot;https://blog.heroku.com/archives/2016/4/26/announcing-heroku-kafka-early-access&quot;&gt;public launch&lt;/a&gt;. We started out benchmarking to help provide our users with guidance on the performance of each Heroku Kafka plan. We realized we could also give back to the Kafka community by testing new versions and sharing our findings. Our benchmark system orchestrates multiple producer and consumer dynos, allowing us to fully exercise a Kafka cluster and determine its limits across the various parameters of its use.&lt;/p&gt;

&lt;h2 id=&quot;discovery&quot;&gt;Discovery&lt;/h2&gt;

&lt;p&gt;We started benchmarking the Kafka 0.10.0 release candidate shortly after it was made available. In the very first test we ran, we noted a 33% performance drop. Version 0.9 on the same cluster configuration provided 1.5 million messages per second in and out at peak, while version 0.10.0 was doing just under 1 million messages per second. This was pretty alarming: if this condition were present for all users of Kafka, such a large reduction in throughput would be a major disincentive to upgrade.&lt;/p&gt;

&lt;p&gt;We set out to determine the cause (or causes) of this decrease in throughput. We ran dozens of benchmark variations, testing a wide variety of hypotheses:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Does this only impact our largest plan? Or are all plans equally impacted?&lt;/li&gt;
  &lt;li&gt;Does this impact a single producer, or do we have to push the boundaries of the cluster to find it?&lt;/li&gt;
  &lt;li&gt;Does this impact plaintext and TLS connections equally?&lt;/li&gt;
  &lt;li&gt;Does this impact large and small messages equally?&lt;/li&gt;
  &lt;li&gt;And many other variations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We investigated many of the possible angles suggested by our hypotheses, and turned to the community for fresh ideas to narrow down the cause. We asked the &lt;a href=&quot;http://search-hadoop.com/m/uyzND1XZjMV7bhNR&quot;&gt;Kafka mailing list&lt;/a&gt; for help, reporting the issue and all the things we had tried. The community immediately dove into trying to reproduce the issue and also responded with helpful questions and pointers for things we could investigate.&lt;/p&gt;

&lt;p&gt;During our intensive conversation with the community and review of the discussions that led up to the 0.10 release candidate, we found this fascinating &lt;a href=&quot;https://issues.apache.org/jira/browse/KAFKA-3565&quot;&gt;thread&lt;/a&gt; about another performance regression in 0.10. This issue didn’t appear to line up with the problem we had found, but it gave us more insight into Kafka and helped us understand the root cause of our particular problem. We found this other issue to be very counter-intuitive: increasing the performance of a Kafka broker can actually negatively impact performance of the whole system. Kafka relies very heavily on batching, and if the broker becomes faster, the producers batch less often. Version 0.10 included several improvements to the broker’s performance, and that caused odd performance impacts that have since been fixed.&lt;/p&gt;

&lt;p&gt;To help us proceed in a more effective and deliberate manner, we started applying &lt;a href=&quot;http://www.brendangregg.com/usemethod.html&quot;&gt;Brendan Gregg’s USE method&lt;/a&gt; to a broker during benchmarks. The USE method helps structure performance investigations, and is very easy to apply, yet also very robust. Simply, it says:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Make a list of all the resources used by the system (network, CPU, disks, memory, etc)&lt;/li&gt;
  &lt;li&gt;For each resource, look at the:
    &lt;ol&gt;
      &lt;li&gt;Utilization: the average time the resource was busy servicing work&lt;/li&gt;
      &lt;li&gt;Saturation: the amount of extra work queued for the resource&lt;/li&gt;
      &lt;li&gt;Errors: the count of error events&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We started going through these, one by one, and rapidly confirmed that Kafka is blazingly fast and easily capable of maxing out the performance of your hardware. Benchmarking it will quickly identify the bottlenecks in your setup.&lt;/p&gt;

&lt;p&gt;What we soon found is that the network cards were being oversaturated during our benchmarking of version 0.10, and they were dropping packets because of the number of bytes they were asked to pass around. When we benchmarked version 0.9, the network cards were being pushed to just below their saturation point. What changed in version 0.10? Why did it lead to saturation of the networking hardware under the same conditions?&lt;/p&gt;

&lt;h2 id=&quot;understanding&quot;&gt;Understanding&lt;/h2&gt;

&lt;p&gt;Kafka 0.10 brings with it a few new features. One of the biggest ones, which was added to support Kafka Streams and other time-based capabilities, is that each message now has a timestamp to record when it is created. This timestamp accounts for an additional 8 bytes per message. This was the issue. Our benchmarking setup was pushing the network cards so close to saturation that an extra 8 bytes per message was the problem. Our benchmarks run with replication factor 3, so an additional 8 bytes per message is an extra 288 megabits per second of traffic over the whole cluster:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;{288}\ \text{Mbps}\ \ \ \ =\ \ \ {8} \textstyle \frac{\text{bytes}}{\text{message}}\ \ \ \times\ \ \ {1.5}\ \text{million} \frac{\text{messages}}{\text{second}}\ \ \ \times\ \ \ {8} \frac{\text{bits}}{\text{byte}}\ \ \ \times\ \ \ {3}\ \tiny\text{(1 for producer, 2 for replication)}&lt;/script&gt;

&lt;p&gt;This extra traffic is more than enough to oversaturate the network. Once the network cards are oversaturated, they start dropping packets and doing retransmits. This dramatically reduces network throughput, as we saw in our benchmarks.&lt;/p&gt;
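&lt;p&gt;The arithmetic above is easy to sanity-check. Here’s a back-of-the-envelope sketch in Ruby (not part of our benchmark tooling), using the same numbers as the formula:&lt;/p&gt;

```ruby
# Back-of-the-envelope: extra network traffic from 8 bytes/message.
BYTES_PER_MESSAGE = 8          # Kafka 0.10 timestamp overhead
MESSAGES_PER_SEC  = 1_500_000  # benchmark throughput
BITS_PER_BYTE     = 8
COPIES            = 3          # 1 producer write plus 2 replica writes

extra_bits_per_sec = BYTES_PER_MESSAGE * MESSAGES_PER_SEC * BITS_PER_BYTE * COPIES
puts "extra traffic: #{extra_bits_per_sec / 1_000_000} Mbps"  # 288 Mbps
```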

&lt;p&gt;To further verify our hypothesis, we reproduced this under Kafka 0.9. When we increased the message size by 8 bytes, we saw the same performance impact.&lt;/p&gt;

&lt;h2 id=&quot;giving-back&quot;&gt;Giving Back&lt;/h2&gt;

&lt;p&gt;Ultimately, there’s not much to do here in terms of fixing this issue by making changes to Kafka’s internals. Any Kafka cluster that runs close to the limits of its networking hardware will likely see issues of this sort, no matter what version. Kafka 0.10 just made the issue more apparent in our analyses, due to the increase in baseline message size. These issues would also happen if you as a user added a small amount of overhead to each message and were driving sufficient volume through your cluster. Production use cases tend to have a lot of variance in message size (usually a lot more than 8 bytes), so we expect most production uses of Kafka to not be impacted by the overhead in 0.10. The real trick is not to saturate your network in the first place, so it pays to model out an approximation of your data relative to your configuration.&lt;/p&gt;

&lt;p&gt;We contributed a &lt;a href=&quot;https://github.com/apache/kafka/commit/62985f313f4b66d3810b05650d2a3f42b6cb054e&quot;&gt;documentation patch&lt;/a&gt; about the impact of the increased network traffic so that other Kafka users don’t have to go through the same troubleshooting steps. For Heroku Kafka, we’ve been looking at a few networking improvements we can make to the underlying cluster configurations to help mitigate the impact of the additional message overhead. We’ve also been looking at improved monitoring and bandwidth recommendations to better understand the limits of possible configurations, and to be able to provide a graceful and stable operational experience with Heroku Kafka.&lt;/p&gt;

&lt;p&gt;Kafka 0.10 is in beta on Heroku Kafka now. For those of you in the &lt;a href=&quot;https://www.heroku.com/kafka&quot;&gt;Heroku Kafka beta&lt;/a&gt;, you can provision a 0.10 cluster like so:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;heroku addons:create heroku-kafka --version 0.10
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;We encourage you to check it out. &lt;a href=&quot;http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple&quot;&gt;Kafka Streams&lt;/a&gt;, added in version 0.10, makes many kinds of applications much easier to build. Kafka Streams works extremely well with Heroku, as it’s just a (very powerful) library you embed in your application.&lt;/p&gt;

&lt;p&gt;If you aren’t on the beta, you can request access here: &lt;a href=&quot;https://www.heroku.com/kafka&quot;&gt;https://heroku.com/kafka&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We recommend that you continue to use Kafka 0.9.0.1, the default for Heroku Kafka, for production use. We are working closely with the community to further test and validate 0.10 as ready for production, and we take time to do this in order to iron out any bugs in new releases for our customers. That only happens if people try the new version out with their applications (for example, in a staging environment), so we welcome and encourage your use of it. We can’t wait to see what you build!&lt;/p&gt;
</description>
        <pubDate>Fri, 27 May 2016 15:00:00 +0000</pubDate>
        <link>http://engineering.heroku.com/blogs/2016-05-27-apache-kafka-010-evaluating-performance-in-distributed-systems/</link>
        <guid isPermaLink="true">http://engineering.heroku.com/blogs/2016-05-27-apache-kafka-010-evaluating-performance-in-distributed-systems/</guid>

        

        

      </item>
    
      <item>
        <title>Heroku Metrics: There and Back Again</title>
        <description>&lt;p&gt;For almost two years now, the Heroku Dashboard has provided a metrics page to display information about memory usage and CPU load for all of the dynos running an application. Additionally, we’ve been providing aggregate error metrics, as well as metrics from the &lt;a href=&quot;https://devcenter.heroku.com/articles/http-routing&quot;&gt;Heroku router&lt;/a&gt; about incoming requests: average and P95 response time, counts by status, etc.&lt;/p&gt;

&lt;p&gt;Almost all of this information is being slurped out of an application’s log stream via the &lt;a href=&quot;https://devcenter.heroku.com/articles/log-runtime-metrics&quot;&gt;Log Runtime Metrics&lt;/a&gt; labs feature. For applications that don’t have this flag enabled, which is most applications on the platform, the relevant logs are still generated, but bypass &lt;a href=&quot;https://devcenter.heroku.com/articles/logplex&quot;&gt;Logplex&lt;/a&gt;, and are instead sent directly to our metrics processing pipeline.&lt;/p&gt;

&lt;p&gt;Since its beta release, Dashboard Metrics has been a &lt;em&gt;good&lt;/em&gt; product. Upgrading from good to &lt;em&gt;great&lt;/em&gt; unearthed some interesting performance hurdles, including a doubling of our CPU requirements, which became mere bumps once we slapped a “Powered by &lt;a href=&quot;http://kafka.apache.org/&quot;&gt;Kafka&lt;/a&gt;” sticker on it. What follows is a look back at the events and decisions that led us to &lt;a href=&quot;https://devcenter.heroku.com/changelog-items/878&quot;&gt;enlightenment&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;historically-speaking&quot;&gt;Historically Speaking…&lt;/h2&gt;

&lt;p&gt;Historically speaking, if a Heroku user wanted system level metrics about their apps, they had two choices:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Set up an add-on such as &lt;a href=&quot;https://elements.heroku.com/addons/newrelic&quot;&gt;New Relic&lt;/a&gt; or &lt;a href=&quot;https://elements.heroku.com/addons/librato&quot;&gt;Librato&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Set up a Logplex drain, enable the log-runtime-metrics flag, and build their own tooling around it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In August of 2014, a third option became available: we &lt;a href=&quot;https://blog.heroku.com/archives/2014/8/5/new-dashboard-and-metrics-beta&quot;&gt;shipped&lt;/a&gt; visualizations of the past 24 hours of 10-minute roll-up data right in the dashboard!&lt;/p&gt;

&lt;p&gt;The architecture for the initial system was quite simple. We &lt;a href=&quot;https://github.com/heroku/lumbermill&quot;&gt;received&lt;/a&gt; incoming log lines containing metrics (the same log lines a customer sees when they turn on log-runtime-metrics) and turned them into points, which we then persisted to a 21-shard &lt;a href=&quot;https://influxdata.com/&quot;&gt;InfluxDB&lt;/a&gt; cluster using &lt;a href=&quot;https://en.wikipedia.org/wiki/Consistent_hashing&quot;&gt;consistent hashing&lt;/a&gt;. The dashboard then queried the relevant metrics via a simple API proxy layer and rendered them.&lt;/p&gt;

&lt;p&gt;Our InfluxDB setup just worked; so well, in fact, that for nearly two years the only maintenance we did was upgrading the operating system to Ubuntu Trusty, leaving us on a nearly two-year-old InfluxDB release! (&lt;em&gt;Note:&lt;/em&gt; all opinions are based on the 0.7.3 release, which has long been dropped from support. InfluxDB served us &lt;em&gt;incredibly&lt;/em&gt; well, and we are eternally grateful for their effort.)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/heroku-metrics/old-architecture.png&quot; alt=&quot;Old Metrics Architecture&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;re-energizing&quot;&gt;Re-Energizing&lt;/h2&gt;

&lt;p&gt;As we sat back and collected user experiences about Dashboard Metrics, it was clear that our users wanted more. And we wanted to deliver more! We spun off a new team charged with building a better operational experience for our customers, with metrics being at the forefront.&lt;/p&gt;

&lt;p&gt;Almost immediately we &lt;a href=&quot;https://devcenter.heroku.com/changelog-items/756&quot;&gt;shipped&lt;/a&gt; 2 hours of data at 1-minute resolution, and started enhancing the events that we can show by querying internal APIs. We’ve since added restart events, scaling events, deployment markers, and configuration changes, and have lots of other enhancements planned. But as we started making these changes, we realized that our current infrastructure wasn’t going to cut it.&lt;/p&gt;

&lt;p&gt;We set off researching ideas for how to build our next generation metrics pipeline with a few overarching goals:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Make it extensible.&lt;/li&gt;
  &lt;li&gt;Make it robust.&lt;/li&gt;
  &lt;li&gt;Make it run on Heroku.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last goal is most important for us. Yes, of course, we want to ensure that whatever system we put in place is tolerant against failures and that it can be extended to drive new features. But, we also want to understand what operational headaches our customers have firsthand so we can look at developing features that address them, instead of building features that look nice on paper but solve nothing in practice.&lt;/p&gt;

&lt;p&gt;Plus, our coworkers already operate the hell out of databases and our runtimes. It’d be foolish to not leverage their skills!&lt;/p&gt;

&lt;h2 id=&quot;an-idealist-architecture&quot;&gt;An &lt;em&gt;Idealist&lt;/em&gt; Architecture&lt;/h2&gt;

&lt;p&gt;The data ingestion path for a metrics aggregation system is the most critical. Not seeing a spike in 5xx errors during a flash sale makes businesses sad. Being blind to increased latency makes customers sad. A system which drops data on the floor due to a downstream error makes &lt;em&gt;me&lt;/em&gt; sad.&lt;/p&gt;

&lt;p&gt;But, in our previous system, this could happen and did happen anytime we had to restart a node. We were not robust to restarts, or crashes, or really any failures.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The New System Had to Be.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With our previous setup, when data was successfully stored in InfluxDB, that was pretty much the end of the line. Sure, one can build systems to query the data again, and we have, but with millions of constantly changing dynos, and therefore metric streams, knowing what to query at any given time isn’t easy.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data in the New System Should Never Rest.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our summarization of raw data with InfluxDB relied on &lt;a href=&quot;https://docs.influxdata.com/influxdb/v0.12/guides/downsampling_and_retention/#continuous-queries&quot;&gt;continuous queries&lt;/a&gt;, which while convenient, were fairly expensive. When we added continuous queries for 1-minute resolution, our CPU load doubled.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We Should Only Persist Rolled Up Data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With these three properties in mind, we found stream processing, specifically with &lt;a href=&quot;http://kafka.apache.org/&quot;&gt;Apache Kafka&lt;/a&gt;, to fit our needs quite well. With Kafka as the bus between each of the following components, our architecture was set.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Ingestion:&lt;/strong&gt; Parses Logplex frames into measurement data and commits them to Kafka&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Summarization:&lt;/strong&gt; Groups measurements by app/dyno type as appropriate for the metric and computes the count, sum, sum of squares, min, and max for each metric over each 1-minute window, before committing the summaries back to Kafka&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Sink:&lt;/strong&gt; Writes the summarized data to persistent storage, however appropriate.&lt;/li&gt;
&lt;/ol&gt;
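&lt;p&gt;To make the summarization step concrete, here’s a minimal Ruby sketch of such a streaming accumulator. (Our production pipeline is written in Go; the names here are illustrative only.)&lt;/p&gt;

```ruby
# Illustrative 1-minute summary accumulator: tracks count, sum,
# sum of squares, min, and max without retaining raw samples.
class Summary
  attr_reader :count, :sum, :sum_sq, :min, :max

  def initialize
    @count  = 0
    @sum    = 0.0
    @sum_sq = 0.0
    @min    = nil
    @max    = nil
  end

  def add(value)
    @count  += 1
    @sum    += value
    @sum_sq += value * value
    @min = @min.nil? ? value : [@min, value].min
    @max = @max.nil? ? value : [@max, value].max
    self
  end
end

load_avgs = Summary.new
[0.5, 1.5, 1.0].each { |v| load_avgs.add(v) }
# count=3 sum=3.0 sum_sq=3.5 min=0.5 max=1.5
```

&lt;p&gt;Because the accumulator keeps only these five numbers rather than the raw samples, its memory cost is constant no matter how many measurements arrive in a minute.&lt;/p&gt;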

&lt;h2 id=&quot;building-on-heroku&quot;&gt;Building on Heroku&lt;/h2&gt;

&lt;p&gt;We ran into a number of problems, not the least of which was the volume and size of the HTTP requests we get from Logplex and the other metrics-producing systems. Pushing this traffic through the shared Heroku router was something we wanted to avoid.&lt;/p&gt;

&lt;p&gt;Fortunately, &lt;a href=&quot;https://www.heroku.com/private-spaces&quot;&gt;Private Spaces&lt;/a&gt; was entering beta and looking for testers. We became customers.&lt;/p&gt;

&lt;p&gt;Overall, our architecture looks like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/heroku-metrics/new-architecture.png&quot; alt=&quot;New Metrics Architecture&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Let’s dive into the different pieces.&lt;/p&gt;

&lt;h2 id=&quot;private-spaces&quot;&gt;Private Spaces&lt;/h2&gt;

&lt;p&gt;With &lt;a href=&quot;https://www.heroku.com/private-spaces&quot;&gt;Private Spaces&lt;/a&gt;, a customer gets an isolated “mini-Heroku” into which to deploy apps. That isolation comes with an elastic routing layer, dedicated runtime instances, and access to the same exact Heroku experience one gets in the common runtime (pipelines, buildpacks, CLI tools, metrics, logging, etc). It was only natural for us to deploy our new system into a space.&lt;/p&gt;

&lt;p&gt;Our space runs 4 different apps, all of which share the same &lt;a href=&quot;https://golang.org/&quot;&gt;Go&lt;/a&gt; code base. We share a common code base to reduce the pain of package dependencies in Go and to better ensure compatibility between services.&lt;/p&gt;

&lt;h2 id=&quot;metrics-ingress&quot;&gt;Metrics Ingress&lt;/h2&gt;

&lt;p&gt;By design, our ingress app is simple. It speaks the Logplex drain protocol over HTTP and converts each log line into an appropriate &lt;a href=&quot;https://developers.google.com/protocol-buffers/&quot;&gt;Protocol Buffers&lt;/a&gt; encoded message. Log lines representing router requests, for instance, are then in turn produced to the &lt;code class=&quot;highlighter-rouge&quot;&gt;ROUTER_REQUESTS&lt;/code&gt; Kafka topic.&lt;/p&gt;

&lt;h2 id=&quot;yahtzee-an-aggregation-framework&quot;&gt;Yahtzee: An Aggregation Framework&lt;/h2&gt;

&lt;p&gt;We custom-built our aggregation layer instead of opting to run something like &lt;a href=&quot;http://spark.apache.org&quot;&gt;Spark&lt;/a&gt; on the platform. The big stream processing frameworks all make assumptions about the underlying orchestration layer, and some of them even require arbitrary TCP connections, which Heroku, on the whole, doesn’t currently support. Also, at the time we were making this decision, Kafka Streams was not ready for prime time, so instead we built a simple aggregation framework in Go that runs in a dyno and plays well with Heroku Kafka.&lt;/p&gt;

&lt;p&gt;In reality, building a simple aggregation framework is pretty straightforward for our use case. Each of our consumers is a large map of accumulators that either just count stuff, or perform simple streaming computations like min, max, and sum. Orchestrating which dynos consume which partitions was done with an upfront deterministic mapping based on the number of dynos we want to run and the number of partitions we have. We trust Heroku’s ability to keep the right number of dynos running, and slightly over-provision to avoid delays.&lt;/p&gt;
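&lt;p&gt;The up-front mapping can be as simple as round-robin assignment. The following Ruby sketch (with hypothetical partition and dyno counts) shows one deterministic scheme of the kind we mean; the actual mapping we use is an internal detail:&lt;/p&gt;

```ruby
# Deterministic up-front assignment of Kafka partitions to consumer
# dynos. Hypothetical counts; one simple round-robin scheme.
PARTITIONS = 32
DYNOS      = 8   # slightly over-provisioned consumer count

def partitions_for(dyno_index)
  (0...PARTITIONS).select { |p| p % DYNOS == dyno_index }
end

partitions_for(0)  # [0, 8, 16, 24]
```

&lt;p&gt;Because the mapping depends only on the configured counts, every dyno can compute its own assignment at boot with no coordination.&lt;/p&gt;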

&lt;p&gt;While straightforward, there were a few hurdles we encountered building this framework, including late arrival of data and managing offsets after an aggregation process restarts.&lt;/p&gt;

&lt;h3 id=&quot;late-arrivals&quot;&gt;Late Arrivals&lt;/h3&gt;

&lt;p&gt;While it’d be amazing if we could always guarantee that data would never arrive late, it’s just not possible. We regularly receive measurements as old as 2 minutes, and will count measurements up to 5 minutes old.&lt;/p&gt;

&lt;p&gt;We do this by keeping around the last 5 summary accumulators, and flushing them on change (after a timeout, of course) to the appropriate &lt;a href=&quot;https://kafka.apache.org/documentation.html#compaction&quot;&gt;compacted&lt;/a&gt; summary topic.&lt;/p&gt;
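&lt;p&gt;Conceptually, this looks something like the following Ruby sketch (hypothetical names; the real accumulators live in our Go consumers):&lt;/p&gt;

```ruby
# Keep buckets for the last 5 minutes so late measurements can still
# be counted; anything older is dropped.
WINDOW_MINUTES = 5

def accept?(now_min, measurement_min, window = WINDOW_MINUTES)
  age = now_min - measurement_min
  age.between?(0, window - 1)
end

buckets = Hash.new { |h, k| h[k] = [] }   # minute number mapped to values

now = 107
[107, 106, 103, 101].each do |m|
  buckets[m].push(:measurement) if accept?(now, m)
end
buckets.keys  # [107, 106, 103] -- the minute-101 measurement is too old
```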

&lt;h3 id=&quot;offset-management&quot;&gt;Offset Management&lt;/h3&gt;

&lt;p&gt;When the aggregation process restarts, the in-memory accumulators need to reflect the last state that they were in before the restart occurred. Because of this, we can’t simply start consuming from the latest offset, and it would be wrong to mark an offset that wasn’t yet committed to a summary.&lt;/p&gt;

&lt;p&gt;Therefore, we keep track of the first offset seen for every new time frame (&lt;em&gt;e.g.&lt;/em&gt; each minute) and mark the offset from 5 minutes ago after flushing the summaries for the current minute. This ensures that our state is rebuilt correctly, at the expense of extra topic publishes. (Did I mention our use of compacted topics?)&lt;/p&gt;
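&lt;p&gt;In sketch form (again in Ruby with hypothetical names), the bookkeeping looks like this:&lt;/p&gt;

```ruby
# Conservative offset marking: remember the first offset seen in each
# minute, and after flushing minute M, mark the offset recorded for
# minute M - 5. On restart, consumption resumes early enough to
# rebuild every accumulator that could still change.
first_offset = {}   # minute number mapped to first Kafka offset seen

def record(first_offset, minute, offset)
  first_offset[minute] ||= offset
end

def offset_to_mark(first_offset, current_minute)
  first_offset[current_minute - 5]
end

record(first_offset, 100, 4_000)
record(first_offset, 100, 4_050)   # ignored: not the first in minute 100
record(first_offset, 105, 9_000)
offset_to_mark(first_offset, 105)  # 4000
```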

&lt;h2 id=&quot;postgres-sink-another-yahtzee-component&quot;&gt;Postgres Sink (Another Yahtzee Component)&lt;/h2&gt;

&lt;p&gt;Each sink process consumes a number of partitions for each topic, batches them up by shard and type, and uses a Postgres &lt;a href=&quot;http://www.postgresql.org/docs/9.5/static/sql-copy.html&quot;&gt;COPY FROM&lt;/a&gt; statement to write 1024 summaries at a time. This performs quite well, and certainly better than an INSERT per summary.&lt;/p&gt;

&lt;p&gt;Our database schemas more or less match the layout of &lt;a href=&quot;https://devcenter.heroku.com/articles/log-runtime-metrics&quot;&gt;log-runtime-metrics&lt;/a&gt; data, but include additional columns like min, max, sum, sum of squares, and count for the metrics to aid in downsampling and storytelling purposes.&lt;/p&gt;
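&lt;p&gt;Those extra columns are what make exact downsampling possible: two summaries can be merged with simple additions plus a min and a max, so no raw data is needed. A hedged Ruby sketch of the merge:&lt;/p&gt;

```ruby
# Merging per-minute summary rows into a coarser rollup. Each row is
# a hash with count, sum, sum_sq, min, max columns (illustrative).
def merge(rows)
  {
    count:  rows.sum { |r| r[:count] },
    sum:    rows.sum { |r| r[:sum] },
    sum_sq: rows.sum { |r| r[:sum_sq] },
    min:    rows.map { |r| r[:min] }.min,
    max:    rows.map { |r| r[:max] }.max
  }
end

minute_rows = [
  { count: 2, sum: 3.0, sum_sq: 5.0,  min: 1.0, max: 2.0 },
  { count: 1, sum: 4.0, sum_sq: 16.0, min: 4.0, max: 4.0 }
]
merge(minute_rows)
# { count: 3, sum: 7.0, sum_sq: 21.0, min: 1.0, max: 4.0 }
```

&lt;p&gt;The sum of squares likewise merges by addition, which is what lets us derive variance for any rollup period later.&lt;/p&gt;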

&lt;h2 id=&quot;metrics-api&quot;&gt;Metrics API&lt;/h2&gt;

&lt;p&gt;The API is a simple read-only JSON-over-HTTP service. In addition to selecting the appropriate data from the appropriate shards and authenticating against the Heroku API, the Metrics API can downsample the 1-minute data on demand. This is how we continue to serve 10-minute roll-up data for the 24-hour views.&lt;/p&gt;

&lt;p&gt;We would like to eventually expose these endpoints officially as part of the Heroku API, but that work is not currently scheduled.&lt;/p&gt;

&lt;h2 id=&quot;heroku-kafka&quot;&gt;Heroku Kafka&lt;/h2&gt;

&lt;p&gt;That left Kafka. Until a &lt;a href=&quot;https://blog.heroku.com/archives/2016/4/26/announcing-heroku-kafka-early-access&quot;&gt;few weeks ago&lt;/a&gt;, there wasn’t really an option for getting Kafka as a service, but as Heroku insiders, we were alpha and now beta testers of Heroku Kafka long before the public beta was announced.&lt;/p&gt;

&lt;p&gt;We don’t complicate our usage of Kafka. Dyno load average measurements, like all of the other measurement types, for instance, have 3 topics associated with them. &lt;code class=&quot;highlighter-rouge&quot;&gt;DYNO_LOAD.&amp;lt;version&amp;gt;&lt;/code&gt; are raw measurements. A compacted &lt;code class=&quot;highlighter-rouge&quot;&gt;DYNO_LOAD_SUMMARY.&amp;lt;version&amp;gt;&lt;/code&gt; summarizes / rolls up measurements. The rollup period is contained within the message, making it possible for us to store 1 minute, and (potentially) 15-minute rollups in the same topic if we need to. Lastly, the &lt;code class=&quot;highlighter-rouge&quot;&gt;RETRY.DYNO_LOAD_SUMMARY.&amp;lt;version&amp;gt;&lt;/code&gt; topic is written to when we fail to write a summary to our Postgres sink. Another process consumes the &lt;code class=&quot;highlighter-rouge&quot;&gt;RETRY&lt;/code&gt; topics and attempts to persist to Postgres indefinitely.&lt;/p&gt;

&lt;p&gt;Each of these topics is created with 32 partitions, and a replication factor of 3, giving us a bit of wiggle room for failure. We partition messages to those 32 partitions based on the application’s UUID, which we internally call &lt;em&gt;owner&lt;/em&gt; since a measurement is “owned” by an app.&lt;/p&gt;

&lt;p&gt;Even though we continue to use owner for partitioning, our compacted summary topics produce messages with a key that includes the time period the summary covers. We do this with a custom partitioner that simply ignores the parts of the key that aren’t the owner.&lt;/p&gt;
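&lt;p&gt;A partitioner of that shape is only a few lines. This Ruby sketch (the real one lives in our Go code base, and the key format shown here is hypothetical) hashes only the owner prefix of the key:&lt;/p&gt;

```ruby
# Partitioner that hashes only the owner portion of the key, so a
# summary message lands on the same partition as its raw measurements
# even though the summary key also embeds a timestamp.
require 'zlib'

PARTITION_COUNT = 32

def partition_for(key)
  owner = key.split(':').first          # ignore the time component
  Zlib.crc32(owner) % PARTITION_COUNT
end

raw_partition     = partition_for('9f3c-uuid')
summary_partition = partition_for('9f3c-uuid:2016-05-26T00:01')
# raw_partition == summary_partition
```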

&lt;h2 id=&quot;heroku-postgres&quot;&gt;Heroku Postgres&lt;/h2&gt;

&lt;p&gt;Given our plan of storing only summarized data, we felt (and still believe) that a partitioned and sharded Postgres setup would work and made sense. As such, we have 7 shards, each running on the &lt;a href=&quot;https://heroku.com/postgres&quot;&gt;Heroku Postgres&lt;/a&gt; private-4 plan. We create a new schema each day with fresh tables for each metric type. This lets us implement our retention strategy with simple &lt;code class=&quot;highlighter-rouge&quot;&gt;DROP SCHEMA&lt;/code&gt; statements, which are far less expensive than &lt;code class=&quot;highlighter-rouge&quot;&gt;DELETE&lt;/code&gt; statements on large tables.&lt;/p&gt;
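&lt;p&gt;As an illustration (the schema naming here is hypothetical), daily schemas make retention a single cheap administrative statement rather than a scan over large tables:&lt;/p&gt;

```ruby
# Per-day schemas turn retention into DROP SCHEMA instead of a
# DELETE over large tables. Schema naming is hypothetical.
require 'date'

def schema_name(date)
  date.strftime('metrics_%Y%m%d')
end

def retention_sql(today, keep_days)
  expired = today - keep_days
  "DROP SCHEMA IF EXISTS #{schema_name(expired)} CASCADE;"
end

retention_sql(Date.new(2016, 5, 26), 30)
# "DROP SCHEMA IF EXISTS metrics_20160426 CASCADE;"
```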

&lt;p&gt;The owner column, as it has throughout, continues to be the shard key.&lt;/p&gt;

&lt;p&gt;While our performance is acceptable for our query load, it’s not where we’d like it to be, and we will be looking at ways to further optimize our use of Postgres in the future.&lt;/p&gt;

&lt;h2 id=&quot;lessons-and-considerations&quot;&gt;Lessons and Considerations&lt;/h2&gt;

&lt;p&gt;No system is perfect, and ours isn’t some magical exception. We’ve learned some things in this exercise that we think are worth pointing out.&lt;/p&gt;

&lt;h3 id=&quot;sharding-and-partitioning-strategies-matter&quot;&gt;Sharding and Partitioning Strategies Matter&lt;/h3&gt;

&lt;p&gt;Our strategy of using the owner column for our shard/partitioning key was a bit unfortunate, and is now hard to change. While we don’t currently see any ill effects from this, there are hypothetical situations in which it could pose a problem. For now, we watch dashboards and metrics to make sure this doesn’t happen, and we have a lot of confidence that the systems we’ve built upon would handle it in stride.&lt;/p&gt;

&lt;p&gt;Still, a better strategy would likely have been to shard on owner + process_type (e.g. web), which would have spread the load more evenly across the system. Beyond the more even distribution of data, from a product perspective it would mean that in a partial outage, some of an application’s metrics would remain available.&lt;/p&gt;

&lt;h3 id=&quot;extensibility-comes-easily-with-kafka&quot;&gt;Extensibility Comes Easily with Kafka&lt;/h3&gt;

&lt;p&gt;The performance of our Postgres cluster doesn’t worry us. As mentioned, it’s acceptable for now, but our architecture makes it trivial to swap out, or simply add, another data store to increase query throughput when that becomes necessary. We can do this by spinning up another Heroku app that uses shareable add-ons, consumes the summary topics, and writes them to a new data store, with no impact on the Postgres store!&lt;/p&gt;

&lt;p&gt;Our system is more powerful and more extensible because of Kafka.&lt;/p&gt;

&lt;h2 id=&quot;reliability-in-a-lossy-log-world&quot;&gt;Reliability in a Lossy Log World?&lt;/h2&gt;

&lt;p&gt;While we’re pretty confident about the data once it has been committed to Kafka, the story up until then is murkier. Heroku’s logging pipeline, on which metrics continue to be built, sits on top of &lt;a href=&quot;https://devcenter.heroku.com/articles/logplex#best-effort-delivery&quot;&gt;lossy-by-design systems&lt;/a&gt;. The logging team, of course, monitors this loss and keeps it at an acceptable level, but it means that we may miss some measurements from time to time. Small loss events are typically absorbed by our 1-minute rollup strategy. Larger loss events due to system outages are, historically and fortunately, very rare.&lt;/p&gt;

&lt;p&gt;As we look to build more features on top of our metrics pipeline that require greater reliability in the underlying metrics, we’re also looking at ways in which we can ensure our metrics end up in Kafka. This isn’t the end of the discussion on reliability, but rather just the beginning.&lt;/p&gt;

&lt;h2 id=&quot;in-conclusion&quot;&gt;In Conclusion?&lt;/h2&gt;

&lt;p&gt;This isn’t the end of the story, but rather just the humble beginnings. We’ll continue to evolve this system based on internal monitoring and user feedback.&lt;/p&gt;

&lt;p&gt;In addition, we rebuilt our metrics pipeline because there are operational experiences we wish to deliver that now become dramatically easier. We’ve prototyped a few of them and hope to start delivering on them rapidly.&lt;/p&gt;

&lt;p&gt;Finally, it goes without saying that we think Kafka is a big deal. We hope this will inspire you to wonder what types of things Kafka can enable for your apps. And, of course, we can only hope that you’ll trust Heroku to &lt;a href=&quot;https://blog.heroku.com/archives/2016/4/26/announcing-heroku-kafka-early-access&quot;&gt;run that cluster for you&lt;/a&gt;!&lt;/p&gt;
</description>
        <pubDate>Thu, 26 May 2016 00:00:00 +0000</pubDate>
        <link>http://engineering.heroku.com/blogs/2016-05-26-heroku-metrics-there-and-back-again/</link>
        <guid isPermaLink="true">http://engineering.heroku.com/blogs/2016-05-26-heroku-metrics-there-and-back-again/</guid>

        

        

      </item>
    
      <item>
        <title>Simulate Third-Party Downtime</title>
        <description>&lt;p&gt;I spend most of my time at Heroku working on our support tools and services; &lt;a href=&quot;https://help.heroku.com/&quot;&gt;help.heroku.com&lt;/a&gt; is one such example. Heroku’s help application depends on the &lt;a href=&quot;https://devcenter.heroku.com/articles/platform-api-reference&quot;&gt;Platform API&lt;/a&gt; to, amongst other things, authenticate users, authorize or deny access, and fetch user data.&lt;/p&gt;

&lt;p&gt;So, what happens to tools and services like help.heroku.com during a platform incident? &lt;strong&gt;They must remain available&lt;/strong&gt; to both agents and customers—regardless of the status of the Platform API. There is simply no substitute for communication during an outage.&lt;/p&gt;

&lt;p&gt;To ensure this is the case, we use &lt;a href=&quot;https://github.com/heroku/api-maintenance-sim&quot;&gt;api-maintenance-sim&lt;/a&gt;, an app we recently open-sourced, to regularly simulate Platform API incidents.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/this-is-fine.jpg&quot; alt=&quot;this-is-fine&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;simulating-downtime&quot;&gt;Simulating downtime&lt;/h2&gt;

&lt;p&gt;During a Platform API incident, the API is disabled. All requests receive a 503 (service unavailable) HTTP response. This is a simple behaviour that we can imitate on demand with &lt;a href=&quot;https://github.com/heroku/api-maintenance-sim&quot;&gt;api-maintenance-sim&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At its core, api-maintenance-sim responds to every request with a 503 HTTP status, as shown below.&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;mi&quot;&gt;503&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Content-Type&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;application/json&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;no&quot;&gt;StringIO&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sx&quot;&gt;%q|{ &quot;id&quot;: &quot;maintenance&quot;, &quot;message&quot;: &quot;Heroku API is temporarily unavailable.\nFor more information, visit: https://status.heroku.com&quot; }|&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Once deployed, we begin the simulation by directing the app we’re testing to use a custom hostname, rather than the default api.heroku.com.&lt;/p&gt;

&lt;p&gt;Here’s an example using the &lt;a href=&quot;https://github.com/heroku/platform-api&quot;&gt;platform-api&lt;/a&gt; gem.&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;no&quot;&gt;PlatformAPI&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;connect_oauth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current_user&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;oauth_token&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;url: &lt;/span&gt;&lt;span class=&quot;no&quot;&gt;ENV&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'PLATFORM_API_URL'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;If &lt;code class=&quot;highlighter-rouge&quot;&gt;PLATFORM_API_URL&lt;/code&gt; is not configured, it will default to nil, which the gem will replace with the production URI. If it’s defined, however, you will be using the hostname of your choice.&lt;/p&gt;

&lt;p&gt;You can now use &lt;a href=&quot;https://devcenter.heroku.com/articles/config-vars&quot;&gt;config vars&lt;/a&gt; to start the simulation.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gp&quot;&gt;$ &lt;/span&gt;heroku config:set &lt;span class=&quot;nv&quot;&gt;PLATFORM_API_URL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;https://my-simulation-app.herokuapp.com
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;It’s not a matter of &lt;em&gt;if&lt;/em&gt; an incident will occur; it’s a matter of &lt;em&gt;when&lt;/em&gt;. Running regular simulations is an easy way to improve your application’s stability, or at the very least, to understand what failure will mean for your application or service.&lt;/p&gt;
</description>
        <pubDate>Tue, 01 Mar 2016 00:00:00 +0000</pubDate>
        <link>http://engineering.heroku.com/blogs/2016-03-01-simulate-downtime/</link>
        <guid isPermaLink="true">http://engineering.heroku.com/blogs/2016-03-01-simulate-downtime/</guid>

        

        

      </item>
    
      <item>
        <title>Speeding up Sprockets</title>
        <description>&lt;p&gt;The asset pipeline is the slowest part of deploying a Rails app. How slow? On average, it’s over 20x slower than installing dependencies via &lt;code class=&quot;highlighter-rouge&quot;&gt;$ bundle install&lt;/code&gt;. Why so slow? In this article, we’re going to take a look at some of the reasons the asset pipeline is slow and how we were able to get a 12x performance improvement on some apps with &lt;a href=&quot;https://rubygems.org/gems/sprockets/versions/3.5.2&quot;&gt;Sprockets version 3.3+&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Rails asset pipeline uses the &lt;a href=&quot;https://github.com/rails/sprockets&quot;&gt;sprockets&lt;/a&gt; library to take your raw assets, such as JavaScript or Sass files, and pre-build minified, compressed assets that are ready to be served by a production web server. The process is inherently slow. For example, compiling a Sass file to CSS requires reading the file in, which involves expensive hard disk reads. Then sprockets processes it, generating a unique “fingerprint” (or digest) for the file before compressing it by removing whitespace or, in the case of JavaScript, running a minifier. All of this is fairly CPU-intensive. Assets can import other assets, so to compile one asset, for example &lt;code class=&quot;highlighter-rouge&quot;&gt;app/assets/javascripts/application.js&lt;/code&gt;, multiple files may have to be read and stored in memory. In short, sprockets consumes all three of your most valuable resources: memory, disk IO, and CPU.&lt;/p&gt;

&lt;p&gt;Since asset compilation is expensive, the best way to get faster is not to compile. Or at least, not to compile the same assets twice. To do this effectively, we have to store the metadata that sprockets needs to build an asset, so we can determine which assets have changed and need to be re-compiled. Sprockets provides a cache system on disk at &lt;code class=&quot;highlighter-rouge&quot;&gt;tmp/cache/assets&lt;/code&gt;. If the path and mtime haven’t changed for an asset, then we can load the entire asset from the cache. To accomplish this, sprockets uses the cache to store a compiled file’s digest.&lt;/p&gt;

&lt;p&gt;This code looks something like:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# https://github.com/rails/sprockets/blob/543a5a27190c26de8f3a1b03e18aed8da0367c63/lib/sprockets/base.rb#L46-L57&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;file_digest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;File&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;stat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fetch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;file_digest:&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;mtime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;to_i&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt;
      &lt;span class=&quot;no&quot;&gt;Digest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;SHA256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;to_s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;digest&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Now that we have a file’s digest, we can use this information to load the asset. Can you spot the problem with the code above?&lt;/p&gt;

&lt;p&gt;If you can’t, I don’t blame you—the variables are misleading. &lt;code class=&quot;highlighter-rouge&quot;&gt;path&lt;/code&gt; should have been renamed &lt;code class=&quot;highlighter-rouge&quot;&gt;absolute_path&lt;/code&gt; as that’s what’s passed into this method. So if you precompile your project from different directories, you’ll end up with different cache keys. Depending on the root directory where it was compiled, the same file could generate a cache key of:
&lt;code class=&quot;highlighter-rouge&quot;&gt;&quot;file_digest:/Users/schneems/my_project/app/assets/javascripts/application.js:123456&quot;&lt;/code&gt; 
or: 
&lt;code class=&quot;highlighter-rouge&quot;&gt;&quot;file_digest:/+Other/path/+my_project/app/assets/javascripts/application.js:123456&quot;&lt;/code&gt;.&lt;/p&gt;
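&lt;p&gt;A standalone sketch (a hypothetical helper, not Sprockets’ actual code) makes the problem easy to see: the same file, with the same mtime, produces a different key under each build root.&lt;/p&gt;

```ruby
# Hypothetical sketch of an mtime-based cache key that embeds whatever
# path is passed in -- an absolute path poisons cache reuse.
def file_digest_key(path, mtime)
  "file_digest:#{path}:#{mtime}"
end

relative = "app/assets/javascripts/application.js"
mtime    = 123456

key_a = file_digest_key("/Users/schneems/my_project/#{relative}", mtime)
key_b = file_digest_key("/tmp/build_1234/#{relative}", mtime)

key_a == key_b  # => false: same file, same mtime, guaranteed cache miss
```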

&lt;p&gt;Quite a few Ruby systems are deployed using Capistrano, where it’s common to upload each new version to its own directory and set up symlinks, so that rolling back a bad deploy only requires updating the symlinks. When you try to re-use a cache directory with this deploy strategy, the cache keys end up being different every time. So even when you don’t need to re-compile your assets, sprockets will go through the whole compilation process, only stopping at the very last step when it sees the file already exists:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# https://github.com/rails/sprockets/blob/543a5a27190c26de8f3a1b03e18aed8da0367c63/lib/sprockets/manifest.rb#L182-L187&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;File&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;exist?&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;logger&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;debug&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Skipping &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;, already exists&quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;logger&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;info&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Writing &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;asset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;write_to&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;In this scenario Sprockets 3.x+ ends up using nothing from the cache, and as has been reported in &lt;a href=&quot;https://github.com/rails/sprockets/issues/59&quot;&gt;issue #59&lt;/a&gt;, unless you’re in debug mode you wouldn’t know there’s a problem, because nothing is logged to standard out.&lt;/p&gt;

&lt;p&gt;It turns out it’s not just an issue for people deploying via Capistrano. Every time you run a &lt;code class=&quot;highlighter-rouge&quot;&gt;$ git push heroku master&lt;/code&gt; your build happens on a different temp path that is passed into the buildpack. So even though Heroku stores the cache between deploys, the keys aren’t reused.&lt;/p&gt;

&lt;h2 id=&quot;the-almost-fix&quot;&gt;The (almost) fix&lt;/h2&gt;

&lt;p&gt;The first fix was very straightforward. A &lt;a href=&quot;https://github.com/rails/sprockets/pull/89&quot;&gt;new helper class&lt;/a&gt; called &lt;code class=&quot;highlighter-rouge&quot;&gt;UnloadedAsset&lt;/code&gt; takes care of generating cache keys and converting absolute paths to relative ones:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;UnloadedAsset.new(path, self).file_digest_key(stat.mtime.to_i)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;In our previous example we would get a cache key of &lt;code class=&quot;highlighter-rouge&quot;&gt;&quot;file_digest:/app/assets/javascripts/application.js:123456&quot;&lt;/code&gt; regardless of which directory you’re in. So we’re done, right?&lt;/p&gt;

&lt;p&gt;As it turns out, cache keys were only part of the problem. To understand why, we must look at how sprockets uses our &lt;code class=&quot;highlighter-rouge&quot;&gt;file_digest_key&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;pulling-an-asset-from-cache&quot;&gt;Pulling an asset from cache&lt;/h2&gt;

&lt;p&gt;Having an asset’s digest isn’t enough. We need to make sure none of its dependencies have changed. For example, to use the jQuery library inside another javascript file, we’d use the &lt;code class=&quot;highlighter-rouge&quot;&gt;//= require&lt;/code&gt; directive like:&lt;/p&gt;

&lt;div class=&quot;language-js highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;//= require jquery&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//= require ./foo.js&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;magicNumber&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;42&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;If either &lt;code class=&quot;highlighter-rouge&quot;&gt;jquery&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;foo.js&lt;/code&gt; change, then we must recompute our asset. This is a somewhat trivial example, but each required asset could require another asset. So if we wanted to find all dependencies, we would have to read our primary asset into memory to see what files it’s requiring and then read in all of those other files; exactly what we’re trying to avoid. So sprockets stores dependency information in the cache.&lt;/p&gt;
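&lt;p&gt;To get a feel for why this traversal is worth caching, here is an illustrative (and deliberately naive) scan of relative &lt;code class=&quot;highlighter-rouge&quot;&gt;//= require&lt;/code&gt; directives; real sprockets also resolves logical paths like &lt;code class=&quot;highlighter-rouge&quot;&gt;jquery&lt;/code&gt; against its load paths:&lt;/p&gt;

```ruby
# Illustrative only, not Sprockets' real resolver: recursively read each
# file and follow its relative requires, collecting every file touched.
require "set"
require "pathname"

def dependencies(path, seen = Set.new)
  path = Pathname.new(path).expand_path
  return seen if seen.include?(path)
  seen.add(path)
  path.read.scan(%r{^//= require (\S+)$}) do |(target)|
    # only follow relative requires like `./foo.js`
    dependencies(path.dirname.join(target), seen) if target.start_with?("./")
  end
  seen
end
```

&lt;p&gt;Every file in the tree is read from disk on every cold compile, which is exactly the IO the dependency cache avoids.&lt;/p&gt;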

&lt;p&gt;Using this cache key:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;s2&quot;&gt;&quot;asset-uri-cache-dependencies:&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compressed_path&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt; &lt;span class=&quot;vi&quot;&gt;@env&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;file_digest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Sprockets will return a set of “dependencies.”&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#&amp;lt;Set: {&quot;file-digest///Users/schneems/ruby/2.2.3/gems/jquery-rails-4.0.4&quot;, &quot;file-digest///Users/schneems/app/assets/javascripts/foo.js&quot;}&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;To see if either of these has changed, Sprockets will pull their digests from the cache like we did with our first &lt;code class=&quot;highlighter-rouge&quot;&gt;application.js&lt;/code&gt; asset. These are used to “resolve” an asset. If the resolved assets (and their dependencies) have been previously loaded and stored in the cache, then we can pull our asset from cache:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# https://github.com/rails/sprockets/blob/9ca80fe00971d45ccfacb6414c73d5ffad96275f/lib/sprockets/loader.rb#L55-L58&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;digest&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;DigestUtils&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;digest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;resolve_dependencies&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uri_from_cache&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unloaded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;digest_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;digest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;asset_from_cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;UnloadedAsset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uri_from_cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;asset_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;But now our dependencies contain full paths. To fix this, we have to “compress” any absolute paths: if a path is inside the root of our project, we store only the relative portion.&lt;/p&gt;

&lt;p&gt;Of course, it’s never that simple.&lt;/p&gt;

&lt;h2 id=&quot;absolute-paths-everywhere&quot;&gt;Absolute paths everywhere&lt;/h2&gt;

&lt;p&gt;In the last section I mentioned that we would get a file digest by resolving an asset from &lt;code class=&quot;highlighter-rouge&quot;&gt;&quot;file-digest///Users/schneems/app/assets/javascripts/foo.js&quot;&lt;/code&gt;. That turns out to be a pretty involved process, requiring a bunch of other data from the cache, which, as you guessed, can contain absolute file paths. The short list includes: asset filenames, asset URIs, load paths, and included paths, all of which we handled in &lt;a href=&quot;https://github.com/rails/sprockets/pull/101&quot;&gt;Pull Request #101&lt;/a&gt;. But wait, we’re not finished; the list goes on: stubbed paths, link paths, required paths (not the same as dependencies), and sass dependencies, all of which we handled in &lt;a href=&quot;https://github.com/rails/sprockets/pull/109&quot;&gt;Pull Request #109&lt;/a&gt;. Phew.&lt;/p&gt;

&lt;p&gt;The final solution? A pattern of “compressing” URIs and absolute paths before they are added to the cache and “expanding” them back to full paths as they’re taken out. &lt;a href=&quot;https://github.com/rails/sprockets/blob/9ca80fe00971d45ccfacb6414c73d5ffad96275f/lib/sprockets/uri_tar.rb&quot;&gt;URITar&lt;/a&gt; was introduced to handle this compression/expansion logic.&lt;/p&gt;
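&lt;p&gt;The core of the idea can be sketched with Ruby’s &lt;code class=&quot;highlighter-rouge&quot;&gt;Pathname&lt;/code&gt;. These are hypothetical helpers; the real &lt;code class=&quot;highlighter-rouge&quot;&gt;URITar&lt;/code&gt; also deals with URI schemes and, as we’ll see, Windows paths:&lt;/p&gt;

```ruby
# Hypothetical compress/expand helpers in the spirit of URITar.
require "pathname"

ROOT = Pathname.new("/app")

# "Compress": store paths inside the project root as relative paths.
def compress(path)
  p = Pathname.new(path)
  if p.to_s.start_with?("#{ROOT}/")
    p.relative_path_from(ROOT).to_s
  else
    p.to_s
  end
end

# "Expand": restore the full path when reading the cache back out.
def expand(path)
  p = Pathname.new(path)
  if p.absolute?
    p.to_s
  else
    ROOT.join(p).to_s
  end
end

compress("/app/assets/javascripts/application.js")  # => "assets/javascripts/application.js"
expand("assets/javascripts/application.js")         # => "/app/assets/javascripts/application.js"
```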

&lt;p&gt;All of this is available in &lt;a href=&quot;https://rubygems.org/gems/sprockets/versions/3.3.3&quot;&gt;Sprockets version 3.3+&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;portability-for-all&quot;&gt;Portability for all&lt;/h2&gt;

&lt;p&gt;When we tested with an example app, we saw virtually no change to the initial compile time (around 38 seconds). The second compile? 3 seconds. Roughly a 12x speed increase when using compiled assets and deploying with Capistrano or Heroku. Not bad.&lt;/p&gt;

&lt;p&gt;Parts of the &lt;code class=&quot;highlighter-rouge&quot;&gt;URITar&lt;/code&gt; class were not written with other filesystems in mind, notably Windows; this was fixed in &lt;a href=&quot;https://github.com/rails/sprockets/pull/125/commits&quot;&gt;Pull Request #125&lt;/a&gt; and released in version 3.3.4. If you’re going to write code that touches the filesystems of different operating systems, remember to use a portable interface.&lt;/p&gt;
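&lt;p&gt;In this context, a “portable interface” mostly means letting the standard library own the separator. A small sketch:&lt;/p&gt;

```ruby
# Build paths with File.join and split them with Pathname, rather than
# hard-coding "/" as the separator -- the standard library knows the
# platform's conventions.
require "pathname"

parts = ["app", "assets", "javascripts", "application.js"]

portable   = File.join(*parts)                      # platform-appropriate separators
round_trip = Pathname.new(portable).each_filename.to_a

round_trip == parts  # => true on any platform
```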

&lt;h2 id=&quot;into-the-future&quot;&gt;Into the future&lt;/h2&gt;

&lt;p&gt;Sprockets was originally authored by one prolific programmer, &lt;a href=&quot;https://github.com/rails/sprockets/graphs/contributors&quot;&gt;Josh Peek&lt;/a&gt;. He’s since stepped away from the project and has given maintainership to the Rails core team. Sprockets 4 is being worked on with support for source maps. If you’re running a version of Sprockets 2.x you should try to upgrade to Sprockets 3.5+, as Sprockets 3 is intended to be an upgrade path to Sprockets 4. For help upgrading see the &lt;a href=&quot;https://github.com/rails/sprockets/blob/3.x/UPGRADING.md&quot;&gt;upgrade docs in the 3.x branch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Sprockets version 3.0 beta was released in September 2014; it took nearly a year for a bug report to come in alerting maintainers to the problem. In addition to upgrading Sprockets, I invite you to open up issues at &lt;a href=&quot;https://github.com/rails/sprockets&quot;&gt;rails/sprockets&lt;/a&gt; and let us know about bugs in the latest released version of Sprockets. Without bug reports and example apps to reproduce problems, we can’t make the library better.&lt;/p&gt;

&lt;p&gt;This performance patch was much more involved than I could have imagined when I got started, but I’m very pleased with the results. I’m excited to see how this affects overall performance numbers at Heroku—hopefully you’ll be able to see some pretty good speed increases.&lt;/p&gt;

&lt;p&gt;Thanks for reading, now go and upgrade your sprockets.&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;Schneems writes code for Heroku and likes working on open source performance patches. You can find him on his &lt;a href=&quot;http://www.schneems.com&quot;&gt;personal site&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Mon, 22 Feb 2016 00:00:00 +0000</pubDate>
        <link>http://engineering.heroku.com/blogs/2016-02-18-speeding-up-sprockets/</link>
        <guid isPermaLink="true">http://engineering.heroku.com/blogs/2016-02-18-speeding-up-sprockets/</guid>

        

        

      </item>
    
      <item>
        <title>Introducing React Refetch</title>
        <description>&lt;p&gt;Heroku has years of experience operating our world-class platform, and we have developed many internal tools to operate it along the way; however, with the introduction of &lt;a href=&quot;https://www.heroku.com/private-spaces&quot;&gt;Heroku Private Spaces&lt;/a&gt;, much of the infrastructure was built from the ground up and we needed new tools to operate this new platform. At the center of this, we built a new operations console to give ourselves a bird’s eye view of the entire system, be able to drill down into issues in a particular space, and everything in between.&lt;/p&gt;

&lt;p&gt;The operations console is a single-page &lt;a href=&quot;https://facebook.github.io/react/&quot;&gt;React&lt;/a&gt; application with a reverse proxy on the backend to securely access data from a variety of sources. The console itself started off from a mashup of a few different applications, all of which happened to be using React, but all three were using different methods of loading data into the components. One was manually loading data into component state with jQuery, one was using mixins to do basically the same thing, and one was using classic &lt;a href=&quot;https://facebook.github.io/flux/docs/overview.html&quot;&gt;Flux&lt;/a&gt; to load data into stores via actions. We obviously needed to standardize on a way to load data, but we weren’t really happy with any of our existing methods. Loading data into state made components smarter and more mutable than they needed to be, and these problems only became worse with more data sources. We liked the general idea of unidirectional flow and division of responsibility that the Flux architecture introduced, but it also brought a lot of boilerplate and complexity with it.&lt;/p&gt;

&lt;p&gt;Looking around for alternatives, &lt;a href=&quot;http://redux.js.org/&quot;&gt;Redux&lt;/a&gt; was the Flux-like library du jour, and it did seem very promising. We loved how the &lt;a href=&quot;https://github.com/rackt/react-redux&quot;&gt;React Redux&lt;/a&gt; bindings used pure functions to select state from the store and higher-order functions to inject and bind that state and actions into otherwise stateless components. We started to move down the path of standardizing on Redux, but there was something that felt wrong about loading and reducing data into the global store only to select it back out again. This pattern makes a lot of sense when an application is actually maintaining client-side state that needs to be shared between components or cached in the browser, but when components are just loading data from a server and rendering it, it can be overkill.&lt;/p&gt;

&lt;p&gt;Furthermore, we realized that all of our application’s state was already represented in the URL. We decided to embrace this. For something like an operations console, this is a really important property. This means that if an engineer is diagnosing an issue, he or she can send the URL to a colleague who can load it up and see the same thing. Of course, this is nothing new – this is how URLs are supposed to work by locating unique resources; however, this has been lost to a certain degree with single page applications and the shift to moving application state to the browser. Using something like &lt;a href=&quot;https://github.com/rackt/react-router&quot;&gt;React Router&lt;/a&gt;, it becomes very easy to keep URLs front and center, maintain state in dynamic parameters, and pass them down to components as props.&lt;/p&gt;

&lt;p&gt;With the application’s state represented in the URL, all we needed to do was translate those props into requests to fetch the actual data from backend services. To do this, we built a new library called &lt;a href=&quot;https://github.com/heroku/react-refetch&quot;&gt;React Refetch&lt;/a&gt;. Similar to the React Redux bindings, components are wrapped in a connector and given a pure function to select data that is injected as props when the component mounts. The difference is that instead of selecting the data from the global store, React Refetch pulls the data from remote servers. The other notable difference is that because the data is automatically fetched when the component mounts, there’s no need to manually fire off actions to get the data into the store in the first place. In fact, there is no store at all. All state is maintained as immutable props, which are ultimately controlled by the URL. When the URL changes, the props change, which recalculates the requests, new data is fetched, and it is reinjected into the components. All of this is done simply and declaratively with no stores, no callbacks, no actions, no reducers, no switch statements – just a function that maps props to requests.&lt;/p&gt;

&lt;p&gt;This is best shown with an example. Let’s say we have the following route (note, this example uses React Router and &lt;a href=&quot;https://babeljs.io/blog/2015/06/07/react-on-es6-plus/&quot;&gt;ES6&lt;/a&gt; syntax, but these are not requirements):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-jsx&quot;&gt;&amp;lt;Route path=&quot;/users/:userId&quot; component={Profile}/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This passes down the &lt;code class=&quot;highlighter-rouge&quot;&gt;userId&lt;/code&gt; param as a prop into the &lt;code class=&quot;highlighter-rouge&quot;&gt;Profile&lt;/code&gt; component:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-jsx&quot;&gt;import React, { Component, PropTypes } from 'react'

export default class Profile extends Component {
  static propTypes = {
    params: PropTypes.shape({
      userId: PropTypes.string.isRequired,
    }).isRequired
  }

  render() {
    // TODO
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we know which user to load, but how do we actually load the data from the server? This is where React Refetch comes in. We simply define a pure function to map the props to a request:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-jsx&quot;&gt;(props) =&amp;gt; ({
   userFetch: `/api/users/${props.params.userId}`
})
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is then wrapped up in &lt;code class=&quot;highlighter-rouge&quot;&gt;connect()&lt;/code&gt; and we pass in our component:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-jsx&quot;&gt;export default connect((props) =&amp;gt; ({
  userFetch: `/api/users/${props.params.userId}`
}))(Profile)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When &lt;code class=&quot;highlighter-rouge&quot;&gt;Profile&lt;/code&gt; mounts, the request will automatically be calculated and the data fetched from the server. As soon as the request is fired off, the &lt;code class=&quot;highlighter-rouge&quot;&gt;userFetch&lt;/code&gt; prop is injected into the component. This prop is a &lt;code class=&quot;highlighter-rouge&quot;&gt;PromiseState&lt;/code&gt;, which is a composable representation of the &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;Promise&lt;/code&gt;&lt;/a&gt; of the data at a particular point in time. While the request is still in flight, it is &lt;code class=&quot;highlighter-rouge&quot;&gt;pending&lt;/code&gt;, but once the response is received, it will be either &lt;code class=&quot;highlighter-rouge&quot;&gt;fulfilled&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;rejected&lt;/code&gt;. This makes it easy to reason about and render these different states as an atomic unit rather than a group of variables loosely connected with some naming convention. Now we can fill in our &lt;code class=&quot;highlighter-rouge&quot;&gt;render()&lt;/code&gt; function like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-jsx&quot;&gt;render() {
  const { userFetch } = this.props 

  if (userFetch.pending) {
    return &amp;lt;LoadingAnimation/&amp;gt;
  } else if (userFetch.rejected) {
    return &amp;lt;Error error={userFetch.reason}/&amp;gt;
  } else if (userFetch.fulfilled) {
    return &amp;lt;User user={userFetch.value}/&amp;gt;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If we want to display a different user, just change the URL in the browser. With React Router, this can be done with either a &lt;a href=&quot;https://github.com/rackt/react-router/blob/master/docs/API.md#link&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;Link/&amp;gt;&lt;/code&gt;&lt;/a&gt; or programmatically with &lt;a href=&quot;https://github.com/rackt/react-router/blob/master/docs/API.md#pushstatestate-pathname-query&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;history.pushState()&lt;/code&gt;&lt;/a&gt;. Of course, manual changes to the URL work too. This triggers the &lt;code class=&quot;highlighter-rouge&quot;&gt;userId&lt;/code&gt; prop to change; React Refetch will recalculate the request, fetch new data from the server, and inject a new &lt;code class=&quot;highlighter-rouge&quot;&gt;userFetch&lt;/code&gt; into the component. In this new world, state changes look like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/react-refetch-flow.svg&quot; alt=&quot;react-refetch data flow&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is the simplest use case of React Refetch, but it demonstrates the basic flow. The library also supports many other options such as &lt;a href=&quot;https://github.com/heroku/react-refetch#composing-responses&quot;&gt;composing multiple data sources&lt;/a&gt;, &lt;a href=&quot;https://github.com/heroku/react-refetch#chaining-requests&quot;&gt;chaining requests&lt;/a&gt;, &lt;a href=&quot;https://github.com/heroku/react-refetch#automatic-refreshing&quot;&gt;periodically refreshing data&lt;/a&gt;, &lt;a href=&quot;https://github.com/heroku/react-refetch/blob/master/README.md#accessing-headers--metadata&quot;&gt;custom headers and methods&lt;/a&gt;, &lt;a href=&quot;https://github.com/heroku/react-refetch#lazy-loading&quot;&gt;lazy loading data&lt;/a&gt;, and even &lt;a href=&quot;https://github.com/heroku/react-refetch#posting-data&quot;&gt;writing data to the server&lt;/a&gt;. Several of these features leverage “&lt;a href=&quot;https://github.com/heroku/react-refetch/blob/master/README.md#fetch-functions&quot;&gt;fetch functions&lt;/a&gt;” which allow the mappings to be calculated in response to user actions. Instead of requests being fired off immediately or when props change, the functions are bound to the props and can be called later with additional arguments. This is a powerful feature that provides data control to the component while still maintaining one-way data flow.&lt;/p&gt;
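&lt;p&gt;As a sketch of what a richer mapping can look like (the option names below mirror the readme linked above, but treat the endpoints and prop names as hypothetical):&lt;/p&gt;

```javascript
// The mapping is a pure function, so it can be exercised without React;
// in the app it would be wrapped as connect(mapPropsToRequests)(Profile).
const mapPropsToRequests = (props) => ({
  // plain URL string: fetched as soon as the component mounts
  userFetch: `/api/users/${props.params.userId}`,

  // object form: re-fetched every 10 seconds to keep the console fresh
  statusFetch: {
    url: `/api/users/${props.params.userId}/status`,
    refreshInterval: 10000,
  },

  // "fetch function": nothing is sent until the component calls
  // props.updateUser(...), e.g. from a form submit handler
  updateUser: (firstName) => ({
    userUpdateResponse: {
      url: `/api/users/${props.params.userId}`,
      method: 'PATCH',
      body: JSON.stringify({ firstName }),
    },
  }),
})
```

&lt;p&gt;Because the mapping is just a function of props, the same URL-driven flow described above covers all three cases.&lt;/p&gt;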

&lt;p&gt;While building the operations console, we experienced a lot of trial and error learning how best to load remote data into React. Architectures like Flux and Redux can be wonderful if an application requires complex client-side state; however, if components just need to load data from a handful of sources and have no need to maintain that state in the browser after it renders, React Refetch can be a simpler alternative. React Refetch is available through npm as &lt;a href=&quot;https://www.npmjs.com/package/react-refetch&quot;&gt;react-refetch&lt;/a&gt;, and many more examples are shown in the &lt;a href=&quot;https://github.com/heroku/react-refetch#readme&quot;&gt;project’s &lt;/a&gt;&lt;a href=&quot;https://github.com/heroku/react-refetch#readme&quot;&gt;readme&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Wed, 16 Dec 2015 00:00:00 +0000</pubDate>
        <link>http://engineering.heroku.com/blogs/2015-12-16-react-refetch/</link>
        <guid isPermaLink="true">http://engineering.heroku.com/blogs/2015-12-16-react-refetch/</guid>

        

        

      </item>
    
  </channel>
</rss>
