<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xml" href="http://githubengineering.com/feed.xslt.xml"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="http://jekyllrb.com" version="3.3.1">Jekyll</generator><link href="http://githubengineering.com/atom.xml" rel="self" type="application/atom+xml" /><link href="http://githubengineering.com/" rel="alternate" type="text/html" /><updated>2016-12-21T18:37:27+00:00</updated><id>http://githubengineering.com//</id><title type="html">GitHub Engineering</title><subtitle>The Blog of the GitHub Engineering Team</subtitle><author><name>GitHub Engineering</name></author><entry><title type="html">Orchestrator at GitHub</title><link href="http://githubengineering.com/orchestrator-github/" rel="alternate" type="text/html" title="Orchestrator at GitHub" /><published>2016-12-08T00:00:00+00:00</published><updated>2016-12-08T00:00:00+00:00</updated><id>http://githubengineering.com/orchestrator-github</id><content type="html" xml:base="http://githubengineering.com/orchestrator-github/">&lt;p&gt;GitHub uses MySQL to store its metadata: Issues, Pull Requests, comments, organizations, notifications and so forth. While &lt;code class=&quot;highlighter-rouge&quot;&gt;git&lt;/code&gt; repository data does not need MySQL to exist and persist, GitHub’s service does. Authentication, API, and the website itself all require the availability of our MySQL fleet.&lt;/p&gt;

&lt;p&gt;Our replication topologies span multiple data centers and this poses a challenge not only for availability but also for manageability and operations.&lt;/p&gt;

&lt;h3 id=&quot;automated-failovers&quot;&gt;Automated failovers&lt;/h3&gt;

&lt;p&gt;We use a classic MySQL master-replicas setup, where the master is the single writer, and replicas are mainly used for read traffic. We expect our MySQL fleet to be available for writes. Placing a review, creating a new repository, adding a collaborator, all require write access to our backend database. We require the master to be available.&lt;/p&gt;

&lt;p&gt;To that effect we employ automated master failovers. The time it would take a human to wake up and fix a failed master exceeds our availability expectations, and operating such a failover is sometimes non-trivial. We expect master failures to be automatically detected and recovered within 30 seconds or less, and we expect failover to result in minimal loss of available hosts.&lt;/p&gt;

&lt;p&gt;We also expect to avoid false positives and false negatives. Failing over when there’s no failure is wasteful and should be avoided. Not failing over when failover should take place means an outage. Flapping is unacceptable. And so there must be a reliable detection mechanism that makes the right choice and takes a predictable course of action.&lt;/p&gt;

&lt;h3 id=&quot;orchestrator&quot;&gt;orchestrator&lt;/h3&gt;

&lt;p&gt;We employ &lt;a href=&quot;https://github.com/github/orchestrator&quot;&gt;Orchestrator&lt;/a&gt; to manage our MySQL failovers. &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; is an open source MySQL replication management and high availability solution. It observes MySQL replication topologies, auto-detects topology layout and changes, understands replication rules across configurations and versions, detects failure scenarios and recovers from master and intermediate master failures.&lt;/p&gt;

&lt;p&gt;&lt;img alt=&quot;orchestrator logo&quot; src=&quot;/images/orchestrator-github/orchestrator-logo-wide.png&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;failure-detection&quot;&gt;Failure detection&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; takes a different approach to failure detection than the common monitoring tools. The common way to detect master failure is by observing the master: via ping, via simple port scan, via simple &lt;code class=&quot;highlighter-rouge&quot;&gt;SELECT&lt;/code&gt; query. These tests all suffer from the same problem: &lt;em&gt;What if there’s an error?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Network glitches can happen; the monitoring tool itself may be network partitioned. The naive solutions are along the lines of “try several times at fixed intervals, and on the n-th successive failure, assume the master has failed”. While repeated polling works, such checks tend to lead to false positives and to increased outages: the smaller &lt;em&gt;n&lt;/em&gt; is (or the smaller the interval is), the more potential there is for a false positive, as short network glitches will cause unjustified failovers. However, larger &lt;em&gt;n&lt;/em&gt; values (or longer poll intervals) will delay detection of a true failure.&lt;/p&gt;

&lt;p&gt;A better approach employs multiple observers, all, or at least a majority, of whom must agree that the master has failed. This reduces the danger of a single observer suffering from network partitioning.&lt;/p&gt;
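The multi-observer idea can be sketched as a simple quorum check. This is an illustrative sketch only, not orchestrator's actual implementation; the probe list and `quorum` parameter are invented for the example:

```python
def master_seems_failed(observer_probes, quorum):
    """Declare failure only when at least `quorum` independent observers
    report the master unreachable.

    observer_probes: list of booleans, True meaning "I could not reach
    the master" from one observer's vantage point.
    """
    failed_votes = sum(1 for unreachable in observer_probes if unreachable)
    return failed_votes >= quorum

# A single partitioned observer cannot trigger a failover:
assert not master_seems_failed([True, False, False], quorum=2)
# Agreement among multiple observers can:
assert master_seems_failed([True, True, False], quorum=2)
```

The trade-off remains: a quorum protects against one partitioned observer, but all observers still probe the master from the outside.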

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; uses a holistic approach, utilizing the replication cluster itself. The master is not an isolated entity. It has replicas. These replicas continuously poll the master for incoming changes, copy those changes and replay them. They have their own retry count/interval setup. When &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; looks for a failure scenario, it looks at the master &lt;em&gt;and&lt;/em&gt; at all of its replicas. It knows what replicas to expect because it continuously observes the topology, and has a clear picture of how it looked the moment before failure.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; seeks agreement between itself and the replicas: if &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; cannot reach the master, but all replicas are happily replicating and making progress, there is no failure scenario. But if the master is unreachable to &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; and all replicas say: “Hey! Replication is broken, we cannot reach the master”, our conclusion becomes very powerful: we haven’t just gathered input from multiple hosts. We have identified that the replication cluster is broken &lt;em&gt;de-facto&lt;/em&gt;. The master may be alive, it may be dead, may be network partitioned; it does not matter: the cluster does not receive updates and for all practical purposes does not function. This situation is depicted in the image below:&lt;/p&gt;

&lt;p&gt;&lt;img alt=&quot;failed master topology&quot; src=&quot;/images/orchestrator-github/orchestrator-failed-master.png&quot; /&gt;&lt;/p&gt;
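The holistic check can be expressed as combining orchestrator's own probe with the replicas' replication state. A simplified sketch, not the real code; the `io_thread_running` field name merely stands in for "this replica can still pull changes from the master":

```python
def detect_dead_master(master_reachable, replicas):
    """Holistic failure detection, simplified.

    master_reachable: whether orchestrator itself can reach the master.
    replicas: list of dicts with an 'io_thread_running' flag.
    """
    if master_reachable:
        return False  # orchestrator sees the master; no failure scenario
    if not replicas:
        return False  # no corroborating witnesses; avoid acting alone
    # Failure is declared only if *every* expected replica also reports
    # broken replication from the master: the cluster is broken de-facto.
    return all(not r["io_thread_running"] for r in replicas)

# Network glitch: orchestrator is partitioned, but replicas are fine.
assert not detect_dead_master(False, [{"io_thread_running": True}] * 3)
# True failure: nobody can reach the master.
assert detect_dead_master(False, [{"io_thread_running": False}] * 3)
```

The key design choice: the replicas' verdict, not the direct probe, is what makes the conclusion trustworthy.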

&lt;p&gt;Masters are not the only subject of failure detection: &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; applies similar logic to intermediate masters: replicas which happen to have further replicas of their own.&lt;/p&gt;

&lt;p&gt;Furthermore, &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; also considers more complex cases, such as unreachable replicas or other scenarios where decision making turns fuzzier. In some such cases it is still confident enough to proceed with failover; in others it settles for detection and notification only.&lt;/p&gt;

&lt;p&gt;We observe that &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt;’s detection algorithm is very accurate. We spent a few months in testing its decision making before switching on auto-recovery.&lt;/p&gt;

&lt;h3 id=&quot;failover&quot;&gt;Failover&lt;/h3&gt;

&lt;p&gt;Once the decision to fail over has been made, the next step is to choose where to fail over to. That decision, too, is non-trivial.&lt;/p&gt;

&lt;p&gt;In semi-sync replication environments, which &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; supports, one or more designated replicas are guaranteed to be most up-to-date, which guarantees one or more servers that would be ideal to promote. Enabling semi-sync is on our roadmap; we use asynchronous replication at this time. Some updates made to the master may never make it to any replica, and there is no guarantee as to which replica will get the most recent updates. Choosing the most up-to-date replica means losing the least data. However, in the world of operations not all replicas are created equal: at any given time we may be experimenting with a recent MySQL release that we’re not ready yet to put into production; or we may be transitioning from &lt;code class=&quot;highlighter-rouge&quot;&gt;STATEMENT&lt;/code&gt; based replication to &lt;code class=&quot;highlighter-rouge&quot;&gt;ROW&lt;/code&gt; based; or we may have servers in a remote data center that preferably wouldn’t take writes. Or we may have a designated server with stronger hardware that we’d like to promote no matter what.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; understands all replication rules and picks the replica that makes the most sense to promote, based on a set of rules and on the available servers, their configuration, their physical location and more. Depending on the servers’ configuration, it is able to do a two-step promotion by first healing the topology in whatever setup is easiest, then promoting a designated or otherwise best server as master.&lt;/p&gt;
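Rule-based candidate selection of this kind could be sketched as follows. This is a toy illustration, not orchestrator's promotion logic; the attribute names (`experimental_version`, `remote_dc`, `designated`, `exec_pos`) are invented:

```python
def pick_promotion_candidate(replicas):
    """Choose a replica to promote: filter out ineligible servers, then
    prefer designated candidates, breaking ties by how up-to-date they are."""
    eligible = [
        r for r in replicas
        if not r["experimental_version"]  # e.g. testing a newer MySQL release
        and not r["remote_dc"]            # prefer not to take writes remotely
    ]
    if not eligible:
        return None
    # Designated servers win; otherwise the most recent executed position
    # (least data loss) decides.
    return max(eligible, key=lambda r: (r["designated"], r["exec_pos"]))

replicas = [
    {"name": "a", "experimental_version": False, "remote_dc": False,
     "designated": False, "exec_pos": 120},
    {"name": "b", "experimental_version": False, "remote_dc": False,
     "designated": True, "exec_pos": 100},
    {"name": "c", "experimental_version": True, "remote_dc": False,
     "designated": False, "exec_pos": 130},
]
# "c" is most up-to-date but runs an experimental version; "b" is designated.
assert pick_promotion_candidate(replicas)["name"] == "b"
```

Note the tension the paragraph describes: the designated server wins even though a filtered-out replica holds more recent data, which is why a two-step promotion can be useful.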

&lt;p&gt;We build trust in the failover procedure by continuously testing failovers. We intend to write more on this in a later post.&lt;/p&gt;

&lt;h3 id=&quot;anti-flapping-and-acknowledgements&quot;&gt;Anti-flapping and acknowledgements&lt;/h3&gt;

&lt;p&gt;Flapping is strictly unacceptable. To that effect &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; is configured to only perform one automated failover for any given cluster in a preconfigured time period. Once a failover takes place, the failed cluster is marked as “blocked” from further failovers. This mark is cleared after, say, &lt;code class=&quot;highlighter-rouge&quot;&gt;30&lt;/code&gt; minutes, or until a human says otherwise.&lt;/p&gt;
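The anti-flapping behavior amounts to a per-cluster rate limiter with a human override. A minimal sketch under the assumptions above (a 30-minute block, cleared early by acknowledgement); the class and method names are illustrative, not orchestrator's API:

```python
import time

class FailoverThrottle:
    """Allow at most one automated failover per cluster within a block
    period, unless a human acknowledges the previous one."""

    def __init__(self, block_seconds=30 * 60):
        self.block_seconds = block_seconds
        self.last_failover = {}  # cluster name -> timestamp of last failover

    def may_failover(self, cluster, now=None):
        now = time.time() if now is None else now
        last = self.last_failover.get(cluster)
        return last is None or now - last >= self.block_seconds

    def record_failover(self, cluster, now=None):
        self.last_failover[cluster] = time.time() if now is None else now

    def acknowledge(self, cluster):
        # A human reviewed the failover; clear the block immediately.
        self.last_failover.pop(cluster, None)

t = FailoverThrottle()
t.record_failover("main", now=1000.0)
assert not t.may_failover("main", now=1000.0 + 60)  # still blocked
t.acknowledge("main")
assert t.may_failover("main", now=1000.0 + 61)      # cleared by a human
```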

&lt;p&gt;To clarify, an automated master failover in the middle of the night does not mean stakeholders get to sleep through it. Pages will arrive, even as failover takes place. A human will observe the state, and may or may not acknowledge the failover as justified. Once acknowledged, &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; forgets about that failover and is free to proceed with further failovers on that cluster should the case arise.&lt;/p&gt;

&lt;h3 id=&quot;topology-management&quot;&gt;Topology management&lt;/h3&gt;

&lt;p&gt;There’s more than failovers to &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt;. It allows for simplified topology management and visualization.&lt;/p&gt;

&lt;p&gt;We have multiple clusters of differing sizes that span multiple datacenters (DCs). Consider the following:&lt;/p&gt;

&lt;p&gt;&lt;img alt=&quot;3 datacenter topology&quot; src=&quot;/images/orchestrator-github/orchestrator-3dc.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The different colors indicate different data centers; the above topology spans three DCs. Cross-DC network calls have higher latency and are more expensive than intra-DC calls, so we typically group a DC’s servers under a designated &lt;em&gt;intermediate master&lt;/em&gt;, aka &lt;em&gt;local DC master&lt;/em&gt;, to reduce cross-DC network traffic. In the topology above, &lt;code class=&quot;highlighter-rouge&quot;&gt;instance-64bb&lt;/code&gt; (blue, 2nd from bottom on the right) could replicate from &lt;code class=&quot;highlighter-rouge&quot;&gt;instance-6b44&lt;/code&gt; (blue, bottom, middle) and free up some cross-DC traffic.&lt;/p&gt;

&lt;p&gt;This design leads to more complex topologies: replication trees that go deeper than one or two levels. There are more use cases for such topologies:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Experimenting with a newer version: to test, say, MySQL &lt;code class=&quot;highlighter-rouge&quot;&gt;5.7&lt;/code&gt; we create a subtree of &lt;code class=&quot;highlighter-rouge&quot;&gt;5.7&lt;/code&gt; servers, with one acting as an intermediate master. This allows us to test &lt;code class=&quot;highlighter-rouge&quot;&gt;5.7&lt;/code&gt; replication flow and speed.&lt;/li&gt;
  &lt;li&gt;Migrating from &lt;code class=&quot;highlighter-rouge&quot;&gt;STATEMENT&lt;/code&gt; based replication to &lt;code class=&quot;highlighter-rouge&quot;&gt;ROW&lt;/code&gt; based replication: we again migrate slowly by creating subtrees, adding more and more nodes to those trees until they consume the entire topology.&lt;/li&gt;
  &lt;li&gt;Simplified automation: a newly provisioned host, or a host restored from backup, is set to replicate from the backup server whose data was used to restore the host.&lt;/li&gt;
  &lt;li&gt;Data partitioning is achieved by incubating and splitting out new clusters, originally dangling as sub-clusters then becoming independent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deep nested replication topologies introduce management complexity:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Every intermediate master becomes a point of failure for its nested subtree.&lt;/li&gt;
  &lt;li&gt;Recoveries in mixed-version or mixed-format topologies are subject to cross-version or cross-format replication constraints. Not every server can replicate from every other.&lt;/li&gt;
  &lt;li&gt;Maintenance requires careful refactoring of the topology: you can’t just take down a server to upgrade its hardware; if it serves as a local/intermediate master, taking it offline would break replication for its own replicas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; allows for easy and safe refactoring and management of such complex topologies:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;It can fail over dead intermediate masters, eliminating the “point of failure” problem.&lt;/li&gt;
  &lt;li&gt;Refactoring (moving replicas around the topology) is made easy via GTID or &lt;a href=&quot;https://github.com/github/orchestrator/blob/master/docs/pseudo-gtid.md&quot;&gt;Pseudo-GTID&lt;/a&gt; (an application level injection of sparse GTID-like entries).&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; understands replication rules and will refuse to place, say, a &lt;code class=&quot;highlighter-rouge&quot;&gt;5.6&lt;/code&gt; server below a &lt;code class=&quot;highlighter-rouge&quot;&gt;5.7&lt;/code&gt; server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; also serves as the de-facto topology state/inventory indicator. It complements &lt;code class=&quot;highlighter-rouge&quot;&gt;puppet&lt;/code&gt; or service discovery configuration, which imply the &lt;em&gt;desired&lt;/em&gt; state, by actually observing the &lt;em&gt;existing&lt;/em&gt; state. State is queryable at various levels, and we employ &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; in some of our automation tasks.&lt;/p&gt;
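Automation can query that observed state over the orchestrator service's HTTP API. A minimal sketch; the service address is hypothetical, and while orchestrator documents a cluster-by-alias API endpoint, treat the exact path here as an assumption to verify against your orchestrator version:

```python
import json
from urllib.request import urlopen

# Hypothetical orchestrator service address; adjust to your deployment.
ORCHESTRATOR = "http://orchestrator.example.com:3000"

def cluster_api_url(cluster_alias):
    """Build the API URL for a cluster's observed topology."""
    return f"{ORCHESTRATOR}/api/cluster/alias/{cluster_alias}"

def cluster_instances(cluster_alias):
    """Fetch the list of instances orchestrator currently observes
    for the given cluster, as JSON."""
    with urlopen(cluster_api_url(cluster_alias)) as resp:
        return json.load(resp)

assert cluster_api_url("sample-cluster").endswith(
    "/api/cluster/alias/sample-cluster")
```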

&lt;h3 id=&quot;chatops-integration&quot;&gt;Chatops integration&lt;/h3&gt;

&lt;p&gt;We love our chatops as they make our operations visible and accessible to our greater group of engineers.
While the orchestrator service provides a web interface, we rarely use it: one’s browser is one’s own private command center, with no visibility to others and no history.&lt;/p&gt;

&lt;p&gt;We rely on chatops for most operations. As a quick example of the visibility we get via chatops, let’s examine a cluster:&lt;/p&gt;

&lt;div class=&quot;chat&quot; style=&quot;margin: 30px 0;&quot;&gt;
  &lt;div class=&quot;message self&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22&quot; alt=&quot;shlomi-noach&quot; srcset=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22 1x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=44 2x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=66 3x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;shlomi-noach&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
      .orc cluster sample-cluster
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;message robot&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22&quot; alt=&quot;hubot&quot; srcset=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22 1x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=44 2x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=66 3x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;Hubot&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
&lt;pre&gt;
host                     lag  status  version          mode  format     extra
---                      ---  ---     ---              ---   ---        ---
instance-e854             0s  ok      5.6.26-74.0-log  rw    STATEMENT  &amp;gt;&amp;gt;,P-GTID
+ instance-fadf           0s  ok      5.6.26-74.0-log  ro    STATEMENT  &amp;gt;&amp;gt;,P-GTID
  + instance-9d3d         0s  ok      5.6.31-77.0-log  ro    STATEMENT  &amp;gt;&amp;gt;,P-GTID
  + instance-8125         0s  ok      5.6.31-77.0-log  ro    STATEMENT  &amp;gt;&amp;gt;,P-GTID
+ instance-b982           0s  ok      5.6.26-74.0-log  ro    STATEMENT  &amp;gt;&amp;gt;,P-GTID
+ instance-c5a7           0s  ok      5.6.31-77.0-log  ro    STATEMENT  &amp;gt;&amp;gt;,P-GTID
  + instance-64bb         0s  ok      5.6.31-77.0-log  rw    nobinlog   P-GTID
+ instance-6b44           0s  ok      5.6.31-77.0-log  rw    STATEMENT  &amp;gt;&amp;gt;,P-GTID
  + instance-cac3     14400s  ok      5.6.31-77.0-log  rw    STATEMENT  &amp;gt;&amp;gt;,P-GTID
&lt;/pre&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Say we wanted to upgrade &lt;code class=&quot;highlighter-rouge&quot;&gt;instance-fadf&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;5.6.31-77.0-log&lt;/code&gt;. It has two replicas attached that we don’t want to be affected. We can:&lt;/p&gt;

&lt;div class=&quot;chat&quot; style=&quot;margin: 30px 0;&quot;&gt;
  &lt;div class=&quot;message self&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22&quot; alt=&quot;shlomi-noach&quot; srcset=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22 1x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=44 2x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=66 3x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;shlomi-noach&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
      .orc relocate-replicas instance-fadf below instance-c5a7
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;message robot&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22&quot; alt=&quot;hubot&quot; srcset=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22 1x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=44 2x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=66 3x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;Hubot&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
&lt;pre&gt;
instance-9d3d
instance-8125
&lt;/pre&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;To the following effect:&lt;/p&gt;

&lt;div class=&quot;chat&quot; style=&quot;margin: 30px 0;&quot;&gt;
  &lt;div class=&quot;message self&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22&quot; alt=&quot;shlomi-noach&quot; srcset=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22 1x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=44 2x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=66 3x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;shlomi-noach&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
      .orc cluster sample-cluster
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;message robot&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22&quot; alt=&quot;hubot&quot; srcset=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22 1x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=44 2x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=66 3x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;Hubot&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
&lt;pre&gt;
host                     lag  status  version          mode  format     extra
---                      ---  ---     ---              ---   ---        ---
instance-e854             0s  ok      5.6.26-74.0-log  rw    STATEMENT  &amp;gt;&amp;gt;,P-GTID
+ instance-fadf           0s  ok      5.6.26-74.0-log  ro    STATEMENT  &amp;gt;&amp;gt;,P-GTID
+ instance-b982           0s  ok      5.6.26-74.0-log  ro    STATEMENT  &amp;gt;&amp;gt;,P-GTID
+ instance-c5a7           0s  ok      5.6.31-77.0-log  ro    STATEMENT  &amp;gt;&amp;gt;,P-GTID
  + instance-9d3d         0s  ok      5.6.31-77.0-log  ro    STATEMENT  &amp;gt;&amp;gt;,P-GTID
  + instance-8125         0s  ok      5.6.31-77.0-log  ro    STATEMENT  &amp;gt;&amp;gt;,P-GTID
  + instance-64bb         0s  ok      5.6.31-77.0-log  rw    nobinlog   P-GTID
+ instance-6b44           0s  ok      5.6.31-77.0-log  rw    STATEMENT  &amp;gt;&amp;gt;,P-GTID
  + instance-cac3     14400s  ok      5.6.31-77.0-log  rw    STATEMENT  &amp;gt;&amp;gt;,P-GTID
&lt;/pre&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The instance is now free to be taken &lt;a href=&quot;http://githubengineering.com/context-aware-mysql-pools-via-haproxy&quot;&gt;out of the pool&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Other actions are available to us via chatops: we can force a failover, acknowledge recoveries, query topology structure, and more. &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; further communicates with us on chat, and notifies us in the event of a failure/recovery.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; also runs as a command-line tool, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; service supports a web API, so it can easily participate in automated tasks.&lt;/p&gt;

&lt;h3 id=&quot;orchestrator--github&quot;&gt;orchestrator @ GitHub&lt;/h3&gt;

&lt;p&gt;GitHub has adopted &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt;, and will continue to improve and maintain it. The &lt;a href=&quot;https://github.com/github/orchestrator&quot;&gt;github repo&lt;/a&gt; will serve as the new upstream and will accept issues and pull requests from the community.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; continues to be free and open source, and is released under the &lt;a href=&quot;https://github.com/github/orchestrator/blob/master/LICENSE&quot;&gt;Apache License 2.0&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Migrating the project to the &lt;a href=&quot;https://github.com/github/orchestrator&quot;&gt;GitHub repo&lt;/a&gt; had the unfortunate result of diverging from the original &lt;a href=&quot;https://github.com/outbrain/orchestrator/&quot;&gt;Outbrain repo&lt;/a&gt;, due to the way import paths are coupled with the repo URI in &lt;code class=&quot;highlighter-rouge&quot;&gt;golang&lt;/code&gt;. The two diverged repositories will not be kept in sync, and we took the opportunity to make some further diverging changes, though we made sure to keep the API &amp;amp; command line spec compatible. We’ll keep an eye on incoming issues in the Outbrain repo.&lt;/p&gt;

&lt;h3 id=&quot;outbrain&quot;&gt;Outbrain&lt;/h3&gt;

&lt;p&gt;It is our pleasure to acknowledge &lt;a href=&quot;http://www.outbrain.com/&quot;&gt;Outbrain&lt;/a&gt; as the original author of &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt;. The project originated at Outbrain as it sought to manage a growing fleet of servers in three data centers. It began as a means to visualize the existing topologies, with minimal support for refactoring, and came at a time when massive hardware upgrades and datacenter changes were taking place. &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; was used as the tool for refactoring and for ensuring topology setups went as planned and without interruption to service, even as servers were being provisioned or retired.&lt;/p&gt;

&lt;p&gt;Later on Pseudo-GTID was introduced to overcome the problems of unreachable/crashing/lagging intermediate masters, and shortly afterwards recoveries came into being. &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; was put to production in very early stages and worked on busy and sensitive systems.&lt;/p&gt;

&lt;p&gt;Outbrain was happy to develop &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; as a public open source project and provided the resources to allow its development, not only to the specific benefits of the company, but also to the wider community. Outbrain authors many more open source projects, which can be found on their GitHub’s &lt;a href=&quot;https://outbrain.github.io/&quot;&gt;Outbrain engineering&lt;/a&gt; page.&lt;/p&gt;

&lt;p&gt;We’d like to thank Outbrain for their contributions to &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt;, as well as for their openness to having us adopt the project.&lt;/p&gt;

&lt;h3 id=&quot;further-acknowledgements&quot;&gt;Further acknowledgements&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; was later developed at &lt;a href=&quot;http://www.booking.com&quot;&gt;Booking.com&lt;/a&gt;, where it was brought in to improve on the existing high availability scheme. &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt;’s flexibility allowed for simpler hardware setups and faster failovers. The project benefited from the very large MySQL deployment Booking.com runs, managing various MySQL vendors, versions and configurations, on clusters ranging from a single master to many hundreds of MySQL servers and Binlog Servers across multiple data centers. Booking.com continuously contributes to &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We’d like to further acknowledge major community contributions made by Google/&lt;a href=&quot;http://vitess.io&quot;&gt;Vitess&lt;/a&gt; (&lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt; is the &lt;a href=&quot;http://vitess.io/user-guide/server-configuration.html#orchestrator&quot;&gt;failover mechanism&lt;/a&gt; used by Vitess), and by &lt;a href=&quot;https://squareup.com/&quot;&gt;Square, Inc&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;related-projects&quot;&gt;Related projects&lt;/h3&gt;

&lt;p&gt;We are working to release a public &lt;code class=&quot;highlighter-rouge&quot;&gt;puppet&lt;/code&gt; module for &lt;code class=&quot;highlighter-rouge&quot;&gt;orchestrator&lt;/code&gt;, and will edit this post once released.&lt;/p&gt;

&lt;p&gt;Chef users, please consider this &lt;a href=&quot;https://github.com/sendgrid-ops/chef-orchestrator&quot;&gt;Chef cookbook&lt;/a&gt; by &lt;a href=&quot;https://github.com/silviabotros&quot;&gt;@silviabotros&lt;/a&gt;.&lt;/p&gt;</content><author><name>{&quot;username&quot;=&gt;&quot;shlomi-noach&quot;, &quot;fullname&quot;=&gt;&quot;Shlomi Noach&quot;, &quot;twitter&quot;=&gt;&quot;ShlomiNoach&quot;, &quot;role&quot;=&gt;&quot;Senior Infrastructure Engineer&quot;, &quot;links&quot;=&gt;[{&quot;name&quot;=&gt;&quot;Website&quot;, &quot;url&quot;=&gt;&quot;http://openark.org&quot;}, {&quot;name&quot;=&gt;&quot;GitHub Profile&quot;, &quot;url&quot;=&gt;&quot;https://github.com/shlomi-noach&quot;}, {&quot;name&quot;=&gt;&quot;Twitter Profile&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/ShlomiNoach&quot;}]}</name></author><summary type="html">GitHub uses MySQL to store its metadata: Issues, Pull Requests, comments, organizations, notifications and so forth. While git repository data does not need MySQL to exist and persist, GitHub’s service does. Authentication, API, and the website itself all require the availability of our MySQL fleet.</summary></entry><entry><title type="html">How we made diff pages three times faster</title><link href="http://githubengineering.com/how-we-made-diff-pages-3x-faster/" rel="alternate" type="text/html" title="How we made diff pages three times faster" /><published>2016-12-06T00:00:00+00:00</published><updated>2016-12-06T00:00:00+00:00</updated><id>http://githubengineering.com/how-we-made-diff-pages-3x-faster</id><content type="html" xml:base="http://githubengineering.com/how-we-made-diff-pages-3x-faster/">&lt;p&gt;We serve a lot of diffs here at GitHub. Because it is computationally
expensive to generate and display a diff, we’ve traditionally had to apply
some very conservative limits on what gets loaded. We knew
we could do better, and we set out to do so.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;total diff runs per hour&quot; src=&quot;/images/progressive-diffs/total-diffs.png&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;historical-approach-and-problems&quot;&gt;Historical approach and problems&lt;/h2&gt;

&lt;p&gt;Before this change, we fetched diffs by asking Git for the diff between two
commit objects. We would then parse the output, checking it against the various
limits we had in place. At the time they were as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Up to 300 files in total.&lt;/li&gt;
  &lt;li&gt;Up to 100KB of diff text per file.&lt;/li&gt;
  &lt;li&gt;Up to 1MB of diff text overall.&lt;/li&gt;
  &lt;li&gt;Up to 3,000 lines of diff text per file.&lt;/li&gt;
  &lt;li&gt;Up to 20,000 lines of diff text overall.&lt;/li&gt;
  &lt;li&gt;An overall RPC timeout of up to eight seconds, though in some places it would
be adjusted to fit within the remaining time allotted to the request.&lt;/li&gt;
&lt;/ul&gt;
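Enforcing limits of this kind while consuming Git's diff output in order can be sketched as follows. An illustrative sketch mirroring the values listed above, not GitHub's actual code; in particular it shows why files appearing later in Git's output could be cut off:

```python
MAX_FILES = 300
MAX_BYTES_PER_FILE = 100 * 1024
MAX_BYTES_TOTAL = 1024 * 1024
MAX_LINES_PER_FILE = 3_000
MAX_LINES_TOTAL = 20_000

def truncate_diff(files):
    """files: list of (name, patch_text) in the order Git emitted them.
    Returns the files that fit within the limits, plus a flag saying
    whether the diff was truncated."""
    kept, total_bytes, total_lines = [], 0, 0
    for name, patch in files:
        size, lines = len(patch), patch.count("\n")
        if (len(kept) >= MAX_FILES
                or size > MAX_BYTES_PER_FILE
                or lines > MAX_LINES_PER_FILE
                or total_bytes + size > MAX_BYTES_TOTAL
                or total_lines + lines > MAX_LINES_TOTAL):
            return kept, True  # truncated: later files are never shown
        kept.append((name, patch))
        total_bytes += size
        total_lines += lines
    return kept, False

small = [("a.txt", "+x\n"), ("b.txt", "+y\n")]
assert truncate_diff(small) == (small, False)
# One oversized file early in the output hides everything after it:
big = [("huge.bin", "x" * (200 * 1024))] + small
kept, truncated = truncate_diff(big)
assert truncated and kept == []
```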

&lt;p&gt;These limits were in place both to prevent excessive load on the file servers
and to keep the browser’s DOM from growing too large and making the web
page less responsive.&lt;/p&gt;

&lt;p&gt;In practice, our limits did a pretty good job of protecting our servers and
users’ web browsers from being overloaded. But because these limits were applied
in the order Git handed us back the diff text, it was possible for a diff to be
truncated before we reached the interesting parts. Unfortunately, users had to
fall back to command-line tools to see their changes in these cases.&lt;/p&gt;

&lt;p&gt;Finally, timeouts were happening far more frequently than we liked. Regardless
of the size of the requested diff, users shouldn’t have to wait up to eight
seconds for a response, only to occasionally be met with an error message.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;Diff page timeouts before progressive diff&quot; src=&quot;/images/progressive-diffs/previous-timeouts.png&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;our-goals&quot;&gt;Our Goals&lt;/h2&gt;

&lt;p&gt;Our main goal was to improve the user experience around (re)viewing
diffs on GitHub:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Allow users to (re)view the changes that matter, rather than just whatever
appears before the diff is truncated.&lt;/li&gt;
  &lt;li&gt;Reduce request timeouts due to very large diffs.&lt;/li&gt;
  &lt;li&gt;Pave the way for previously inaccessible optimizations (e.g. avoid loading
suppressed diffs).&lt;/li&gt;
  &lt;li&gt;Reduce unnecessary load on &lt;a href=&quot;http://githubengineering.com/introducing-dgit/&quot;&gt;GitHub’s storage infrastructure&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Improve accuracy of diff statistics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;a-new-approach&quot;&gt;A new approach&lt;/h2&gt;

&lt;p&gt;To achieve these goals, we needed a better approach to handling large diffs.
We wanted a solution that would give us a high-level overview of all changes in
a diff, and then load the patch texts for the individual changed files
“progressively”. These discrete sections could later be assembled by the user’s
browser.&lt;/p&gt;

&lt;p&gt;But to achieve this without disrupting the user experience, our new solution
also needed to be flexible enough to load and display diffs identically to the
existing production behavior. We wanted to verify accuracy and monitor any
performance impact by running the old and new diff loading strategies in
production, side by side, before switching to the new progressive loading
strategy.&lt;/p&gt;

&lt;p&gt;Lucky for us, Git provides an excellent plumbing command called &lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;diff-table-of-contents-with-git-diff-tree&quot;&gt;Diff “table of contents” with &lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree&lt;/code&gt; is a low-level (plumbing) &lt;code class=&quot;highlighter-rouge&quot;&gt;git&lt;/code&gt; command that can be used to
compare the contents of two tree objects and output the comparison result in
different ways.&lt;/p&gt;

&lt;p&gt;The default output format is &lt;code class=&quot;highlighter-rouge&quot;&gt;--raw&lt;/code&gt;, which prints a list of changed files:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; git-diff-tree --raw -r --find-renames HEAD~ HEAD
:100644 100644 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 441624ae5d2a2cd192aab3ad25d3772e428d4926 M	fileA
:100644 100644 5716ca5987cbf97d6bb54920bea6adde242d87e6 4ea306ce50a800061eaa6cd1654968900911e891 M	fileB
:100644 100644 7c4ede99d4fefc414a3f7d21ecaba1cbad40076b fb3f68e3ca24b2daf1a0575d08cd6fe993c3f287 M	fileC
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Using &lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree --raw&lt;/code&gt; we could determine what changed at a high level very
quickly, without the overhead of generating patch text. We could then later
paginate through this list of changes, or “deltas”, and load the exact patch
data for each “page” by specifying a subset of the deltas’ paths to
&lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree --patch&lt;/code&gt;.&lt;/p&gt;
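&lt;p&gt;A minimal sketch of this two-step flow, assuming only stock &lt;code class=&quot;highlighter-rouge&quot;&gt;git&lt;/code&gt; and a throwaway repository (the file names are invented; this is an illustration, not GitHub’s production code):&lt;/p&gt;

```python
import os
import subprocess
import tempfile

def run(args, cwd):
    """Run a git command and return its stdout as text."""
    return subprocess.run(args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

# Throwaway repository with one commit that touches two files.
repo = tempfile.mkdtemp()
ident = ["-c", "user.email=e@example.com", "-c", "user.name=example"]
run(["git", "init", "-q"], repo)
for name in ("fileA", "fileB"):
    with open(os.path.join(repo, name), "w") as f:
        f.write("one\n")
run(["git", "add", "-A"], repo)
run(["git", *ident, "commit", "-qm", "base"], repo)
for name in ("fileA", "fileB"):
    with open(os.path.join(repo, name), "w") as f:
        f.write("two\n")
run(["git", "add", "-A"], repo)
run(["git", *ident, "commit", "-qm", "change"], repo)

# Step 1: the cheap "table of contents" -- no patch text is generated.
# (The last tab-separated field of each --raw line is the path; renamed
# entries carry two paths, which this sketch ignores.)
raw = run(["git", "diff-tree", "--raw", "-r", "--find-renames",
           "HEAD~", "HEAD"], repo)
paths = [line.split("\t")[-1] for line in raw.splitlines()]

# Step 2: load patch text for only the first "page" of deltas.
page = paths[:1]
patch = run(["git", "diff-tree", "--patch", "-r", "HEAD~", "HEAD",
             "--", *page], repo)
print(patch)
```

&lt;p&gt;Feeding all of the returned paths into the second call reproduces a plain &lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree --patch&lt;/code&gt;, which is how the old and new strategies could be compared before paginating.&lt;/p&gt;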

&lt;p&gt;To measure the performance overhead of calling two git commands instead of
one, and to ensure that we wouldn’t cause any regressions in the returned data,
we initially focused on generating the same output as a plain call to
&lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree --patch&lt;/code&gt;, by calling &lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree --raw&lt;/code&gt; and then
feeding all returned paths back into &lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree --patch&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We started a &lt;a href=&quot;https://github.com/github/scientist&quot;&gt;Scientist&lt;/a&gt;
experiment which ran both algorithms in parallel, comparing accuracy and
timing. This gave us detailed information on cases where results were not as
expected, and allowed us to keep an eye on performance.&lt;/p&gt;

&lt;p&gt;As expected, our new algorithm, which was replacing code that hadn’t been
materially refactored in years, had many mismatches, and performance was
worse than before.&lt;/p&gt;

&lt;p&gt;Most of the issues that we found were simply unexpected behaviors of the old
code under certain conditions. We meticulously emulated these corner cases,
until we were left only with mismatches related to rename detection in
&lt;code class=&quot;highlighter-rouge&quot;&gt;git diff&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;fetching-diff-text-with-git-diff-pairs&quot;&gt;Fetching diff text with &lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-pairs&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;Loading the patch text from a set of deltas sounds like it should have been a
pretty straightforward operation. We had the list of paths that changed, and
just needed to look up the patch texts for these paths. What could possibly go
wrong?&lt;/p&gt;

&lt;p&gt;In our first attempt we loaded the diffs by passing the first 300 paths from our
deltas to &lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree --patch&lt;/code&gt;. This emulated our existing behavior, and
we unexpectedly ran into rare mismatches. Curiously, these mismatches were all
related to renames, but only when multiple files containing the same or very
similar contents were renamed in the same diff.&lt;/p&gt;

&lt;p&gt;This happened because rename detection in git is based on the contents of the
tree that it is operating on; by looking at only a subset of the original
tree, git failed to match renames as expected.&lt;/p&gt;

&lt;p&gt;To preserve the rename associations from the initial &lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree --raw&lt;/code&gt; run,
&lt;a href=&quot;http://github.com/peff&quot;&gt;@peff&lt;/a&gt; added a &lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-pairs&lt;/code&gt; command to our fork of
Git. Given a set of blob object IDs (taken from the deltas), it returns
the corresponding diff text, which was exactly what we needed.&lt;/p&gt;

&lt;p&gt;At a high level, the process for generating a diff in Git is as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Do a tree-wide diff, generating modified pairs, or added/deleted paths (which
are just considered pairs with a null before/after state).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Run various algorithms on the whole set of pairs, like rename detection. This
is just linking up adds and deletes of similar content.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For each pair, output it in the appropriate format (we’re interested in
&lt;code class=&quot;highlighter-rouge&quot;&gt;--patch&lt;/code&gt;, obviously).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-pairs&lt;/code&gt; lets you take the output from step 2 and feed individual
pairs into step 3.&lt;/p&gt;

&lt;p&gt;With this new function in place, we were finally able to get our performance and
accuracy to a point where we could transparently switch to this new diff method
without negative user impact.&lt;/p&gt;

&lt;p&gt;If you’re interested in viewing or contributing to the source for
&lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-pairs&lt;/code&gt; we submitted it upstream &lt;a href=&quot;http://public-inbox.org/git/20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;change-statistics-with-git-diff-tree---numstat---shortstat&quot;&gt;Change statistics with &lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree --numstat --shortstat&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;GitHub displays line change statistics for both the entire diff and each
delta. Generating the line change statistics for a diff can be a very
costly operation, depending on the size and contents of the diff. However, it is
very useful to have summary statistics on a diff at a glance so that the user
can have a good overview of the changes involved.&lt;/p&gt;

&lt;p&gt;Historically we counted the changes in the patch text as we processed it so that
only one diff operation would need to run to display a diff. This operation and
its results were cached so performance was optimal. However, in the case of
truncated diffs there were changes that were never seen and therefore not
included in these statistics. This was done to give us better performance at the
cost of slightly inaccurate total counts for large diffs.&lt;/p&gt;

&lt;p&gt;With our move to progressive diffs, it became increasingly likely that we
would only ever be looking at part of the diff at any one time, so the counts
would be inaccurate most of the time rather than rarely.&lt;/p&gt;

&lt;p&gt;To address this problem we decided to collect the statistics for the entire diff
using &lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree --numstat --shortstat&lt;/code&gt;. This would not only solve the
problem of dealing with partial diffs, but also make the counts accurate in
cases where they would have been incorrect before.&lt;/p&gt;

&lt;p&gt;The downside of this change is that Git was now potentially running the entire
diff twice. We determined this was acceptable, however, as the remaining diff
processing for presentation was far more resource intensive. Also, with
progressive diffs, it was entirely probable that many larger diffs would never
need the second pass, since those deltas might never be loaded anyway.&lt;/p&gt;

&lt;p&gt;Due to the nature of how &lt;code class=&quot;highlighter-rouge&quot;&gt;git-diff-tree&lt;/code&gt; works, we were even able to combine the
call for these statistics with the call for deltas into a single command, to
further improve performance. This is because Git already needed to perform a
full diff in order to determine what the statistics were, so having it also
print the tree diff information is essentially free.&lt;/p&gt;
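&lt;p&gt;As a minimal sketch of that combined invocation (using a throwaway repository; everything here is illustrative rather than GitHub’s code), a single &lt;code class=&quot;highlighter-rouge&quot;&gt;git diff-tree&lt;/code&gt; call can emit all three sections at once:&lt;/p&gt;

```python
import os
import subprocess
import tempfile

def run(args, cwd):
    """Run a git command and return its stdout as text."""
    return subprocess.run(args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

# Throwaway repository: one file, one change (2 lines added, 1 removed).
repo = tempfile.mkdtemp()
ident = ["-c", "user.email=e@example.com", "-c", "user.name=example"]
run(["git", "init", "-q"], repo)
with open(os.path.join(repo, "fileA"), "w") as f:
    f.write("a\nb\n")
run(["git", "add", "-A"], repo)
run(["git", *ident, "commit", "-qm", "base"], repo)
with open(os.path.join(repo, "fileA"), "w") as f:
    f.write("a\nc\nd\n")
run(["git", "add", "-A"], repo)
run(["git", *ident, "commit", "-qm", "change"], repo)

# One tree traversal, three output sections: the --raw delta lines,
# the --numstat per-file counts, and the --shortstat summary line.
combined = run(["git", "diff-tree", "-r", "--raw", "--numstat",
                "--shortstat", "HEAD~", "HEAD"], repo)
print(combined)
```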

&lt;h3 id=&quot;patches-in-batches-a-whole-new-diff&quot;&gt;Patches in batches: a whole new diff&lt;/h3&gt;

&lt;p&gt;For the initial request of a page containing a diff, we first fetched the deltas
along with the diff statistics. Next we fetched as much diff text as we could,
but with significantly reduced limits compared to before.&lt;/p&gt;

&lt;p&gt;To determine optimal limits, we turned to some of our copious internal
metrics. We wanted results as quickly as possible, but we also wanted a solution
which would display the full diff in “most” cases. Some of the information
our metrics revealed was:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;81% of viewed diffs contain changes to fewer than ten files.&lt;/li&gt;
  &lt;li&gt;52% of viewed diffs contain only changes to one or two files.&lt;/li&gt;
  &lt;li&gt;80% of viewed diffs have fewer than 20KB of patch text.&lt;/li&gt;
  &lt;li&gt;90% of viewed diffs have fewer than 1000 lines of patch text.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From these, it was clear a great number of diffs only involved a handful of
changes. If we set our new limits with these metrics in mind, we could continue
to be very fast in most cases while significantly improving performance in
previously slow or inaccessible diffs.&lt;/p&gt;

&lt;p&gt;In the end, we settled on the following for the initial request for a diff page:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Up to 400 lines of diff text.&lt;/li&gt;
  &lt;li&gt;Up to 20KB of diff text.&lt;/li&gt;
  &lt;li&gt;A request cycle dependent timeout.&lt;/li&gt;
  &lt;li&gt;A maximum individual patch size of 400 lines or 20KB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allowed the initial request for a large diff to be &lt;em&gt;much&lt;/em&gt; faster, and the
rest of the diff to automatically load after the first batch of patches was
already rendered.&lt;/p&gt;

&lt;p&gt;Once one of the limits on patch text is reached during asynchronous batch
loading, we simply render the remaining deltas without their diff text, along
with a “load diff” button to retrieve each patch as needed.&lt;/p&gt;

&lt;p&gt;Overall, the effective limits we enforced for the &lt;em&gt;entire&lt;/em&gt; diff became:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Up to 3,000 files.&lt;/li&gt;
  &lt;li&gt;Up to 60,000,000 lines (not loaded automatically).&lt;/li&gt;
  &lt;li&gt;Up to 3GB of diff text (also not loaded automatically).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these changes, you got more of the diff you needed in less time than ever
before. Of course, viewing a 60,000,000 line diff would require the user to
press the “load diff” button more than a couple thousand times.&lt;/p&gt;

&lt;p&gt;The benefits of this approach were a clear win: the number of diff timeouts
dropped almost immediately.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;Diff page timeouts after progressive diff&quot; src=&quot;/images/progressive-diffs/timeouts-after.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Additionally, the high-percentile performance of our main diff pages improved
by nearly 3x!&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;compare page performance after progressive diff&quot; src=&quot;/images/progressive-diffs/compare-view-after.png&quot; /&gt;
&lt;/div&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;pull request files tab performance after progressive diff&quot; src=&quot;/images/progressive-diffs/pulls-view-after.png&quot; /&gt;
&lt;/div&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;commit view performance after progressive diff&quot; src=&quot;/images/progressive-diffs/commit-view-after.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Our diff pages were traditionally among our worst performing, so the
performance win was noticeable even on our high-percentile graph for overall
request performance &lt;em&gt;across the entire site&lt;/em&gt;, shaving around 3.5s off
the 99.9th percentile:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;overall high percentile performance&quot; src=&quot;/images/progressive-diffs/overall-after.png&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;looking-to-the-future&quot;&gt;Looking to the future&lt;/h2&gt;

&lt;p&gt;This new approach opens the door to new types of optimizations and interface
ideas that weren’t possible before. We’ll be continuing to improve how we fetch
and render diffs, making them more useful and responsive.&lt;/p&gt;</content><author><name>{&quot;username&quot;=&gt;&quot;brianmario&quot;, &quot;fullname&quot;=&gt;&quot;Brian Lopez&quot;, &quot;role&quot;=&gt;&quot;Application Engineering Manager&quot;, &quot;twitter&quot;=&gt;&quot;brianmario&quot;, &quot;links&quot;=&gt;[{&quot;name&quot;=&gt;&quot;GitHub Profile&quot;, &quot;url&quot;=&gt;&quot;https://github.com/brianmario&quot;}, {&quot;name&quot;=&gt;&quot;Twitter Profile&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/brianmario&quot;}]}</name></author><summary type="html">We serve a lot of diffs here at GitHub. Because it is computationally
expensive to generate and display a diff, we’ve traditionally had to apply
some very conservative limits on what gets loaded. We knew
we could do better, and we set out to do so.</summary></entry><entry><title type="html">GLB part 2: HAProxy zero-downtime, zero-delay reloads with multibinder</title><link href="http://githubengineering.com/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/" rel="alternate" type="text/html" title="GLB part 2: HAProxy zero-downtime, zero-delay reloads with multibinder" /><published>2016-12-01T00:00:00+00:00</published><updated>2016-12-01T00:00:00+00:00</updated><id>http://githubengineering.com/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder</id><content type="html" xml:base="http://githubengineering.com/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/">&lt;p&gt;Recently we &lt;a href=&quot;http://githubengineering.com/introducing-glb/&quot;&gt;introduced GLB&lt;/a&gt;, the GitHub Load Balancer that powers GitHub.com. The GLB proxy tier, which handles TCP connection and TLS termination, is powered by &lt;a href=&quot;http://www.haproxy.org/&quot;&gt;HAProxy&lt;/a&gt;, a reliable and high-performance TCP and HTTP proxy daemon. As part of the design of GLB, we set out to solve a few of the common issues found when using HAProxy at scale.&lt;/p&gt;

&lt;p&gt;Prior to GLB, each host ran a single monolithic instance of HAProxy for all our public services, with frontends for each external IP set, and backends for each backing service. With the number of services we run, this became unwieldy: our configuration was over one thousand lines long with many interdependent ACLs and no modularization. Migrating to GLB, we decided to split the configuration per-service and support running multiple isolated load balancer instances on a single machine. Additionally, we wanted to be able to update a single HAProxy configuration easily, without any downtime, additional latency on connections, or disruption to any other HAProxy instance on the host. Today we are releasing our solution to this problem, &lt;a href=&quot;https://github.com/github/multibinder&quot;&gt;multibinder&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;haproxy-almost-safe-reloads&quot;&gt;HAProxy almost-safe reloads&lt;/h2&gt;

&lt;p&gt;HAProxy uses the SO_REUSEPORT socket option, which allows multiple processes to create LISTEN sockets on the same IP/port combination. The Linux kernel then balances new connections between all available LISTEN sockets. In this diagram, we see the initial stage of an HAProxy reload starting with a single process (left) and then causing a second process to start (right) which binds to the same IP and port, but with a different socket:&lt;/p&gt;

&lt;div style=&quot;text-align:center; padding: 10px 0px;&quot;&gt;
&lt;img alt=&quot;Forking a second HAProxy by default&quot; src=&quot;/images/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/1-fork.png&quot; /&gt;
&lt;/div&gt;
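&lt;p&gt;The setup in this diagram is easy to reproduce. This sketch (illustrative only, not GLB code) binds two independent LISTEN sockets to the same port the way two HAProxy processes do during a reload; note that &lt;code class=&quot;highlighter-rouge&quot;&gt;SO_REUSEPORT&lt;/code&gt; is Linux-specific:&lt;/p&gt;

```python
import socket

def listener(port):
    """Create an independent LISTEN socket on 127.0.0.1:port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)  # Linux-specific
    s.bind(("127.0.0.1", port))
    s.listen(16)
    return s

old = listener(0)        # port 0: let the kernel pick a free port
port = old.getsockname()[1]
new = listener(port)     # second bind to the SAME port succeeds

# The kernel now balances new connections between `old` and `new`; a
# connection still queued on `old` when it closes is discarded -- the
# window described above.
client = socket.create_connection(("127.0.0.1", port))
```

&lt;p&gt;Without &lt;code class=&quot;highlighter-rouge&quot;&gt;SO_REUSEPORT&lt;/code&gt;, the second bind would fail with EADDRINUSE.&lt;/p&gt;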

&lt;p&gt;This works great so far, until the original process terminates. HAProxy sends a signal to the original process stating that the new process is now &lt;code class=&quot;highlighter-rouge&quot;&gt;accept()&lt;/code&gt;ing and handling connections (left), which causes it to stop accepting new connections and close its own socket before eventually exiting once all connections complete (right):&lt;/p&gt;

&lt;div style=&quot;text-align:center; padding: 10px 0px;&quot;&gt;
&lt;img alt=&quot;Lost connections on termination&quot; src=&quot;/images/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/2-lost-conns.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Unfortunately there’s a small period between when this process last calls &lt;code class=&quot;highlighter-rouge&quot;&gt;accept()&lt;/code&gt; and when it calls &lt;code class=&quot;highlighter-rouge&quot;&gt;close()&lt;/code&gt; where the kernel will still route some new connections to the original socket. The code then blindly continues to close the socket, and all connections that were queued up in that LISTEN socket get discarded (because &lt;code class=&quot;highlighter-rouge&quot;&gt;accept()&lt;/code&gt; is never called for them):&lt;/p&gt;

&lt;div style=&quot;text-align:center; padding: 10px 0px;&quot;&gt;
&lt;img alt=&quot;Dropped connections between accept() and close()&quot; src=&quot;/images/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/3-accept-close.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;For small-scale sites, the chance of a new connection arriving in the few microseconds between these calls is very low. Unfortunately, at the scale we run HAProxy, a customer-impacting number of connections would hit this issue each and every time we reload HAProxy. Previously we used the official solution offered by HAProxy: dropping SYN packets during this small window, causing the client to retry the SYN packet shortly afterwards. Other &lt;a href=&quot;https://engineeringblog.yelp.com/2015/04/true-zero-downtime-haproxy-reloads.html&quot;&gt;potential solutions&lt;/a&gt; to the same problem include using &lt;code class=&quot;highlighter-rouge&quot;&gt;tc qdisc&lt;/code&gt; to stall the SYN packets as they come in, and then un-stall the queue once the reload is complete. During development of GLB, we weren’t satisfied with either solution and sought one with no queueing delays that instead shares the same LISTEN socket.&lt;/p&gt;

&lt;h2 id=&quot;supporting-zero-downtime-zero-delay-reloads&quot;&gt;Supporting zero-downtime, zero-delay reloads&lt;/h2&gt;

&lt;p&gt;The way other services typically support zero-downtime reloads is to share a LISTEN socket, usually by having a parent process that holds the socket open and &lt;code class=&quot;highlighter-rouge&quot;&gt;fork()&lt;/code&gt;s the service when it needs to reload, leaving the socket open for the new process to consume. This creates a slightly different situation, where the kernel has a single LISTEN socket and clients are queued for &lt;code class=&quot;highlighter-rouge&quot;&gt;accept()&lt;/code&gt; by either process. The file descriptors in each process may be different, but they will point to the same in-kernel socket structure.&lt;/p&gt;

&lt;p&gt;In this scenario, a new process would be started that inherits the same LISTEN socket (left), and when the original pid stops calling &lt;code class=&quot;highlighter-rouge&quot;&gt;accept()&lt;/code&gt;, connections remain queued for the new process to handle, because the kernel LISTEN socket and queue are shared (right):&lt;/p&gt;

&lt;div style=&quot;text-align:center; padding: 10px 0px;&quot;&gt;
&lt;img alt=&quot;Ideal socket sharing method&quot; src=&quot;/images/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/4-share-socket.png&quot; /&gt;
&lt;/div&gt;
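&lt;p&gt;This shared-queue behavior can be sketched in a few lines (an illustration of the classic pattern, not multibinder itself):&lt;/p&gt;

```python
import os
import socket

# Parent creates the LISTEN socket, then fork()s. Both processes hold
# descriptors for the SAME in-kernel socket and accept queue.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(16)
port = listener.getsockname()[1]

pid = os.fork()
if pid == 0:
    # Child ("new process"): serve one connection from the shared queue.
    conn, _ = listener.accept()
    conn.sendall(b"hello from child")
    conn.close()
    os._exit(0)

# Parent ("old process"): never calls accept() again. The connection
# below is still served, because it queues on the shared kernel socket
# and the child accept()s it.
client = socket.create_connection(("127.0.0.1", port))
data = client.recv(32)
os.waitpid(pid, 0)
```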

&lt;p&gt;Unfortunately, HAProxy doesn’t support this method directly. We considered patching HAProxy to add built-in support but found that the architecture of HAProxy favours process isolation and non-dynamic configuration, making it a non-trivial architectural change. Instead, we created &lt;a href=&quot;https://github.com/github/multibinder&quot;&gt;multibinder&lt;/a&gt; to solve this problem generically for any daemon that needs zero-downtime reload capabilities, and integrated it with HAProxy by using a few tricks with existing HAProxy configuration directives to get the same result.&lt;/p&gt;

&lt;p&gt;Multibinder is similar to other file-descriptor sharing services such as &lt;a href=&quot;https://github.com/stripe/einhorn&quot;&gt;einhorn&lt;/a&gt;, except that it runs as an isolated service and process tree on the system, managed by your usual process manager. The actual service, in this case HAProxy, runs separately as another service, rather than as a child process. When HAProxy is started, a small wrapper script calls out to multibinder and requests the existing LISTEN socket to be sent using &lt;a href=&quot;http://www.masterraghu.com/subjects/np/introduction/unix_network_programming_v1.3/ch14lev1sec6.html&quot;&gt;Ancillary Data&lt;/a&gt; over a UNIX domain socket. The flow looks something like the following:&lt;/p&gt;

&lt;div style=&quot;text-align:center; padding: 10px 0px;&quot;&gt;
&lt;img alt=&quot;Multibinder reload flow&quot; src=&quot;/images/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/5-ancillary.png&quot; /&gt;
&lt;/div&gt;
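&lt;p&gt;The handoff can be sketched with the standard library (an illustration of the SCM_RIGHTS mechanism, not multibinder’s own code; &lt;code class=&quot;highlighter-rouge&quot;&gt;send_fds&lt;/code&gt;/&lt;code class=&quot;highlighter-rouge&quot;&gt;recv_fds&lt;/code&gt; require Python 3.9+):&lt;/p&gt;

```python
import socket

# The LISTEN socket that "multibinder" would hold open across reloads.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(16)

# binder plays multibinder; wrapper plays the HAProxy wrapper script.
binder, wrapper = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# send_fds/recv_fds wrap sendmsg/recvmsg with SCM_RIGHTS ancillary data.
socket.send_fds(binder, [b"take this listener"], [listener.fileno()])
msg, fds, flags, addr = socket.recv_fds(wrapper, 1024, 1)

# The received descriptor refers to the SAME in-kernel LISTEN socket;
# a process accept()ing on it shares the listener's queue.
inherited = socket.socket(fileno=fds[0])
print(inherited.getsockname())
```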

&lt;p&gt;Once the socket is provided to the HAProxy wrapper, it leaves the LISTEN socket in the file descriptor table and writes out the HAProxy configuration file from an ERB template, injecting the file descriptors using &lt;a href=&quot;http://cbonte.github.io/haproxy-dconv/1.6/configuration.html#bind&quot;&gt;file descriptor binds&lt;/a&gt; like &lt;code class=&quot;highlighter-rouge&quot;&gt;fd@N&lt;/code&gt; (where N is the file descriptor received from multibinder). It then calls &lt;code class=&quot;highlighter-rouge&quot;&gt;exec()&lt;/code&gt; to launch HAProxy, which uses the provided file descriptor rather than creating a new socket, thus inheriting the same LISTEN socket. From here, we get the ideal setup where the original HAProxy process can stop calling &lt;code class=&quot;highlighter-rouge&quot;&gt;accept()&lt;/code&gt; and connections simply queue up for the new process to handle.&lt;/p&gt;

&lt;div style=&quot;text-align:center; padding: 10px 0px;&quot;&gt;
&lt;img alt=&quot;Multibinder LISTEN socket sharing diagram&quot; src=&quot;/images/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/6-success.png&quot; /&gt;
&lt;/div&gt;
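&lt;p&gt;For illustration, a rendered configuration might contain a bind line like the following (the frontend and backend names are invented, and &lt;code class=&quot;highlighter-rouge&quot;&gt;fd@3&lt;/code&gt; assumes multibinder handed the wrapper file descriptor 3):&lt;/p&gt;

```
# Rendered from the wrapper's ERB template. Instead of binding an ip:port,
# HAProxy accept()s on the inherited file descriptor.
frontend www
    bind fd@3
    default_backend app

backend app
    server app1 10.0.0.1:8080 check
```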

&lt;h2 id=&quot;example--multiple-instances&quot;&gt;Example &amp;amp; multiple instances&lt;/h2&gt;

&lt;p&gt;Along with the release of multibinder, we’re also providing examples of &lt;a href=&quot;https://github.com/github/multibinder/tree/master/haproxy&quot;&gt;running multiple HAProxy instances with multibinder&lt;/a&gt; leveraging systemd service templates. Following these instructions you can launch a set of HAProxy servers using separate configuration files, each using the same system-wide multibinder instance to request their binds and having true zero-downtime, zero-delay reloads.&lt;/p&gt;</content><author><name>{&quot;username&quot;=&gt;&quot;joewilliams&quot;, &quot;fullname&quot;=&gt;&quot;Joe Williams&quot;, &quot;role&quot;=&gt;&quot;Senior Infrastructure Engineer&quot;, &quot;twitter&quot;=&gt;&quot;williamsjoe&quot;, &quot;links&quot;=&gt;[{&quot;name&quot;=&gt;&quot;GitHub Profile&quot;, &quot;url&quot;=&gt;&quot;https://github.com/joewilliams&quot;}, {&quot;name&quot;=&gt;&quot;Twitter Profile&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/williamsjoe&quot;}, {&quot;name&quot;=&gt;&quot;Blog&quot;, &quot;url&quot;=&gt;&quot;http://joeandmotorboat.com/&quot;}]}</name></author><summary type="html">Recently we introduced GLB, the GitHub Load Balancer that powers GitHub.com. The GLB proxy tier, which handles TCP connection and TLS termination is powered by HAProxy, a reliable and high performance TCP and HTTP proxy daemon. 
As part of the design of GLB, we set out to solve a few of the common issues found when using HAProxy at scale.</summary></entry><entry><title type="html">octocatalog-diff: GitHub’s Puppet development and testing tool</title><link href="http://githubengineering.com/octocatalog-diff-github-s-puppet-development-and-testing-tool/" rel="alternate" type="text/html" title="octocatalog-diff: GitHub's Puppet development and testing tool" /><published>2016-10-20T00:00:00+00:00</published><updated>2016-10-20T00:00:00+00:00</updated><id>http://githubengineering.com/octocatalog-diff-github-s-puppet-development-and-testing-tool</id><content type="html" xml:base="http://githubengineering.com/octocatalog-diff-github-s-puppet-development-and-testing-tool/">&lt;p&gt;Today we are announcing the open source release of &lt;a href=&quot;http://github.com/github/octocatalog-diff&quot;&gt;octocatalog-diff&lt;/a&gt;: GitHub’s Puppet development and testing tool.&lt;/p&gt;

&lt;p&gt;GitHub uses &lt;a href=&quot;https://puppet.com/&quot;&gt;Puppet&lt;/a&gt; to configure the infrastructure that powers GitHub.com, composed of hundreds of roles deployed on thousands of nodes. Each change to Puppet code must be validated to ensure not only that it serves the intended purpose for the role at hand, but also that it avoids causing unexpected side effects on other roles. GitHub employs automated Continuous Integration testing and manual deployment testing for Puppet code changes, but it can be time-consuming to complete the manual deployment testing across hundreds of roles.&lt;/p&gt;

&lt;p&gt;Recently, GitHub has been using an internally-developed tool called &lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; to help reduce the time required for these testing cycles. With this tool, developers are able to preview the effects of their change across all roles via a distributed “catalog difference” test that takes less than three minutes to run. Because of reduced testing cycles and increased confidence in their deployments, developers can iterate much faster on their Puppet code changes.&lt;/p&gt;

&lt;p&gt;Before demonstrating &lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt;, let’s address the existing solutions and the reasoning for creating a new tool.&lt;/p&gt;

&lt;h3 id=&quot;existing-landscape-of-puppet-testing&quot;&gt;Existing landscape of Puppet testing&lt;/h3&gt;

&lt;p&gt;There are three main strategies for Puppet code testing in wide use, and GitHub uses all of them:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Deployment testing – actually running the Puppet agent (possibly with &lt;code class=&quot;highlighter-rouge&quot;&gt;--noop&lt;/code&gt; to preview actions without actually making changes) allows the developer to review log files or examine the system to see if the results are as intended.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Automated testing – this may include unit tests with &lt;a href=&quot;http://rspec-puppet.com/&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;rspec-puppet&lt;/code&gt;&lt;/a&gt;, acceptance tests with &lt;a href=&quot;https://github.com/puppetlabs/beaker&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;beaker&lt;/code&gt;&lt;/a&gt;, syntax checking &lt;code class=&quot;highlighter-rouge&quot;&gt;puppet parser validate&lt;/code&gt; or linting with &lt;a href=&quot;http://puppet-lint.com/&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;puppet-lint&lt;/code&gt;&lt;/a&gt;. These types of tests often run in a Continuous Integration environment to verify that the code meets a set of specified criteria.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Catalog testing – &lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; and Puppet’s &lt;a href=&quot;https://forge.puppet.com/puppetlabs/catalog_preview&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;catalog_preview&lt;/code&gt;&lt;/a&gt; module both allow comparison of catalogs produced by two different Puppet versions or between two environments.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;GitHub needed a catalog testing approach that could run from a development or CI environment, because for security reasons only a small number of engineers have direct access to the Puppet master. Because &lt;code class=&quot;highlighter-rouge&quot;&gt;catalog_preview&lt;/code&gt; is designed to be fully integrated into the Puppet master, it would be inaccessible for a large portion of GitHub’s Puppet contributors, and as such it was not the right fit. Therefore, we embarked upon our own development of a tool that could run independently of a Puppet installation, and produced &lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;octocatalog-diff&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;This screenshot shows &lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; in action, as run from a developer’s workstation:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;octocatalog-diff command line&quot; src=&quot;/images/announcing-octocatalog-diff/octocatalog-diff-screenshot.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;In this example, the developer is comparing the Puppet catalog changes between the master branch and the Puppet code in the current working directory. Two resources are being created (an Exec resource to create the mount point, and a Filesystem resource to format &lt;code class=&quot;highlighter-rouge&quot;&gt;/dev/xvdf&lt;/code&gt;). Two resources are being removed (the old Exec resource to change permissions on the work directory, and the old Filesystem on &lt;code class=&quot;highlighter-rouge&quot;&gt;/dev/xvdb&lt;/code&gt;). And one resource is being changed (several parameters of the mount point are being updated).&lt;/p&gt;

&lt;p&gt;The output was generated in under 15 seconds, obviating a traditional workflow of committing code, waiting for CI jobs to pass, deploying code to a node, and reviewing the results. The process that generated this output did not require access to, or put any load on, either the Puppet master or the node whose catalog was computed.&lt;/p&gt;

&lt;p&gt;The next graphic shows output from &lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; when run via a distributed CI job, to preview the effects of a code change on nodes across the fleet:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;octocatalog-diff distributed CI&quot; src=&quot;/images/announcing-octocatalog-diff/octocatalog-diff-ci-screenshot.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;In this example, the developer wishes to see which systems will be affected by a particular change to the Puppet code. The output from &lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; reveals that the changes affect certain GitHub API nodes. The developer can use this information to test deployment on just those six representative systems instead of hundreds or thousands of nodes. This cuts down on unnecessary testing and provides confidence that there will not be unexpected side effects, allowing the developer to complete the work more efficiently and with less risk.&lt;/p&gt;

&lt;h3 id=&quot;octocatalog-diff-key-features&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; key features&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; has several useful features:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Comparing catalogs generated by two branches of a Git repository&lt;/li&gt;
  &lt;li&gt;Predicting differences due to fact changes by allowing the developer to override facts&lt;/li&gt;
  &lt;li&gt;Comparing the content differences of static files, not just the path differences&lt;/li&gt;
  &lt;li&gt;Caching base catalogs to allow subsequent runs to complete faster&lt;/li&gt;
  &lt;li&gt;Ignoring selected types, titles, or parameters to suppress meaningless or known changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; is able to compare catalogs obtained in the following ways:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Compiling a catalog from Puppet code (the most common use case)&lt;/li&gt;
  &lt;li&gt;Reading in a JSON file containing a compiled catalog&lt;/li&gt;
  &lt;li&gt;Retrieving the last known catalog for a node from PuppetDB&lt;/li&gt;
  &lt;li&gt;Querying the Puppet master server for the catalog via its API&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;octocatalog-diff-at-github&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; at GitHub&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; is being used in “catalog only” mode as a Continuous Integration (CI) job in GitHub’s Puppet repository. Upon every push, &lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; compiles the catalogs for over 50 critical roles, using real node names and facts, to ensure that changes to one role do not unexpectedly break Puppet catalogs for other roles. In addition, developers use &lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; in “difference” mode to preview their changes across the fleet, which has enabled them to perform major refactoring with minimal risk.&lt;/p&gt;

&lt;p&gt;Over the past year, GitHub has successfully upgraded from Puppet 3.4 to 4.5, migrated hard-coded parameters from thousands of manifests into the &lt;code class=&quot;highlighter-rouge&quot;&gt;hiera&lt;/code&gt; hierarchical data store, transitioned node classification from hostname regular expressions to application and roles, expanded roles to run in different environments and in containers, and upgraded roles to run under new operating systems. Using &lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; to predict changes across the fleet, a relatively small number of developers accomplished these substantial initiatives quickly and without their Puppet changes causing outages.&lt;/p&gt;

&lt;h3 id=&quot;open-source&quot;&gt;Open source&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; is &lt;a href=&quot;https://github.com/github/octocatalog-diff&quot;&gt;released&lt;/a&gt; to the open source community &lt;a href=&quot;https://github.com/github/octocatalog-diff/blob/master/LICENSE&quot;&gt;under the MIT license&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While we find &lt;code class=&quot;highlighter-rouge&quot;&gt;octocatalog-diff&lt;/code&gt; to be reliable for our needs, there are undoubtedly configurations or customizations within others’ Puppet code bases that we have not anticipated. We welcome community participation and contributions, and look forward to enhancing the compatibility and functionality of the tool.&lt;/p&gt;

&lt;h3 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h3&gt;

&lt;p&gt;We acknowledge and thank the Site Reliability Engineering team at GitHub for their suggestions and code reviews, and the other engineers at GitHub who worked patiently with us to diagnose problems and test improvements during the pre-production stages.&lt;/p&gt;</content><author><name>kpaulisse</name></author><summary type="html">Today we are announcing the open source release of octocatalog-diff: GitHub’s Puppet development and testing tool.</summary></entry><entry><title type="html">Introducing the GitHub Load Balancer</title><link href="http://githubengineering.com/introducing-glb/" rel="alternate" type="text/html" title="Introducing the GitHub Load Balancer" /><published>2016-09-22T00:00:00+00:00</published><updated>2016-09-22T00:00:00+00:00</updated><id>http://githubengineering.com/introducing-glb</id><content type="html" xml:base="http://githubengineering.com/introducing-glb/">&lt;p&gt;At GitHub we serve billions of HTTP, Git and SSH connections each day. To get the best performance we run on &lt;a href=&quot;http://githubengineering.com/githubs-metal-cloud/&quot;&gt;bare metal hardware&lt;/a&gt;. Historically one of the more complex components has been our load balancing tier. Traditionally we scaled this vertically, running a small set of very large machines running &lt;a href=&quot;http://www.haproxy.org/&quot;&gt;haproxy&lt;/a&gt;, and using a very specific hardware configuration allowing dedicated 10G link failover. Eventually we needed a solution that was scalable and we set out to create a load balancer solution that would run on commodity hardware in our typical data center configuration.&lt;/p&gt;

&lt;p&gt;Over the last year we’ve developed our new load balancer, called GLB (GitHub Load Balancer). Today, and over the next few weeks, we will be sharing the design and releasing its components as open source software.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;GLB&quot; src=&quot;/images/introducing-glb/glb-logo-dark.png&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;out-with-the-old-in-with-the-new&quot;&gt;Out with the old, in with the new&lt;/h2&gt;

&lt;p&gt;GitHub is growing and our monolithic, vertically scaled load balancer tier had met its match and a new approach was required. Our original design was based around a small number of large machines each with dedicated links to our network spine. This design tied networking gear, the load balancing hosts and load balancer configuration together in such a way that scaling horizontally was deemed too difficult. We set out to find a better way.&lt;/p&gt;

&lt;p&gt;We first identified the goals of the new system, design pitfalls of the existing system and prior art that we could draw &lt;a href=&quot;http://www.linuxvirtualserver.org/&quot;&gt;experience&lt;/a&gt; and &lt;a href=&quot;https://www.youtube.com/watch?v=dKsOvc73gQk&quot;&gt;inspiration&lt;/a&gt; from. After some time we determined that the following would produce a successful load balancing tier that we could maintain into the future:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Runs on commodity hardware&lt;/li&gt;
  &lt;li&gt;Scales horizontally&lt;/li&gt;
  &lt;li&gt;Supports high availability, avoids breaking TCP connections during normal operation and failover&lt;/li&gt;
  &lt;li&gt;Supports connection draining&lt;/li&gt;
  &lt;li&gt;Per service load balancing, with support for multiple services per load balancer host&lt;/li&gt;
  &lt;li&gt;Can be iterated on and deployed like normal software&lt;/li&gt;
  &lt;li&gt;Testable at each layer, not just integration tests&lt;/li&gt;
  &lt;li&gt;Built for multiple POPs and data centers&lt;/li&gt;
  &lt;li&gt;Resilient to typical DDoS attacks, and tools to help mitigate new attacks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;design&quot;&gt;Design&lt;/h2&gt;

&lt;p&gt;To achieve these goals we needed to rethink the relationship between IP addresses and hosts, the constituent layers of our load balancing tier and how connections are routed, controlled and terminated.&lt;/p&gt;

&lt;h3 id=&quot;stretching-an-ip&quot;&gt;Stretching an IP&lt;/h3&gt;

&lt;p&gt;In a typical setup, you assign a single public-facing IP address to a single physical machine. DNS can then be used to split traffic over multiple IPs, letting you shard traffic across multiple servers. Unfortunately, DNS entries are cached fairly aggressively (often ignoring the TTL), and some of our users may specifically whitelist or hardcode IP addresses. Additionally, we offer a certain set of IPs for our Pages service that customers can use directly for their apex domain. Rather than adding more IPs to increase capacity, and accepting that an IP address fails whenever its single server fails, we wanted a solution that would allow a single IP address to be served by multiple physical machines.&lt;/p&gt;

&lt;p&gt;Routers have a feature called Equal-Cost Multi-Path (ECMP) routing, which is designed to split traffic destined for a single IP across multiple links of equal cost. ECMP works by hashing certain components of an incoming packet such as the source and destination IP addresses and ports. By using a consistent hash for this, subsequent packets that are part of the same TCP flow will hash to the same path, avoiding out of order packets and maintaining session affinity.&lt;/p&gt;
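&lt;p&gt;The per-flow hashing described above can be sketched in a few lines of Python. This is an illustrative model, not router behavior: real ECMP implementations hash in hardware, and the field selection and hash function here are assumptions.&lt;/p&gt;

```python
import hashlib

def ecmp_link(src_ip, src_port, dst_ip, dst_port, links):
    # Hash the flow 4-tuple. Because the hash input is identical for
    # every packet of a given TCP flow, each flow sticks to one link.
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return links[digest % len(links)]

links = ["link-a", "link-b", "link-c", "link-d"]
flow = ("203.0.113.7", 52311, "192.0.2.80", 443)

# Repeated packets of the same flow always map to the same link,
# avoiding reordering and preserving session affinity.
assert ecmp_link(*flow, links) == ecmp_link(*flow, links)
```

&lt;p&gt;Different flows hash to different links, spreading traffic roughly evenly, while any one flow never changes links as long as the link set is stable.&lt;/p&gt;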

&lt;p&gt;This works great for routing packets across multiple paths to the same physical destination server. Where it gets interesting is when you use ECMP to split traffic destined for a single IP across multiple physical servers, each of which terminates TCP connections but shares no state, as in a load balancer. When one of these servers fails or is taken out of rotation and removed from the ECMP server set, a &lt;a href=&quot;https://en.wikipedia.org/wiki/Consistent_hashing&quot;&gt;rehash event occurs&lt;/a&gt;: 1/N connections get reassigned to the remaining servers. Since these servers don’t share connection state, those connections are terminated. Unfortunately, they may not be the same 1/N connections that were mapped to the failing server. Additionally, there is no way to gracefully remove a server for maintenance without also disrupting 1/N active connections.&lt;/p&gt;

&lt;h3 id=&quot;l4l7-split-design&quot;&gt;L4/L7 split design&lt;/h3&gt;

&lt;p&gt;A pattern that has been used by other projects is to split the load balancers into a L4 and L7 tier. At the L4 tier, the routers use ECMP to shard traffic using consistent hashing to a set of L4 load balancers - typically using software like &lt;a href=&quot;http://www.linuxvirtualserver.org/software/ipvs.html&quot;&gt;ipvs/LVS&lt;/a&gt;. LVS keeps connection state, and optionally syncs connection state with multicast to other L4 nodes, and forwards traffic to the L7 tier which runs software such as haproxy. We call the L4 tier “director” hosts since they direct traffic flow, and the L7 tier “proxy” hosts, since they proxy connections to backend servers.&lt;/p&gt;

&lt;p&gt;This L4/L7 split has an interesting benefit: the proxy tier nodes can now be removed from rotation by gracefully draining existing connections, since the connection state on the director nodes will keep existing connections mapped to their existing proxy server, even after they are removed from rotation for new connections. Additionally, the proxy tier tends to be the one that requires more upkeep due to frequent configuration changes, upgrades and scaling so this works to our advantage.&lt;/p&gt;

&lt;p&gt;If multicast connection syncing is used, then the L4 load balancer nodes handle failure slightly more gracefully: once a connection has been synced to the other L4 nodes, it will no longer be disrupted. Without connection syncing, provided the director nodes hash connections the same way and have the same backend set, connections may successfully survive a director node failure. In practice, most installations of this tiered design simply accept connection disruption under node failure or node maintenance.&lt;/p&gt;

&lt;p&gt;Unfortunately, using LVS for the director tier has some significant drawbacks. Firstly, multicast was not something we wanted to support, so we would be relying on the nodes having the same view of the world, and having consistent hashing to the backend nodes. Without connection syncing, certain events, including planned maintenance of nodes, could cause connection disruption. Connection disruption is something we wanted to avoid due to how git cannot retry or resume if the connection is severed mid-flight. Finally, the fact that the director tier requires connection state at all adds an extra complexity to DDoS mitigation such as &lt;a href=&quot;http://githubengineering.com/syn-flood-mitigation-with-synsanity/&quot;&gt;synsanity&lt;/a&gt; - to avoid resource exhaustion, syncookies would now need to be generated on the director nodes, despite the fact that the connections themselves are terminated on the proxy nodes.&lt;/p&gt;

&lt;h3 id=&quot;designing-a-better-director&quot;&gt;Designing a better director&lt;/h3&gt;

&lt;p&gt;We decided early on in the design of our load balancer that we wanted to improve on the common pattern for the director tier. We set out to design a new director tier that was stateless and allowed both director and proxy nodes to be gracefully removed from rotation without disruption to users wherever possible. Many of our users live in countries with less-than-ideal internet connectivity, and it was important to us that long-running clones of reasonably sized repositories, completing within a reasonable time limit, would not fail during planned maintenance.&lt;/p&gt;

&lt;p&gt;The design we settled on, and now use in production, is a variant of &lt;a href=&quot;https://en.wikipedia.org/wiki/Rendezvous_hashing&quot;&gt;Rendezvous hashing&lt;/a&gt; that supports constant time lookups. We start by storing each proxy host and assigning it a state. These states handle the connection draining aspect of our design goals and will be discussed further in a future post. We then generate a single, fixed-size forwarding table and fill each row with a set of proxy servers using the ordering component of Rendezvous hashing. This table, along with the proxy states, is sent to all director servers and kept in sync as proxies come and go. When a TCP packet arrives on the director, we hash the source IP to generate a consistent index into the forwarding table. We then encapsulate the packet inside another IP packet (actually &lt;a href=&quot;https://lwn.net/Articles/614348/&quot;&gt;Foo-over-UDP&lt;/a&gt;) destined to the internal IP of the proxy server, and send it over the network. The proxy server receives the encapsulated packet, decapsulates it, and processes the original packet locally. Any outgoing packets use Direct Server Return, meaning packets destined to the client egress directly to the client, completely bypassing the director tier.&lt;/p&gt;
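&lt;p&gt;The forwarding-table idea can be sketched as follows. This is a simplified model under stated assumptions: a small table, a single winning proxy per row rather than GLB’s ordered set of proxies, SHA-256 as the rendezvous weight function, and none of the proxy-state or encapsulation handling.&lt;/p&gt;

```python
import hashlib

TABLE_SIZE = 256  # fixed-size forwarding table (illustrative size)

def weight(row, proxy):
    # Rendezvous weight: hash the (row, proxy) pair. For each row, the
    # proxy with the highest weight "wins" that row.
    data = f"{row}/{proxy}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def build_table(proxies, size=TABLE_SIZE):
    return [max(proxies, key=lambda p: weight(row, p)) for row in range(size)]

def lookup(table, src_ip):
    # Constant-time lookup: hash only the source IP into a row index.
    h = int.from_bytes(hashlib.sha256(src_ip.encode()).digest(), "big")
    return table[h % len(table)]

proxies = ["proxy-1", "proxy-2", "proxy-3", "proxy-4"]
table = build_table(proxies)

# Removing one proxy only reassigns the rows that proxy had won; every
# other row keeps its winner, so most flows keep their proxy assignment.
shrunk = build_table([p for p in proxies if p != "proxy-4"])
assert all(shrunk[i] == table[i] for i in range(TABLE_SIZE) if table[i] != "proxy-4")
```

&lt;p&gt;The stability property in the final assertion is what makes draining graceful: rendezvous ordering guarantees that removing a proxy never changes rows won by the surviving proxies.&lt;/p&gt;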

&lt;h2 id=&quot;stay-tuned&quot;&gt;Stay tuned&lt;/h2&gt;

&lt;p&gt;Now that you have a taste of the system that processed and routed the request to this blog post, we hope you stay tuned for future posts describing our director design in depth, how we improved haproxy hot configuration reloads, and how we managed to migrate to the new system without anyone noticing.&lt;/p&gt;</content><author><name>{&quot;username&quot;=&gt;&quot;joewilliams&quot;, &quot;fullname&quot;=&gt;&quot;Joe Williams&quot;, &quot;role&quot;=&gt;&quot;Senior Infrastructure Engineer&quot;, &quot;twitter&quot;=&gt;&quot;williamsjoe&quot;, &quot;links&quot;=&gt;[{&quot;name&quot;=&gt;&quot;GitHub Profile&quot;, &quot;url&quot;=&gt;&quot;https://github.com/joewilliams&quot;}, {&quot;name&quot;=&gt;&quot;Twitter Profile&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/williamsjoe&quot;}, {&quot;name&quot;=&gt;&quot;Blog&quot;, &quot;url&quot;=&gt;&quot;http://joeandmotorboat.com/&quot;}]}</name></author><summary type="html">At GitHub we serve billions of HTTP, Git and SSH connections each day. To get the best performance we run on bare metal hardware. Historically one of the more complex components has been our load balancing tier. Traditionally we scaled this vertically, running a small set of very large machines running haproxy, and using a very specific hardware configuration allowing dedicated 10G link failover.
Eventually we needed a solution that was scalable and we set out to create a load balancer solution that would run on commodity hardware in our typical data center configuration.</summary></entry><entry><title type="html">The GitHub GraphQL API</title><link href="http://githubengineering.com/the-github-graphql-api/" rel="alternate" type="text/html" title="The GitHub GraphQL API" /><published>2016-09-14T00:00:00+00:00</published><updated>2016-09-14T00:00:00+00:00</updated><id>http://githubengineering.com/the-github-graphql-api</id><content type="html" xml:base="http://githubengineering.com/the-github-graphql-api/">&lt;p&gt;GitHub announced a public API &lt;a href=&quot;https://github.com/blog/21-the-api&quot;&gt;one month after the site launched&lt;/a&gt;. We’ve evolved this platform through three versions, adhering to RFC standards and embracing new design patterns to provide a clear and consistent interface. We’ve often heard that our REST API was an inspiration for other companies; countless tutorials refer to our endpoints. Today, we’re excited to announce our biggest change to the API since we snubbed XML in favor of JSON: we’re making the GitHub API available through GraphQL.&lt;/p&gt;

&lt;p&gt;GraphQL is, at its core, a specification for a data querying language. We’d like to talk a bit about GraphQL, including the problems we believe it solves and the opportunities it provides to integrators.&lt;/p&gt;

&lt;h2 id=&quot;why&quot;&gt;Why?&lt;/h2&gt;

&lt;p&gt;You may be wondering why we chose to start supporting GraphQL. Our API was designed to be RESTful and hypermedia-driven. We’re fortunate to have &lt;a href=&quot;https://developer.github.com/libraries/&quot;&gt;dozens of different open-source clients&lt;/a&gt; written in a plethora of languages. Businesses grew around these endpoints.&lt;/p&gt;

&lt;p&gt;Like most technology, REST is not perfect and has some drawbacks. Our ambition to change our API focused on solving two problems.&lt;/p&gt;

&lt;p&gt;The first was scalability. The REST API is responsible for over 60% of the requests made to our database tier. This is partly because, by its nature, hypermedia navigation requires a client to repeatedly communicate with a server so that it can get all the information it needs. Our responses were bloated and filled with all sorts of &lt;code class=&quot;highlighter-rouge&quot;&gt;*_url&lt;/code&gt; hints in the JSON responses to help people continue to navigate through the API to get what they needed. Despite all the information we provided, we heard from integrators that our REST API also wasn’t very flexible. It sometimes required two or three separate calls to assemble a complete view of a resource. It seemed like our responses simultaneously sent too much data &lt;em&gt;and&lt;/em&gt; didn’t include data that consumers needed.&lt;/p&gt;

&lt;p&gt;As we began to audit our endpoints in preparation for an APIv4, we encountered our second problem. We wanted to collect some meta-information about our endpoints. For example, we wanted to identify the OAuth scopes required for each endpoint. We wanted to be smarter about how our resources were paginated. We wanted assurances of type-safety for user-supplied parameters. We wanted to generate documentation from our code. We wanted to generate clients instead of manually supplying patches to &lt;a href=&quot;http://octokit.github.io/&quot;&gt;our Octokit suite&lt;/a&gt;. We studied a variety of API specifications built to make some of this easier, but we found that none of the standards totally matched our requirements.&lt;/p&gt;

&lt;p&gt;And then we learned about GraphQL.&lt;/p&gt;

&lt;h2 id=&quot;the-switch&quot;&gt;The switch&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://graphql.org/&quot;&gt;GraphQL&lt;/a&gt; is a querying language developed by Facebook over the course of several years. In essence, you construct your request by defining the resources you want. You send this via a &lt;code class=&quot;highlighter-rouge&quot;&gt;POST&lt;/code&gt; to a server, and the response matches the format of your request.&lt;/p&gt;

&lt;p&gt;For example, say you wanted to fetch just a few attributes off of a user. Your GraphQL query might look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-graphql&quot;&gt;{
  viewer {
    login
    bio
    location
    isBountyHunter
  }
}
&lt;/code&gt;&lt;/pre&gt;
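&lt;p&gt;Mechanically, the query above is just a string wrapped in a JSON document and sent in a &lt;code class=&quot;highlighter-rouge&quot;&gt;POST&lt;/code&gt; body. A minimal client-side sketch follows; the payload shape (a top-level &lt;code class=&quot;highlighter-rouge&quot;&gt;query&lt;/code&gt; key) is the common GraphQL convention, and endpoint and authentication details are deliberately omitted.&lt;/p&gt;

```python
import json

# The GraphQL query travels verbatim as a string.
query = """
{
  viewer {
    login
    bio
    location
    isBountyHunter
  }
}
"""

# Conventional GraphQL request body: a JSON object with a "query" key.
# An HTTP client would POST this, with an Authorization header carrying
# an OAuth token, to the API's GraphQL endpoint.
payload = json.dumps({"query": query})
```

&lt;p&gt;Because the request is an ordinary JSON document over HTTP, any existing HTTP client library can speak GraphQL without special tooling.&lt;/p&gt;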

&lt;p&gt;And the response back might look like this:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;data&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;viewer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;login&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;octocat&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;bio&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;I've been around the world, from London to the Bay.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;location&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;San Francisco, CA&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;isBountyHunter&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;You can see that the keys and values in the JSON response match right up with the terms in the query string.&lt;/p&gt;

&lt;p&gt;What if you wanted something more complicated? Let’s say you wanted to know how many repositories you’ve starred. You also want to get the names of your first three repositories, as well as their total number of stars, total number of forks, total number of watchers, and total number of open issues. That query might look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-graphql&quot;&gt;{
  viewer {
    login
    starredRepositories {
      totalCount
    }
    repositories(first: 3) {
      edges {
        node {
          name
          stargazers {
            totalCount
          }
          forks {
            totalCount
          }
          watchers {
            totalCount
          }
          issues(states:[OPEN]) {
            totalCount
          }
        }
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The response from our API might be:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  
  &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;data&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  
    &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;viewer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  
      &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;login&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;octocat&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;starredRepositories&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;totalCount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;131&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;repositories&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;edges&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
          &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;node&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;octokit.rb&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;stargazers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;totalCount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;17&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;forks&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;totalCount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;watchers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;totalCount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;issues&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;totalCount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
          &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
          &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  
            &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;node&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;octokit.objc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;stargazers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;totalCount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;forks&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;totalCount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;watchers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;totalCount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;issues&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;totalCount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
          &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
          &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;node&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;octokit.net&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;stargazers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;totalCount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;19&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;forks&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;totalCount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;watchers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;totalCount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;issues&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;totalCount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
          &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;You just made &lt;em&gt;one&lt;/em&gt; request to fetch all the data you wanted.&lt;/p&gt;

&lt;p&gt;This type of design benefits clients for which smaller payload sizes are essential. For example, a mobile app could simplify its requests by asking only for the data it needs. This enables new possibilities and workflows that are freed from the limitations of downloading and parsing massive JSON blobs.&lt;/p&gt;

&lt;p&gt;Query analysis is something that we’re also exploring. Based on the resources that are requested, we can start providing more intelligent information to clients. For example, say you’ve made the following request:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-graphql&quot;&gt;{
  viewer {
    login
    email
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Before executing the request, the GraphQL server notes that you’re trying to get the &lt;code class=&quot;highlighter-rouge&quot;&gt;email&lt;/code&gt; field. If your client is misconfigured, a response back from our server might look like this:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;data&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;viewer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;login&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;octocat&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;errors&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&quot;message&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Your token has not been granted the required scopes to
      execute this query. The 'email' field requires one of the following
      scopes: ['user'], but your token has only been granted the: ['gist']
      scopes. Please modify your token's scopes at: https://github.com/settings/tokens.&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This could be beneficial for users concerned about the OAuth scopes that integrators request. Insight into the required scopes could help ensure that an integration asks for only the permissions and data it actually needs.&lt;/p&gt;

&lt;p&gt;There are several other features of GraphQL that we hope to make available to clients, such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The ability to &lt;em&gt;batch requests&lt;/em&gt;, where you can define dependencies between two separate queries and fetch data efficiently.&lt;/li&gt;
  &lt;li&gt;The ability to &lt;em&gt;create subscriptions&lt;/em&gt;, where your client can receive new data when it becomes available.&lt;/li&gt;
  &lt;li&gt;The ability to &lt;em&gt;defer data&lt;/em&gt;, where you choose to mark a part of your response as time-insensitive.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;defining-the-schema&quot;&gt;Defining the schema&lt;/h2&gt;

&lt;p&gt;In order to determine if GraphQL really was a technology we wanted to embrace, we formed a small team within the broader Platform organization and went looking for a feature on the site we wanted to build using GraphQL. We decided that implementing &lt;a href=&quot;https://github.com/blog/2119-add-reactions-to-pull-requests-issues-and-comments&quot;&gt;emoji reactions on comments&lt;/a&gt; was concise enough to try and port to GraphQL. Choosing a subset of the site to power with GraphQL required us to model a complete workflow and focus on building the new objects and types that defined our GraphQL schema. For example, we started by constructing a user in our schema, moved on to a repository, and then expanded to issues within a repository. Over time, we grew the schema to encapsulate all the actions necessary for modeling reactions.&lt;/p&gt;

&lt;p&gt;We found implementing a GraphQL server to be very straightforward. &lt;a href=&quot;https://facebook.github.io/graphql/&quot;&gt;The Spec&lt;/a&gt; is clearly written and succinctly describes the behaviors of various parts of a schema. GraphQL has a type system that forces the server to be unambiguous about requests it receives and responses it produces. You define a schema, describing the objects that represent your resources, fields on those objects, and the connections between various objects. For example, a &lt;code class=&quot;highlighter-rouge&quot;&gt;Repository&lt;/code&gt; object has a non-null &lt;code class=&quot;highlighter-rouge&quot;&gt;String&lt;/code&gt; field for the &lt;code class=&quot;highlighter-rouge&quot;&gt;name&lt;/code&gt;. A repository also has &lt;code class=&quot;highlighter-rouge&quot;&gt;watchers&lt;/code&gt;, which is a connection to another non-nullable object, &lt;code class=&quot;highlighter-rouge&quot;&gt;User&lt;/code&gt;.&lt;/p&gt;
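
&lt;p&gt;To make the type-system idea concrete, here is a toy Ruby model of that nullability rule. It is an illustration only, not our actual schema implementation; the field table and helper function are invented for this sketch.&lt;/p&gt;

```ruby
# Toy model of a GraphQL-style nullability check, not a real DSL.
# A Repository has a non-null String field called name, so a response
# containing a Repository whose name is nil violates the schema.
REPOSITORY_FIELDS = {
  name:     { type: 'String', non_null: true },
  homepage: { type: 'String', non_null: false }
}

# Returns the names of fields whose nullability rule is violated.
def nullability_violations(fields, record)
  fields.select { |field, meta| meta[:non_null] and record[field].nil? }.keys
end

p nullability_violations(REPOSITORY_FIELDS, { name: 'octokit.rb' }) # []
p nullability_violations(REPOSITORY_FIELDS, { name: nil })          # [:name]
```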

&lt;p&gt;Although the initial team exploring GraphQL worked mostly on the backend, we had several allies on the frontend who were also interested in GraphQL, and, specifically, moving parts of GitHub to use &lt;a href=&quot;https://facebook.github.io/relay/&quot;&gt;Relay&lt;/a&gt;. They too were seeking better ways to access user data and present it more efficiently on the website. We began to work together to continue finding portions of the site that would be easy to communicate with via our nascent GraphQL schema. We decided to begin transforming some of our social features, such as the profile page, the stars counter, and the ability to watch repositories. These initial explorations paved the way to placing GraphQL in production. (That’s right! We’ve been running GraphQL in production for some time now.) As time went on, we began to get a bit more ambitious: we ported over some of the Git commit history pages to GraphQL and used &lt;a href=&quot;http://githubengineering.com/scientist/&quot;&gt;Scientist&lt;/a&gt; to identify any potential discrepancies.&lt;/p&gt;

&lt;p&gt;Drawing on our experiences supporting the REST API, we worked quickly to adapt our existing services to GraphQL. This included setting up request logging and exception reporting, OAuth and AuthZ access, rate limiting, and helpful error responses. We tested our schema to ensure that every part of it was documented, and we wrote linters to ensure that our naming structure was standardized.&lt;/p&gt;

&lt;h2 id=&quot;open-source&quot;&gt;Open source&lt;/h2&gt;

&lt;p&gt;We work primarily in Ruby, and we were grateful for the existing gems supporting GraphQL. We used the &lt;a href=&quot;https://github.com/rmosolgo/graphql-ruby&quot;&gt;rmosolgo/graphql-ruby&lt;/a&gt; gem to implement &lt;strong&gt;the entirety&lt;/strong&gt; of our schema. We also incorporated the &lt;a href=&quot;https://github.com/Shopify/graphql-batch&quot;&gt;Shopify/graphql-batch&lt;/a&gt; gem to ensure that multiple records and relationships were fetched efficiently.&lt;/p&gt;
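
&lt;p&gt;The batching idea can be sketched in plain Ruby: queue record ids during resolution, then satisfy them all with a single fetch. The class below is a hand-rolled illustration, not the Shopify/graphql-batch API; the Hash stands in for a database table.&lt;/p&gt;

```ruby
# Illustrative batching loader: ids queued with load are resolved
# together, so N requested records cost one fetch instead of N.
class BatchLoader
  attr_reader :fetch_count

  def initialize(store)
    @store = store        # stands in for a database table
    @pending = []         # ids queued during field resolution
    @fetch_count = 0
  end

  def load(id)
    @pending.push(id)     # queue the id; nothing is fetched yet
  end

  def resolve
    @fetch_count += 1     # a single fetch for every queued id
    @pending.uniq.map { |id| [id, @store[id]] }.to_h
  end
end

loader = BatchLoader.new({ r1: 'octokit.rb', r2: 'octokit.net' })
loader.load(:r1)
loader.load(:r2)
loader.load(:r1)          # duplicate ids collapse into one lookup
records = loader.resolve
p records.length          # 2
p loader.fetch_count      # 1
```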

&lt;p&gt;Our frontend and backend engineers were also able to contribute to these gems as we experimented with them. We’re thankful to the maintainers for their very quick work in accepting our patches. To that end, we’d like to humbly offer a couple of our own open source projects:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/github/graphql-client&quot;&gt;github/graphql-client&lt;/a&gt;, a client that can be integrated into Rails for rendering GraphQL-backed views.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/github/github-graphql-rails-example&quot;&gt;github/github-graphql-rails-example&lt;/a&gt;, a small app built with Rails that demonstrates how you might interact with our GraphQL schema.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re going to continue to extract more parts of our system that we’ve developed internally and release them as open source software, such as our loaders that efficiently batch ActiveRecord requests.&lt;/p&gt;

&lt;h2 id=&quot;the-future&quot;&gt;The future&lt;/h2&gt;

&lt;p&gt;The move to GraphQL marks a larger shift in our Platform strategy to be more transparent and more flexible. Over the next year, we’re going to keep iterating on our schema to bring it out of Early Access and into wider production readiness.&lt;/p&gt;

&lt;p&gt;Since our application engineers are using the same GraphQL platform that we’re making available to our integrators, this provides us with the opportunity to ship UI features &lt;em&gt;in conjunction with&lt;/em&gt; API access. Our new Projects feature is a good example of this: the UI on the site is powered by GraphQL, and you can already use the feature programmatically. Using GraphQL on the frontend and backend eliminates the gap between what we release and what you can consume. We really look forward to making more of these simultaneous releases.&lt;/p&gt;

&lt;p&gt;GraphQL represents a massive leap forward for API development. Type safety, introspection, generated documentation, and predictable responses benefit both the maintainers and consumers of our platform. We’re looking forward to our new era of a GraphQL-backed platform, and we hope that you do, too!&lt;/p&gt;

&lt;p&gt;If you’d like to get started with GraphQL—including our new GraphQL Explorer that lets you make :sparkles:live queries:sparkles:, check out &lt;a href=&quot;https://developer.github.com/early-access/graphql&quot;&gt;our developer documentation&lt;/a&gt;!&lt;/p&gt;</content><author><name>{&quot;username&quot;=&gt;&quot;gjtorikian&quot;, &quot;role&quot;=&gt;&quot;Platform Engineer&quot;, &quot;twitter&quot;=&gt;&quot;gjtorikian&quot;, &quot;links&quot;=&gt;[{&quot;name&quot;=&gt;&quot;GitHub Profile&quot;, &quot;url&quot;=&gt;&quot;https://github.com/gjtorikian&quot;}, {&quot;name&quot;=&gt;&quot;Twitter Profile&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/gjtorikian&quot;}]}</name></author><summary type="html">GitHub announced a public API one month after the site launched. We’ve evolved this platform through three versions, adhering to RFC standards and embracing new design patterns to provide a clear and consistent interface. We’ve often heard that our REST API was an inspiration for other companies; countless tutorials refer to our endpoints. Today, we’re excited to announce our biggest change to the API since we snubbed XML in favor of JSON: we’re making the GitHub API available through GraphQL.</summary></entry><entry><title type="html">Building resilience in Spokes</title><link href="http://githubengineering.com/building-resilience-in-spokes/" rel="alternate" type="text/html" title="Building resilience in Spokes" /><published>2016-09-07T00:00:00+00:00</published><updated>2016-09-07T00:00:00+00:00</updated><id>http://githubengineering.com/building-resilience-in-spokes</id><content type="html" xml:base="http://githubengineering.com/building-resilience-in-spokes/">&lt;p&gt;&lt;a href=&quot;/introducing-dgit/&quot;&gt;Spokes&lt;/a&gt; is the replication system for the file
servers where we store over 38 million Git repositories and over 36 million gists.  It
keeps at least three copies of every repository and every gist so that we
can provide durable, highly available access to content even when servers and networks fail.  Spokes
uses a combination of Git and rsync to replicate, repair, and rebalance
repositories.&lt;/p&gt;

&lt;h2 id=&quot;what-is-spokes&quot;&gt;What is Spokes?&lt;/h2&gt;

&lt;p&gt;Before we get into the topic at hand—building resilience—we have a
new name to announce: DGit is now Spokes.&lt;/p&gt;

&lt;p&gt;Earlier this year, we &lt;a href=&quot;/introducing-dgit/&quot;&gt;announced&lt;/a&gt; “DGit” or
“Distributed Git,” our application-level replication system for Git.  We
got feedback that the name “DGit” wasn’t very distinct and could cause
confusion with the Git project itself.  So we have decided to rename the
system &lt;em&gt;Spokes&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;defining-resilience&quot;&gt;Defining resilience&lt;/h2&gt;

&lt;p&gt;In any system or service, there are two key ways to measure resilience:
availability and durability.  A system’s availability is the fraction of
the time it can provide the service it was designed to provide.  Can it
serve content?  Can it accept writes?  Availability can be partial,
complete, or degraded: is every repository available?  Are some
repositories—or whole servers—slow?&lt;/p&gt;

&lt;p&gt;A system’s durability is its resistance to permanent data loss.  Once the
system has accepted a write—a push, a merge, an edit through the
website, new-repository creation, etc.—it should never corrupt or
revert that content.  The key here is the moment that the system accepts
the write: how many copies are stored, and where?  Enough copies must be
stored for us to believe with some very high probability that the write
will not be lost.&lt;/p&gt;

&lt;p&gt;A system can be durable but not available.  For example, if a system can’t
make the minimum required number of copies of an incoming write, it might
refuse to accept writes.  Such a system would be temporarily unavailable
for writing, while maintaining the promise not to lose data.  Of course,
it is also possible for a system to be available without being durable,
for example, by accepting writes whether or not they can be committed
safely.&lt;/p&gt;

&lt;p&gt;Readers may recognize this as related to the &lt;a href=&quot;https://en.wikipedia.org/wiki/CAP_theorem&quot;&gt;CAP
Theorem&lt;/a&gt;.  In short, a system
can satisfy at most two of these three properties:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;consistency: all nodes see the same data&lt;/li&gt;
  &lt;li&gt;availability: the system can satisfy read and write requests&lt;/li&gt;
  &lt;li&gt;partition tolerance: the system works even when nodes are down or
unable to communicate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spokes puts the highest priority on consistency and partition tolerance.
In worst-case failure scenarios, it will refuse to accept writes that it
cannot commit, synchronously, to at least two replicas.&lt;/p&gt;

&lt;h2 id=&quot;availability&quot;&gt;Availability&lt;/h2&gt;

&lt;p&gt;Spokes’s availability is a function of the availability of underlying
servers and networks, and of our ability to detect and route around server
and network problems.&lt;/p&gt;

&lt;p&gt;Individual servers become unavailable pretty frequently.  Since rolling
out Spokes this past spring, we have had individual servers crash due to a
kernel deadlock and faulty RAM chips.  Sometimes servers provide degraded
service due to lesser hardware faults or high system load.  In all cases,
Spokes must detect the problem quickly and route around it.  Each
repository is replicated on three servers, so there’s almost always an
up-to-date, available replica to route to even if one server is offline.  Spokes is more
than the sum of its individually-failure-prone parts.&lt;/p&gt;

&lt;p&gt;Detecting problems quickly is the first step.  Spokes uses a combination
of heartbeats and real application traffic to determine when a file server
is down.  Using real application traffic is key for two reasons.  First,
heartbeats learn and react slowly.  Each of our file servers handles a
hundred or more incoming requests per second.  A heartbeat that happens
once per second would learn about a failure only after a hundred requests
had already failed.  Second, heartbeats test only a subset of the server’s
functionality: for example, whether or not the server can accept a TCP connection and respond
to a no-op request.  But what if the failure mode is more subtle?  What if
the Git binary is corrupt?  What if disk accesses have stalled?  What if
all authenticated operations are failing?  No-ops can often succeed when
real traffic will fail.&lt;/p&gt;

&lt;p&gt;So Spokes watches for failures during the processing of real application
traffic, and it marks a node as offline if too many requests fail.  Of
course, real requests do fail sometimes.  Someone can try to read a branch
that has already been deleted, or try to push to a branch they don’t have
access to, for example.  So Spokes only marks the node offline if three
requests fail in a row.  That sometimes marks perfectly healthy nodes
offline—three requests can fail in a row just by random chance—but
not often, and the penalty for it is not large.&lt;/p&gt;

&lt;p&gt;Spokes uses heartbeats, too, but not as the primary failure-detection
mechanism.  Instead, heartbeats have two purposes: polling system load and
providing the all-clear signal after a node has been marked as offline.
As soon as a heartbeat succeeds, the node is marked as online again.  If
the heartbeat succeeds despite ongoing server problems (retrieving system
load is almost a no-op), the node
will get marked offline again after three more failed requests.&lt;/p&gt;
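
&lt;p&gt;The detection rules above can be sketched as a tiny state machine: three consecutive failed requests mark a node offline, any success resets the count, and a successful heartbeat is the all-clear. The class and method names are illustrative, not Spokes internals.&lt;/p&gt;

```ruby
# Sketch of consecutive-failure detection with heartbeat recovery.
class NodeHealth
  THRESHOLD = 3

  def initialize
    @consecutive_failures = 0
    @online = true
  end

  def record_success
    @consecutive_failures = 0
  end

  def record_failure
    @consecutive_failures += 1
    # Three in a row can happen by chance, but the penalty for a
    # false positive is small: the node is merely deprioritized.
    @online = false if @consecutive_failures == THRESHOLD
  end

  def heartbeat_succeeded
    @online = true            # the all-clear signal
    @consecutive_failures = 0
  end

  def online?
    @online
  end
end

node = NodeHealth.new
3.times { node.record_failure }
p node.online?                # false
node.heartbeat_succeeded
p node.online?                # true
```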

&lt;p&gt;So Spokes detects that a node is down within about three failed
operations.  That’s still three failed operations too many!  For clean
failures—connections refused or timeouts—all operations know how to
try the next host.  Remember, Spokes has three or more copies of every
repository.  A routing query for a repository returns not one server, but
a list of three (or so) up-to-date replicas, sorted in preference order.
If an operation attempted on the first-choice replica fails, there are usually two other
replicas to try.&lt;/p&gt;
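
&lt;p&gt;A read with failover over that preference-ordered list might look like the following sketch, where a nil response stands in for a refused connection or timeout. All server names are invented.&lt;/p&gt;

```ruby
# Try each replica in preference order and serve from the first one
# that answers; return nil only if every replica fails.
def read_with_failover(replicas)
  replicas.each do |replica|
    result = yield(replica)
    return [replica, result] unless result.nil?
  end
  nil
end

# Simulate the first-choice replica being unreachable.
responses = { fs1: nil, fs2: 'ref: refs/heads/main', fs3: 'ref: refs/heads/main' }
served_by, _data = read_with_failover([:fs1, :fs2, :fs3]) { |r| responses[r] }
p served_by   # :fs2
```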

&lt;p&gt;A graph of operations (here, remote procedure calls, or RPCs) failed over
from one server to another clearly shows when a server is offline.  In
this graph, a single server is unavailable for about 1.5 hours; during
this time, many thousands of RPC operations are redirected to other
servers.  This graph is the single best detector the Spokes team has for
discovering misbehaving servers.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;One server down&quot; src=&quot;/images/building-resilience-in-spokes/one-server-down.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Spokes’s node-offline detection is only advisory—i.e., only an
optimization.  A node that has had three failures in a row just gets moved
to the end of the preference order for all read operations, rather than
removed from the list of replicas.  It’s better for Spokes to try a
probably-offline replica last, than to not try it at all.&lt;/p&gt;
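
&lt;p&gt;That advisory demotion can be sketched as a simple reordering: probably-offline replicas stay in the list but move to the end, so they are still tried last. Names are illustrative.&lt;/p&gt;

```ruby
# Keep every replica in the routing answer, but demote the ones the
# failure detector currently suspects to the end of the order.
def preference_order(replicas, suspected_offline)
  healthy = replicas.reject { |r| suspected_offline.include?(r) }
  healthy + (replicas - healthy)
end

p preference_order([:fs1, :fs2, :fs3], [:fs1]) # [:fs2, :fs3, :fs1]
```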

&lt;p&gt;This failure detector works well for server failures: when a server is
overloaded or offline, operations to it will fail.  Spokes detects those
failures and temporarily stops directing traffic to the failed server
until a heartbeat succeeds.  However, failures of networks and application
(Rails) servers are much messier.  A given file server can appear to be
offline to just a subset of the application servers, or one bad
application server can spuriously determine that every file server is
offline.  So Spokes’s failure detection is actually MxN: each application
server keeps its own list of which file servers are offline, or not.  If
we see many application servers marking a single file server as offline,
then it probably is.  If we see a single application server marking many
file servers offline, then we’ve learned about a fault on that application
server, instead.&lt;/p&gt;
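
&lt;p&gt;The MxN reading of those per-application-server lists can be sketched as vote counting. The majority threshold below is an assumption for illustration; the post does not specify the exact heuristic.&lt;/p&gt;

```ruby
# A file server is the likely fault if a majority of application
# servers independently report it offline.
def likely_file_server_fault?(reports, file_server)
  votes = reports.values.count { |list| list.include?(file_server) }
  majority = reports.size / 2 + 1
  [votes, majority].min == majority # true when votes reach a majority
end

reports = {
  app1: [:dfs4],
  app2: [:dfs4],
  app3: [:dfs4],
  app4: []      # one healthy app server saw no failures at all
}
p likely_file_server_fault?(reports, :dfs4) # true
p likely_file_server_fault?(reports, :dfs1) # false
```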

&lt;p&gt;The figure below illustrates the MxN nature of failure detection and shows
in red which failure detectors are true if a single file server, &lt;code class=&quot;highlighter-rouge&quot;&gt;dfs4&lt;/code&gt;, is
offline.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;MxN failure detection&quot; src=&quot;/images/building-resilience-in-spokes/mxn.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;In one recent incident, a single front-end application server in a staging
environment lost its ability to resolve the DNS names of the file servers.
Because it couldn’t reach the file servers to send them RPC operations or
heartbeats, it concluded that every file server was offline.  But that
incorrect determination was limited to that one application server; all
other application servers worked normally.  So the flaky application
server was immediately obvious in the RPC-failover graphs, and no
production traffic was affected.&lt;/p&gt;

&lt;h2 id=&quot;durability&quot;&gt;Durability&lt;/h2&gt;

&lt;p&gt;Sometimes, servers fail.  Disks can fail; RAID controllers can fail; even
entire servers or entire racks can fail.  Spokes provides durability for
repository data even in the face of such adversity.&lt;/p&gt;

&lt;p&gt;The basic building block of durability, like availability, is
replication.  Spokes keeps at least three copies of every repository,
wiki, and gist, and those copies are in different racks.  No updates to a
repository—pushes, renames, edits to a wiki, etc.—are accepted
unless a strict majority of the replicas can apply the change and get the
same result.&lt;/p&gt;

&lt;p&gt;Spokes needs just one extra copy to survive a single-node failure.  So why
a majority?  It’s possible, even common, for a repository to get multiple
writes at roughly the same time.  Those writes might conflict: one user
might delete a branch while another user pushes new commits to that same
branch, for example.  Conflicting writes must be serialized—that is,
they have to be applied (or rejected) in the same order on every replica,
so every replica gets the same result.  The way Spokes serializes writes
is by ensuring that every write acquires an exclusive lock on a majority
of replicas.  It’s impossible for two writes to acquire a majority at the
same time, so Spokes eliminates conflicts by eliminating concurrent writes
entirely.&lt;/p&gt;

&lt;p&gt;If a repository exists on exactly three replicas, then a successful write
on two replicas constitutes both a durable set, and a majority.  If a
repository has four or five replicas, then three are required for a
majority.&lt;/p&gt;
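
&lt;p&gt;That quorum arithmetic fits in one line: a strict majority of n replicas is the integer division n / 2, plus one.&lt;/p&gt;

```ruby
# Strict majority of a replica set: more than half must agree.
def majority(replica_count)
  replica_count / 2 + 1
end

p majority(3) # 2
p majority(4) # 3
p majority(5) # 3
```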

&lt;p&gt;In contrast, many other replication and consensus protocols have a single
primary copy at any moment.  The order that writes arrive at the primary
copy is the official order, and all other replicas must apply writes in
that order.  The primary is generally designated manually, or
automatically using an election protocol.  Spokes simply skips that step
and treats every write as an election—selecting a winning order and
outcome directly, rather than a winning server that dictates the write
order.&lt;/p&gt;

&lt;p&gt;Any write in Spokes that can’t be applied identically at a majority of
replicas gets reverted from any replica where it was applied.  In essence,
every write operation goes through a voting protocol, and any replicas on
the losing side of the vote are marked as unhealthy—unavailable for
reads or writes—until they can be repaired.  Repairs are automatic and
quick.  Because a majority agreed either to accept or to roll back the
update, there are still at least two replicas available to continue
accepting both reads and writes while the unhealthy replica is
repaired.&lt;/p&gt;
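
&lt;p&gt;A per-write vote can be sketched as tallying the result each replica computed: if a strict majority agree, the write commits and the disagreeing replicas are marked unhealthy; otherwise it rolls back everywhere. The checksums below stand in for whatever result comparison the real protocol performs.&lt;/p&gt;

```ruby
# Tally the post-write results reported by each replica and decide
# the outcome. Returns the decision and the replicas needing repair.
def vote(results)
  tally = Hash.new(0)
  results.each_value { |checksum| tally[checksum] += 1 }
  winner, count = tally.max_by { |_, c| c }
  majority = results.size / 2 + 1
  # No quorum: the write is reverted wherever it was applied.
  return [:rolled_back, []] if [count, majority].min != majority
  losers = results.select { |_, checksum| checksum != winner }.keys
  [:committed, losers]
end

p vote({ fs1: 'abc123', fs2: 'abc123', fs3: 'ffffff' })
# [:committed, [:fs3]] -- fs3 is repaired while fs1 and fs2 serve traffic
```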

&lt;p&gt;To be clear, disagreements and repairs are exceptional cases.  GitHub
accepts many millions of repository writes each day.  On a typical day, a
few dozen writes will result in non-unanimous votes, generally because one
replica was particularly busy, the connection to it timed out, and the
other replicas voted to move on without it.  The lagging replica almost
always recovers within a minute or two, and there is no user-visible
impact on the repository’s availability.&lt;/p&gt;

&lt;p&gt;Rarer still are whole-disk and whole-server failures, but they do happen.
When we have to remove an entire server, there are suddenly hundreds of
thousands of repositories with only two copies, instead of three.  This,
too, is a repairable condition.  Spokes checks periodically to see if
every repository has the desired number of replicas; if not, more replicas
are created.  New replicas can be created anywhere, and they can be copied
from wherever the surviving two copies of each repository are.  Hence,
repairs after a server failure are N-to-N.  The larger the file server cluster, the faster it can
recover from a single-node failure.&lt;/p&gt;
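
&lt;p&gt;The periodic repair check can be sketched as scanning for under-replicated repositories and choosing a source and a destination for each missing copy. The desired count of three matches the description above; the placement policy here is an invented simplification.&lt;/p&gt;

```ruby
DESIRED_COPIES = 3

# For each repository with fewer than the desired number of replicas,
# plan a copy from any survivor to any server not already holding one.
def repair_plan(repos, servers)
  plans = repos.map do |repo, replicas|
    next if [replicas.size, DESIRED_COPIES].min == DESIRED_COPIES
    source = replicas.first             # any surviving copy will do
    target = (servers - replicas).first # any server without a copy
    [repo, source, target]
  end
  plans.compact
end

repos = { shiny: [:fs1, :fs2], stable: [:fs1, :fs2, :fs3] }
p repair_plan(repos, [:fs1, :fs2, :fs3, :fs4])
# [[:shiny, :fs1, :fs3]] -- stable already has three copies
```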

&lt;h2 id=&quot;clean-shutdowns&quot;&gt;Clean shutdowns&lt;/h2&gt;

&lt;p&gt;As described above, Spokes can deal quickly and transparently with a
server going offline or even failing permanently.  So, can we use that
for planned maintenance, when we need to reboot or retire a server?  Yes
and no.&lt;/p&gt;

&lt;p&gt;Strictly speaking, we can reboot a server with &lt;code class=&quot;highlighter-rouge&quot;&gt;sudo reboot&lt;/code&gt;, and we can
retire it just by unplugging it.  But there are subtle disadvantages to
doing so, so we have more careful mechanisms, reusing a lot of the same
logic that would respond to a crash or a failure.&lt;/p&gt;

&lt;p&gt;Simply rebooting a server does not affect future read and write
operations, which will be transparently directed to other replicas.  It
doesn’t affect in-progress write operations, either, as those are
happening on all replicas, and the other two replicas can easily vote to
proceed without the server we’re rebooting.  But a reboot does break
in-progress read operations.  Most of those reads—e.g., fetching a
README to display on a repository’s home page—are quick and will
complete while the server shuts down gracefully.  But some reads,
particularly clones of large repositories, take minutes or hours to
complete, depending on the speed of the end user’s network.
Breaking these is, well, rude.  They can be restarted on
another replica, but all progress up to that point would be lost.&lt;/p&gt;

&lt;p&gt;Hence, rebooting a server intentionally in Spokes begins with a quiescing
period.  While a server is quiescing, it is marked as offline for the
purposes of new read operations, but existing read operations, including
clones, are allowed to finish.  Quiescing can take anywhere from a few
seconds to many hours, depending on which read operations are active on
the server that is getting rebooted.&lt;/p&gt;
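&lt;p&gt;A toy model of that quiescing wait is sketched below; a counter file stands in for the server’s in-flight read count, whereas the real system tracks live read RPCs:&lt;/p&gt;

```shell
#!/bin/sh
# Toy model of quiescing: stop accepting new reads, then wait for
# in-flight reads to drain before it is safe to reboot.
ACTIVE_READS=$(mktemp)
echo 3 > "$ACTIVE_READS"   # pretend three reads are in flight

quiesce() {
  echo "offline for new reads"
  while [ "$(cat "$ACTIVE_READS")" -gt 0 ]; do
    # A real loop would sleep and re-poll; here we simulate reads finishing.
    n=$(( $(cat "$ACTIVE_READS") - 1 ))
    echo "$n" > "$ACTIVE_READS"
  done
  echo "quiesced: safe to reboot"
}

quiesce
```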

&lt;p&gt;Perhaps surprisingly, write operations are sent to servers as usual, even
while they quiesce.  That’s because write operations run on all replicas,
so one replica can drop out at any time without user-visible impact.
Also, that replica would fall arbitrarily far behind if it didn’t receive
writes while quiescing, creating a lot of catch-up load when it is finally
brought fully back online.&lt;/p&gt;

&lt;p&gt;We don’t perform “&lt;a href=&quot;https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey&quot;&gt;chaos
monkey&lt;/a&gt;”
testing on the Spokes file servers, for
the same reasons we prefer to quiesce them before rebooting them: to avoid
interrupting long-running reads.  That is, we do not reboot them randomly
just to confirm that sudden, single-node failures are still (mostly)
harmless.&lt;/p&gt;

&lt;p&gt;Instead of “chaos monkey” testing, we perform rolling reboots as needed,
which accomplish roughly the same testing goals.  When we need to make
some change that requires a reboot—e.g., changing kernel or filesystem
parameters, or changing BIOS settings—we quiesce and reboot each
server.  Racks serve as availability zones&lt;sup&gt;&lt;a href=&quot;#footnote-1&quot;&gt;[1]&lt;/a&gt;&lt;/sup&gt;,
so we quiesce entire racks at
a time.  As servers in a given rack finish quiescing—i.e., complete all
outstanding read operations—we reboot up to five of them at a time.
When a whole rack is finished, we move on to the next rack.&lt;/p&gt;

&lt;p&gt;Below is a graph showing RPC operations failed over during a rolling
reboot.  Each server gets a different color.  Values are stacked, so the
tallest spike shows a moment where eight servers were rebooting at once.
The large block of light red shows where one server did not reboot cleanly
and was offline for over two hours.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;Rolling reboot&quot; src=&quot;/images/building-resilience-in-spokes/rolling-reboots.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Retiring a server by simply unplugging it has the same disadvantages as
unplanned reboots, and more.  In addition to disrupting any in-progress
read operations, it creates several hours of additional risk for all the
repositories that used to be hosted on the server.  When a server
disappears suddenly, all of the repositories formerly on it are now down
to two copies.  Two copies are enough to perform any read or write
operation, but two copies aren’t enough to tolerate an additional failure.
In other words, removing a server without warning increases the
probability of rejecting write operations later that same day.  We’re in
the business of keeping that probability to a minimum.&lt;/p&gt;

&lt;p&gt;So instead, we prepare a server for retirement by removing it from the
count of active replicas for any repository.  Spokes can still
use that server for both read and write operations.  But when it asks if
all repositories have enough replicas, suddenly some of them—the ones
on the retiring server—will say no, and more replicas will be created.
These repairs proceed exactly as if the server had just disappeared,
except that now the server remains available in case some other server
fails.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Availability is important, and durability is more important still.
Availability is a measure of what fraction of the time a service responds
to requests.  Durability is a measure of what fraction of committed data a
service can faithfully store.&lt;/p&gt;

&lt;p&gt;Spokes keeps at least three replicas of every repository, to provide both
availability and durability.  Three replicas means that one server can
fail with no user-visible effect.  If two servers fail, Spokes can provide
full access for most repositories and read-only access to repositories that had
two of their replicas on the two failing servers.&lt;/p&gt;

&lt;p&gt;Spokes does not accept writes to a repository unless a majority of
replicas—and always at least two—can commit the write and produce
the same resulting repository state.  That requirement provides
consistency by ensuring the same write ordering on all replicas.  It
also provides durability in the face of single-server failures by
storing every committed write in at least two places.&lt;/p&gt;
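&lt;p&gt;The quorum rule can be sketched as follows. This is an illustrative model, not Spokes’ actual code: each argument stands for the resulting repository state reported by one replica, and a write is accepted only when a majority of replicas, and always at least two, agree:&lt;/p&gt;

```shell
#!/bin/sh
# Illustrative model of the write-quorum rule: accept a write only if
# a majority of replicas (and at least two) report the same state.
accept_write() {
  total=$#
  # Size of the largest group of replicas agreeing on one state.
  best=$(printf '%s\n' "$@" | sort | uniq -c | sort -rn | head -1 | awk '{print $1}')
  if [ "$best" -ge 2 ]; then
    if [ $(( best * 2 )) -gt "$total" ]; then echo accept; return; fi
  fi
  echo reject
}

accept_write abc abc abc   # unanimous: accept
accept_write abc abc def   # 2-of-3 majority: accept
accept_write abc def ghi   # no majority: reject
```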

&lt;p&gt;Spokes has a failure detector, based on monitoring live application
traffic, that determines when a server is
offline and routes around the problem.  Finally, Spokes has automated repairs
for recovering quickly when a disk or server fails permanently.&lt;/p&gt;

&lt;hr /&gt;

&lt;div style=&quot;font-size: 0.8em; font-style: italic;&quot;&gt;
&lt;p&gt;
&lt;a name=&quot;footnote-1&quot;&gt;1.&lt;/a&gt; Treating racks as availability zones means we
place repository replicas so that no repository has two replicas within
the same rack.  Hence, we can lose an entire rack of servers and not
affect the availability or durability of any of the repositories hosted on
them.  We chose racks as availability zones because several important
failure modes, especially related to power and networking, can affect
entire racks of servers at a time.
&lt;/p&gt;
&lt;/div&gt;</content><author><name>Patrick Reynolds</name></author><summary type="html">Spokes is the replication system for the file
servers where we store over 38 million Git repositories and over 36 million gists.  It
keeps at least three copies of every repository and every gist so that we
can provide durable, highly available access to content even when servers and networks fail.  Spokes
uses a combination of Git and rsync to replicate, repair, and rebalance
repositories.</summary></entry><entry><title type="html">Context aware MySQL pools via HAProxy</title><link href="http://githubengineering.com/context-aware-mysql-pools-via-haproxy/" rel="alternate" type="text/html" title="Context aware MySQL pools via HAProxy" /><published>2016-08-17T00:00:00+00:00</published><updated>2016-08-17T00:00:00+00:00</updated><id>http://githubengineering.com/context-aware-mysql-pools-via-haproxy</id><content type="html" xml:base="http://githubengineering.com/context-aware-mysql-pools-via-haproxy/">&lt;p&gt;At GitHub we use MySQL as our main datastore. While repository data lies in &lt;code class=&quot;highlighter-rouge&quot;&gt;git&lt;/code&gt;, metadata is stored in MySQL. This includes Issues, Pull Requests, Comments etc. We also auth against MySQL via a custom git proxy (&lt;a href=&quot;http://githubengineering.com/benchmarking-github-enterprise/&quot;&gt;babeld&lt;/a&gt;). To be able to serve under the high load GitHub operates at, we use MySQL replication to scale out read load.&lt;/p&gt;

&lt;p&gt;We have different clusters that provide different types of services, but the single-writer-multiple-readers design applies to them all. Depending on traffic growth, application demand, operational tasks, or other constraints, we take replicas in or out of our pools. Depending on workload, some replicas may lag more than others.&lt;/p&gt;

&lt;p&gt;Displaying up-to-date data is important. We have tooling that helps us ensure we keep replication lag at a minimum, and typically it doesn’t exceed &lt;code class=&quot;highlighter-rouge&quot;&gt;1&lt;/code&gt; second. However sometimes lags do happen, and when they do, we want to put aside those lagging replicas, let them catch their breath, and avoid sending traffic their way until they are caught up.&lt;/p&gt;

&lt;p&gt;We set out to create a self-managing topology that will exclude lagging replicas automatically, handle disasters gracefully, and yet allow for complete human control and visibility.&lt;/p&gt;

&lt;h3 id=&quot;haproxy-for-load-balancing-replicas&quot;&gt;HAProxy for load balancing replicas&lt;/h3&gt;

&lt;p&gt;We use HAProxy for various tasks at GitHub. Among others, we use it to load balance our MySQL replicas. Our applications connect to HAProxy servers at &lt;code class=&quot;highlighter-rouge&quot;&gt;:3306&lt;/code&gt; and are routed to replicas that can serve read requests. Exactly what makes a replica able to “serve read requests” is the topic of this post.&lt;/p&gt;

&lt;p&gt;MySQL load balancing via HAProxy is commonly used, but we wanted to tackle a few operational and availability concerns:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Can we automate exclusion and inclusion of backend servers based on replication status?&lt;/li&gt;
  &lt;li&gt;Can we automate exclusion and inclusion of backend servers based on server role?&lt;/li&gt;
  &lt;li&gt;How can we react to a scenario where too many servers are excluded, and we are only left with one or two “good” replicas?&lt;/li&gt;
  &lt;li&gt;Can we &lt;em&gt;always serve&lt;/em&gt;?&lt;/li&gt;
  &lt;li&gt;How easy would it be to override pool membership manually?&lt;/li&gt;
  &lt;li&gt;Will our solution survive a &lt;code class=&quot;highlighter-rouge&quot;&gt;service haproxy reload/restart&lt;/code&gt;?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these criteria in mind, the standard &lt;code class=&quot;highlighter-rouge&quot;&gt;mysql-check&lt;/code&gt; commonly used in HAProxy-MySQL load balancing will not suffice.&lt;/p&gt;

&lt;p&gt;This simple check merely tests whether a MySQL server is live; it gains no insight into the server’s replication state (lagging or broken) or its operational state (maintenance, ETL, backup jobs, etc.).&lt;/p&gt;

&lt;p&gt;Instead, we make our HAProxy pools context aware. We let the backend MySQL hosts make an informed decision: “should I be included in a pool or should I not?”&lt;/p&gt;

&lt;h3 id=&quot;context-aware-mysql-pools&quot;&gt;Context aware MySQL pools&lt;/h3&gt;

&lt;p&gt;In its very simplistic form, context awareness begins with asking the MySQL backend replica: “are you lagging?” We will reach far beyond that, but let’s begin by describing this commonly used setup.&lt;/p&gt;

&lt;p&gt;In this situation, HAProxy no longer uses a &lt;code class=&quot;highlighter-rouge&quot;&gt;mysql-check&lt;/code&gt; but rather an &lt;code class=&quot;highlighter-rouge&quot;&gt;http-check&lt;/code&gt;. The MySQL backend server provides an &lt;code class=&quot;highlighter-rouge&quot;&gt;HTTP&lt;/code&gt; interface, responding with &lt;code class=&quot;highlighter-rouge&quot;&gt;HTTP 200&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;HTTP 503&lt;/code&gt; depending on replication lag. HAProxy will interpret these as “good” (&lt;code class=&quot;highlighter-rouge&quot;&gt;UP&lt;/code&gt;) or “bad” (&lt;code class=&quot;highlighter-rouge&quot;&gt;DOWN&lt;/code&gt;), respectively. On the HAProxy side, it looks like this:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;backend mysql_ro_main
  option httpchk GET /
  balance roundrobin
  retries 1
  timeout connect 1000
  timeout check 300
  timeout server 86400000

  default-server port 9876 fall 2 inter 5000 rise 1 downinter 5000 on-marked-down shutdown-sessions weight 10
  server my-db-0001 my-db-0001.heliumcarbon.com:3306 check
  server my-db-0002 my-db-0002.heliumcarbon.com:3306 check
  server my-db-0003 my-db-0003.heliumcarbon.com:3306 check
  server my-db-0004 my-db-0004.heliumcarbon.com:3306 check
  server my-db-0005 my-db-0005.heliumcarbon.com:3306 check
  server my-db-0006 my-db-0006.heliumcarbon.com:3306 check
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The backend servers need to provide an HTTP service on &lt;code class=&quot;highlighter-rouge&quot;&gt;:9876&lt;/code&gt;. That service would connect to MySQL, check for replication lag, and return with &lt;code class=&quot;highlighter-rouge&quot;&gt;200&lt;/code&gt; (say, &lt;code class=&quot;highlighter-rouge&quot;&gt;lag &amp;lt;= 5s&lt;/code&gt;) or &lt;code class=&quot;highlighter-rouge&quot;&gt;503&lt;/code&gt; (&lt;code class=&quot;highlighter-rouge&quot;&gt;lag &amp;gt; 5s&lt;/code&gt; or replication is broken).&lt;/p&gt;
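&lt;p&gt;A minimal sketch of such a lag-to-status mapping follows; the threshold and responses here are illustrative, and GitHub’s actual scripts are linked later in this post:&lt;/p&gt;

```shell
#!/bin/sh
# Minimal sketch of the lag check: map replication lag (in seconds)
# to the HTTP status line HAProxy will see. A real service would obtain
# the lag from MySQL, e.g. from a heartbeat table.
lag_to_http() {
  lag="$1"; threshold="${2:-5}"
  if [ -z "$lag" ]; then
    echo "HTTP/1.1 503 Service Unavailable"   # replication broken or dead
  elif [ "$lag" -le "$threshold" ]; then
    echo "HTTP/1.1 200 OK"                    # healthy: lag within threshold
  else
    echo "HTTP/1.1 503 Service Unavailable"   # lagging: shed read traffic
  fi
}

lag_to_http 2    # healthy replica
lag_to_http 42   # lagging replica
```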

&lt;h3 id=&quot;some-reflections&quot;&gt;Some reflections&lt;/h3&gt;

&lt;p&gt;This commonly used setup automatically excludes or includes backend servers based on replication status. If the server is lagging, the specialized HTTP service will report &lt;code class=&quot;highlighter-rouge&quot;&gt;503&lt;/code&gt;, which HAProxy will interpret as &lt;code class=&quot;highlighter-rouge&quot;&gt;DOWN&lt;/code&gt;, and the server will not serve traffic until it recovers.&lt;/p&gt;

&lt;p&gt;But, what happens when two, three, or four replicas are lagging? We are left with less and less serving capacity. The remaining replicas are receiving two or three times more traffic than they’re used to receiving. If this happens, the replicas might succumb to the load and lag as well, and the solution above might not be able to handle an entire fleet of lagging replicas.&lt;/p&gt;

&lt;p&gt;What’s more, some of our replicas have special roles. Each cluster has a node running continuous logical or physical backups. For example, other nodes might be serving a purely analytical workload or be partially weighted to verify a newer MySQL version.&lt;/p&gt;

&lt;p&gt;In the past, we would update the HAProxy config file with the list of servers as they came and went. As we grew in volume and in number of servers this became an operational overhead. We’d rather take a more dynamic approach that provides increased flexibility.&lt;/p&gt;

&lt;p&gt;We may perform a MySQL master failover. This may be a planned operation (e.g. upgrading to the latest release) or an unplanned one (e.g. automated failover on hardware failure). The new master must be excluded from the read-pool. The old master, if available, may now serve reads. Again, we wish to avoid updating HAProxy’s configuration with these changes.&lt;/p&gt;

&lt;h3 id=&quot;static-haproxy-configuration-dynamic-decisions&quot;&gt;Static HAProxy configuration, dynamic decisions&lt;/h3&gt;

&lt;p&gt;In our current setup the HAProxy configuration does not regularly change. It may change now and then as we introduce new hardware, but otherwise it is static, and HAProxy reacts to ongoing instructions from the backend servers telling it:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;I’m good to participate in a pool (&lt;code class=&quot;highlighter-rouge&quot;&gt;HTTP 200&lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;I’m in bad state; don’t send traffic my way (&lt;code class=&quot;highlighter-rouge&quot;&gt;HTTP 503&lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;I’m in maintenance mode. No error on my side, but don’t send traffic my way (&lt;code class=&quot;highlighter-rouge&quot;&gt;HTTP 404&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
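&lt;p&gt;Schematically, and assuming &lt;code class=&quot;highlighter-rouge&quot;&gt;http-check disable-on-404&lt;/code&gt; is configured (as it is in the backends shown later in this post), HAProxy maps these responses to server states like so:&lt;/p&gt;

```shell
#!/bin/sh
# How HAProxy interprets the three check responses, given
# "http-check disable-on-404" (illustrative mapping, not HAProxy code).
haproxy_state_for() {
  case "$1" in
    200) echo UP   ;;   # healthy: serve read traffic
    404) echo NOLB ;;   # maintenance: no new traffic, not an error
    *)   echo DOWN ;;   # failed check: route around the server
  esac
}

haproxy_state_for 200
haproxy_state_for 404
haproxy_state_for 503
```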

&lt;p&gt;The HAProxy config file lists each and every known server. The list includes the backup server, the analytics server, and even the master. The backend servers themselves tell HAProxy whether they wish to participate in taking read traffic or not.&lt;/p&gt;

&lt;p&gt;Before showing you how to implement this, let’s consider availability.&lt;/p&gt;

&lt;h3 id=&quot;graceful-failover-of-pools&quot;&gt;Graceful failover of pools&lt;/h3&gt;

&lt;p&gt;HAProxy supports multiple backend pools per frontend, and provides Access Control Lists (&lt;a href=&quot;https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#7&quot;&gt;ACLs&lt;/a&gt;). ACLs often act on incoming connection data (headers, cookies, etc.) but can also observe backend status.&lt;/p&gt;

&lt;p&gt;The scheme is to define two (or more) backend pools:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The first (“main”/”normal”) pool consists of replicas with acceptable lag that are able to serve traffic, as above&lt;/li&gt;
  &lt;li&gt;The second (“backup”) pool consists of valid replicas which are allowed to be lagging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use an &lt;code class=&quot;highlighter-rouge&quot;&gt;acl&lt;/code&gt; that observes the number of available servers in our &lt;code class=&quot;highlighter-rouge&quot;&gt;main&lt;/code&gt; backend. We then set a rule to use the &lt;code class=&quot;highlighter-rouge&quot;&gt;backup&lt;/code&gt; pool if that &lt;code class=&quot;highlighter-rouge&quot;&gt;acl&lt;/code&gt; applies:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;frontend mysql_ro
  ...
  acl mysql_not_enough_capacity nbsrv(mysql_ro_main) lt 3
  use_backend mysql_ro_backup if mysql_not_enough_capacity
  default_backend mysql_ro_main
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;See &lt;a href=&quot;https://github.com/github/mysql-haproxy-xinetd/blob/master/haproxy/haproxy-sample.cfg#L4-L13&quot;&gt;code sample&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above we choose to switch to the &lt;code class=&quot;highlighter-rouge&quot;&gt;mysql_ro_backup&lt;/code&gt; pool when left with less than three active hosts in our &lt;code class=&quot;highlighter-rouge&quot;&gt;mysql_ro_main&lt;/code&gt; pool. We’d rather serve stale data than stop serving altogether. Of course, by this time our alerting system will have alerted us to the situation and we will already be looking into the source of the problem.&lt;/p&gt;

&lt;p&gt;Remember that it’s not HAProxy that makes the decision “who’s in and who’s out” but the backend server itself. To that effect, HAProxy sends a check &lt;em&gt;hint&lt;/em&gt; to the server. We choose to send the hint in the form of a URI, as this makes for a readable, clear code:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;backend mysql_ro_main
  option httpchk GET /check-lag
  http-check disable-on-404
  balance roundrobin
  retries 1
  timeout connect 1000
  timeout check 300
  timeout server 86400000

  default-server port 9876 fall 2 inter 5000 rise 1 downinter 5000 on-marked-down shutdown-sessions weight 10
  server my-db-0001 my-db-0001.heliumcarbon.com:3306 check
  server my-db-0002 my-db-0002.heliumcarbon.com:3306 check
  server my-db-0003 my-db-0003.heliumcarbon.com:3306 check
  server my-db-0004 my-db-0004.heliumcarbon.com:3306 check
  server my-db-0005 my-db-0005.heliumcarbon.com:3306 check
  server my-db-0006 my-db-0006.heliumcarbon.com:3306 check

backend mysql_ro_backup
  option httpchk GET /ignore-lag
  http-check disable-on-404
  balance roundrobin
  retries 1
  timeout connect 1000
  timeout check 300
  timeout server 86400000

  default-server port 9876 fall 2 inter 10000 rise 1 downinter 10000 on-marked-down shutdown-sessions weight 10
  server my-db-0001 my-db-0001.heliumcarbon.com:3306 check
  server my-db-0002 my-db-0002.heliumcarbon.com:3306 check
  server my-db-0003 my-db-0003.heliumcarbon.com:3306 check
  server my-db-0004 my-db-0004.heliumcarbon.com:3306 check
  server my-db-0005 my-db-0005.heliumcarbon.com:3306 check
  server my-db-0006 my-db-0006.heliumcarbon.com:3306 check
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;See &lt;a href=&quot;https://github.com/github/mysql-haproxy-xinetd/blob/master/haproxy/haproxy-sample.cfg#L15-L47&quot;&gt;code sample&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both backend pools list the exact same servers. The major difference between the pools is the &lt;code class=&quot;highlighter-rouge&quot;&gt;check&lt;/code&gt; URI:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;option httpchk GET /check-lag
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;vs.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;option httpchk GET /ignore-lag
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;As the URIs suggest, the first, &lt;code class=&quot;highlighter-rouge&quot;&gt;main&lt;/code&gt; pool looks for backend servers that do not lag (and we also wish to exclude the master, the backup server, etc.). The &lt;code class=&quot;highlighter-rouge&quot;&gt;backup&lt;/code&gt; pool is happy to take servers that actually do lag, but it still wishes to exclude the master and other special servers.&lt;/p&gt;

&lt;p&gt;HAProxy’s behavior is to use the &lt;code class=&quot;highlighter-rouge&quot;&gt;main&lt;/code&gt; pool for as long as at least three replicas are happy to serve data. If two or fewer replicas are in good shape, HAProxy switches to the &lt;code class=&quot;highlighter-rouge&quot;&gt;backup&lt;/code&gt; pool, where we re-introduce the lagging replicas: serving staler data, but still serving.&lt;/p&gt;

&lt;p&gt;Also noteworthy in the above is &lt;code class=&quot;highlighter-rouge&quot;&gt;http-check disable-on-404&lt;/code&gt;, which puts a &lt;code class=&quot;highlighter-rouge&quot;&gt;HTTP 404&lt;/code&gt; server in a &lt;code class=&quot;highlighter-rouge&quot;&gt;NOLB&lt;/code&gt; state. We will discuss this in more detail soon.&lt;/p&gt;

&lt;h3 id=&quot;implementing-the-http-service&quot;&gt;Implementing the HTTP service&lt;/h3&gt;

&lt;p&gt;Any &lt;code class=&quot;highlighter-rouge&quot;&gt;HTTP&lt;/code&gt; service implementation will do. At GitHub, we commonly use &lt;code class=&quot;highlighter-rouge&quot;&gt;shell&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;Ruby&lt;/code&gt; scripts that integrate well with our &lt;a href=&quot;https://hubot.github.com/&quot;&gt;ChatOps&lt;/a&gt;. We have many reliable &lt;code class=&quot;highlighter-rouge&quot;&gt;shell&lt;/code&gt; building blocks, and our current solution is a &lt;code class=&quot;highlighter-rouge&quot;&gt;shell&lt;/code&gt;-oriented service, in the form of &lt;code class=&quot;highlighter-rouge&quot;&gt;xinetd&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;xinetd&lt;/code&gt; makes it easy to “speak HTTP” via &lt;code class=&quot;highlighter-rouge&quot;&gt;shell&lt;/code&gt;. A simplified setup looks like this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/github/mysql-haproxy-xinetd/blob/master/xinetd/xinetd.conf&quot;&gt;xinetd config&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/github/mysql-haproxy-xinetd/blob/master/xinetd/mysqlchk_general&quot;&gt;mysqlcheck_general&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/github/mysql-haproxy-xinetd/blob/master/puppet/xinetd.pp&quot;&gt;puppet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the above, we’re particularly interested in the fact that &lt;code class=&quot;highlighter-rouge&quot;&gt;xinetd&lt;/code&gt; serves on &lt;code class=&quot;highlighter-rouge&quot;&gt;:9876&lt;/code&gt; and calls upon &lt;code class=&quot;highlighter-rouge&quot;&gt;/path/to/scripts/xinetd-mysql&lt;/code&gt; to respond to HAProxy’s &lt;code class=&quot;highlighter-rouge&quot;&gt;check&lt;/code&gt; requests.&lt;/p&gt;

&lt;h3 id=&quot;implementing-the-check-script&quot;&gt;Implementing the check script&lt;/h3&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/github/mysql-haproxy-xinetd/blob/master/scripts/xinetd-mysql&quot;&gt;xinetd-mysql&lt;/a&gt; script routes the request to an appropriate handler. Recall that we asked HAProxy to &lt;em&gt;hint&lt;/em&gt; per &lt;code class=&quot;highlighter-rouge&quot;&gt;check&lt;/code&gt;. The hint URI, such as &lt;code class=&quot;highlighter-rouge&quot;&gt;/check-lag&lt;/code&gt;, is intercepted by &lt;code class=&quot;highlighter-rouge&quot;&gt;xinetd-mysql&lt;/code&gt; which further invokes a dedicated handler for this check. Thus, we have different handlers for &lt;code class=&quot;highlighter-rouge&quot;&gt;/check-lag&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;/ignore-lag&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;/ignore-lag-and-yes-please-allow-backup-servers-as-well&lt;/code&gt; etc.&lt;/p&gt;

&lt;p&gt;The real magic happens when running &lt;a href=&quot;https://github.com/github/mysql-haproxy-xinetd/blob/master/scripts/xinetd-mysql-check-lag&quot;&gt;this handler script&lt;/a&gt;. This is where the server makes the decision: “Should I be included in the read-pool or not?” The script bases its decision on the following factors:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Did a human suggest that this server be explicitly included/excluded?
This is just a matter of &lt;a href=&quot;https://github.com/github/mysql-haproxy-xinetd/blob/master/scripts/xinetd-mysql-check-lag#L9-L17&quot;&gt;touching a file&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Is this the &lt;em&gt;master&lt;/em&gt; server? A &lt;em&gt;backup&lt;/em&gt; server? Something else?
The server happens to know its own role via service discovery or even via &lt;code class=&quot;highlighter-rouge&quot;&gt;puppet&lt;/code&gt;. We &lt;a href=&quot;https://github.com/github/mysql-haproxy-xinetd/blob/master/scripts/xinetd-mysql-check-lag#L20-L26&quot;&gt;check for a hint file&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Is MySQL lagging? Is it alive at all?
This (finally) &lt;a href=&quot;https://github.com/github/mysql-haproxy-xinetd/blob/master/scripts/xinetd-mysql-check-lag#L30-L45&quot;&gt;executes a self-check&lt;/a&gt; on MySQL.
For lag we use a &lt;a href=&quot;https://www.percona.com/doc/percona-toolkit/pt-heartbeat.html&quot;&gt;heartbeat&lt;/a&gt; mechanism, but your mileage may vary.&lt;/li&gt;
&lt;/ul&gt;
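&lt;p&gt;Schematically, the handler’s decision chain looks like this. The file names, hint directory, and lag source are all illustrative; see the linked &lt;code class=&quot;highlighter-rouge&quot;&gt;xinetd-mysql-check-lag&lt;/code&gt; script for the real implementation:&lt;/p&gt;

```shell
#!/bin/sh
# Sketch of the /check-lag decision chain: human override first,
# then role hints, then the actual replication self-check.
HINT_DIR=$(mktemp -d)   # stand-in for the host's hint-file directory

check_lag_decision() {
  lag="$1"   # a real handler would query a MySQL heartbeat table here
  # 1. Human override: a touched file wins over everything else.
  if [ -f "$HINT_DIR/force-enable" ];  then echo "200 forced in";  return; fi
  if [ -f "$HINT_DIR/force-disable" ]; then echo "404 forced out"; return; fi
  # 2. Role hint: the master and special-role servers opt out quietly.
  for role in master backup analytics; do
    if [ -f "$HINT_DIR/role-$role" ]; then echo "404 $role"; return; fi
  done
  # 3. Self-check: is replication alive and within the lag threshold?
  if [ -n "$lag" ]; then
    if [ "$lag" -le 5 ]; then echo "200 lag=${lag}s"; return; fi
  fi
  echo "503 lagging or broken"
}

check_lag_decision 2          # healthy, no hint files present
touch "$HINT_DIR/role-backup"
check_lag_decision 2          # backup role: opted out with a 404
```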

&lt;p&gt;This &lt;code class=&quot;highlighter-rouge&quot;&gt;xinetd&lt;/code&gt;/&lt;code class=&quot;highlighter-rouge&quot;&gt;shell&lt;/code&gt; implementation means we do not use persistent MySQL connections; each &lt;code class=&quot;highlighter-rouge&quot;&gt;check&lt;/code&gt; generates a new connection on the backend server. While this seems wasteful, the rate of incoming check requests is low and negligible at the scale of our busy servers. Moreover, it strengthens our trust in the system: a hogged server may be able to serve existing connections but refuse new ones, and we’re happy to catch that scenario.&lt;/p&gt;

&lt;h3 id=&quot;or-error&quot;&gt;404 or error?&lt;/h3&gt;

&lt;p&gt;Servers that just don’t want to participate send a &lt;code class=&quot;highlighter-rouge&quot;&gt;404&lt;/code&gt;, causing them to go &lt;code class=&quot;highlighter-rouge&quot;&gt;NOLB&lt;/code&gt;. Lagging, broken or dead replicas send a &lt;code class=&quot;highlighter-rouge&quot;&gt;503&lt;/code&gt;. This makes it easier on our alerting system and makes it clearer when we have a problem.&lt;/p&gt;

&lt;p&gt;One outstanding issue is that HAProxy never transitions a server directly from &lt;code class=&quot;highlighter-rouge&quot;&gt;DOWN&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;NOLB&lt;/code&gt;; its state machine requires the server to first go &lt;code class=&quot;highlighter-rouge&quot;&gt;UP&lt;/code&gt;. This is not an integrity problem, but it does cause extra alerting. We work around it by cross-checking servers and refreshing when needed. This situation is rare for us and thus of no significant concern.&lt;/p&gt;

&lt;h3 id=&quot;operations&quot;&gt;Operations&lt;/h3&gt;

&lt;p&gt;This small-building-blocks design permits simple unit testing. Control and visibility are easily gained: disabling and enabling a server is a matter of creating a file, whether placed by a human or implied by server role.&lt;/p&gt;

&lt;p&gt;These scripts integrate well within our chatops. We are able to see the exact response HAProxy sees via simple chatops commands:&lt;/p&gt;

&lt;div class=&quot;chat&quot; style=&quot;margin: 30px 0;&quot;&gt;
  &lt;div class=&quot;message self&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22&quot; alt=&quot;shlomi-noach&quot; srcset=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22 1x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=44 2x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=66 3x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;shlomi-noach&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
      .mysql xinetd my-db-0004 /check-lag
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;message robot&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22&quot; alt=&quot;hubot&quot; srcset=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22 1x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=44 2x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=66 3x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;Hubot&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
      200 ; OK
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Or we can interfere and force backends in/out the pools:&lt;/p&gt;

&lt;div class=&quot;chat&quot; style=&quot;margin: 30px 0;&quot;&gt;
  &lt;div class=&quot;message self&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22&quot; alt=&quot;shlomi-noach&quot; srcset=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22 1x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=44 2x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=66 3x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;shlomi-noach&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
      .mysqlproxy host force-disable my-db-0004
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;message robot&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22&quot; alt=&quot;hubot&quot; srcset=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22 1x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=44 2x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=66 3x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;Hubot&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
      Host my-db-0004 disabled by @shlomi-noach at 2016-06-30 15:22:07
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;message self&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22&quot; alt=&quot;shlomi-noach&quot; srcset=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22 1x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=44 2x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=66 3x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;shlomi-noach&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
      .mysqlproxy host restore my-db-0004
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;message robot&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22&quot; alt=&quot;hubot&quot; srcset=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22 1x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=44 2x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=66 3x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;Hubot&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
      Host my-db-0004 restored to normal state
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;We have specialized monitoring for these HAProxy boxes, but we don’t wish to be notified when a single replica starts to lag. Rather, we’re interested in the bigger picture: a summary of the total errors found across the pools. This means there’s a difference between a half-empty &lt;code class=&quot;highlighter-rouge&quot;&gt;main&lt;/code&gt; pool and a completely empty one. In the event of problems, we get a single alert that summarizes the status across the cluster’s pools. As always, we can also check from chatops:&lt;/p&gt;

&lt;div class=&quot;chat&quot; style=&quot;margin: 30px 0;&quot;&gt;
  &lt;div class=&quot;message self&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22&quot; alt=&quot;shlomi-noach&quot; srcset=&quot;https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=22 1x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=44 2x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=66 3x, https://avatars3.githubusercontent.com/shlomi-noach?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;shlomi-noach&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
      .mysqlproxy sup myproxy-0001
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;message robot&quot;&gt;
    &lt;img class=&quot;avatar avatar-small&quot; src=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22&quot; alt=&quot;hubot&quot; srcset=&quot;https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=22 1x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=44 2x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=66 3x, https://avatars1.githubusercontent.com/hubot?v=3&amp;amp;s=88 4x&quot; width=&quot;22&quot; height=&quot;22&quot; data-proofer-ignore=&quot;true&quot; /&gt;
    &lt;b class=&quot;author&quot;&gt;Hubot&lt;/b&gt;
    &lt;div class=&quot;entry&quot;&gt;
&lt;pre&gt;&lt;code&gt;OK
mysql_ro_main OK
  3/10 servers are nolb in pool
mysql_ro_backup OK
  3/10 servers are nolb in pool
&lt;/code&gt;&lt;/pre&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;We’ve stripped our script and config files to decouple them from GitHub’s specific setup and flow. We’ve also &lt;a href=&quot;https://github.com/github/mysql-haproxy-xinetd&quot;&gt;open sourced them&lt;/a&gt; in the hope that you’ll find them useful, and that they’ll help you implement your own solution with context-aware MySQL replica pools.&lt;/p&gt;</content><author><name>{&quot;username&quot;=&gt;&quot;shlomi-noach&quot;, &quot;fullname&quot;=&gt;&quot;Shlomi Noach&quot;, &quot;twitter&quot;=&gt;&quot;ShlomiNoach&quot;, &quot;role&quot;=&gt;&quot;Senior Infrastructure Engineer&quot;, &quot;links&quot;=&gt;[{&quot;name&quot;=&gt;&quot;Website&quot;, &quot;url&quot;=&gt;&quot;http://openark.org&quot;}, {&quot;name&quot;=&gt;&quot;GitHub Profile&quot;, &quot;url&quot;=&gt;&quot;https://github.com/shlomi-noach&quot;}, {&quot;name&quot;=&gt;&quot;Twitter Profile&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/ShlomiNoach&quot;}]}</name></author><summary type="html">At GitHub we use MySQL as our main datastore. While repository data lies in git, metadata is stored in MySQL. This includes Issues, Pull Requests, Comments etc. We also auth against MySQL via a custom git proxy (babeld). 
To be able to serve under the high load GitHub operates at, we use MySQL replication to scale out read load.</summary></entry><entry><title type="html">gh-ost: GitHub’s online schema migration tool for MySQL</title><link href="http://githubengineering.com/gh-ost-github-s-online-migration-tool-for-mysql/" rel="alternate" type="text/html" title="gh-ost: GitHub's online schema migration tool for MySQL" /><published>2016-08-01T00:00:00+00:00</published><updated>2016-08-01T00:00:00+00:00</updated><id>http://githubengineering.com/gh-ost-github-s-online-migration-tool-for-mysql</id><content type="html" xml:base="http://githubengineering.com/gh-ost-github-s-online-migration-tool-for-mysql/">&lt;p&gt;Today we are announcing the open source release of &lt;a href=&quot;http://github.com/github/gh-ost&quot;&gt;gh-ost&lt;/a&gt;: GitHub’s triggerless online schema migration tool for MySQL.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; has been developed at GitHub in recent months to answer a problem we faced with ongoing, continuous production changes requiring modifications to MySQL tables. &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; changes the existing online table migration paradigm by providing a low-impact, controllable, auditable, operations-friendly solution.&lt;/p&gt;

&lt;p&gt;MySQL table migration is a well known problem, and has been addressed by online schema change tools since 2009. Growing, fast-paced products often require changes to database structure. Adding, changing, or removing columns and indexes are blocking operations with the default MySQL behavior. We conduct such schema changes multiple times per day and wish to minimize user-facing impact.&lt;/p&gt;

&lt;p&gt;Before illustrating &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt;, let’s address the existing solutions and the reasoning for embarking on a new tool.&lt;/p&gt;

&lt;h3 id=&quot;online-schema-migrations-existing-landscape&quot;&gt;Online schema migrations, existing landscape&lt;/h3&gt;

&lt;p&gt;Today, online schema changes are made possible via these three main options:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Migrate the schema on a replica, clone/apply on other replicas, promote refactored replica as new master&lt;/li&gt;
  &lt;li&gt;Use MySQL’s Online DDL for InnoDB&lt;/li&gt;
  &lt;li&gt;Use a schema migration tool. Most common today are &lt;a href=&quot;https://www.percona.com/doc/percona-toolkit/2.2/pt-online-schema-change.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;pt-online-schema-change&lt;/code&gt;&lt;/a&gt; and &lt;a href=&quot;https://www.facebook.com/notes/mysql-at-facebook/online-schema-change-for-mysql/430801045932/&quot;&gt;Facebook’s OSC&lt;/a&gt;; also found are &lt;a href=&quot;https://github.com/soundcloud/lhm&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;LHM&lt;/code&gt;&lt;/a&gt; and the original &lt;a href=&quot;http://shlomi-noach.github.io/openarkkit/oak-online-alter-table.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;oak-online-alter-table&lt;/code&gt;&lt;/a&gt; tool.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other options include Rolling Schema Upgrade with Galera Cluster, and otherwise non-InnoDB storage engines. At GitHub we use the common master-replicas architecture and utilize the reliable InnoDB engine.&lt;/p&gt;

&lt;p&gt;Why have we decided to embark on a new solution rather than use any of the above? Each existing solution is limited in its own way, and what follows is a brief and generalized breakdown of some of their shortcomings. We will drill down more deeply into the shortcomings of trigger-based online schema change tools.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Replica migration makes for operational overhead: it requires a larger host count, longer delivery times and more complex management. Changes are applied explicitly on specific replicas or on sub-trees of the topology. Considerations such as hosts going down, hosts restored from an earlier backup, and newly provisioned hosts all require a strict tracking system for per-host changes. A change might require multiple iterations, hence more time. Promoting a replica to master incurs a brief outage. Multiple changes going on at once are more difficult to coordinate. We commonly deploy multiple schema changes per day and wish to be free of this management overhead, though we recognize this solution is in use elsewhere.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;MySQL’s Online DDL for InnoDB is only “online” on the server on which it is invoked. The replication stream serializes the &lt;code class=&quot;highlighter-rouge&quot;&gt;alter&lt;/code&gt;, which causes replication lag. An attempt to run it individually per-replica results in much of the management overhead mentioned above. The DDL is uninterruptible; killing it halfway results in a long rollback or in data dictionary corruption. It does not play “nice”; it cannot throttle or pause on high load. It is a commitment to an operation that may exhaust your resources.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;We’ve been using &lt;code class=&quot;highlighter-rouge&quot;&gt;pt-online-schema-change&lt;/code&gt; for years. However, as we grew in volume and traffic, we hit more and more problems, to the point of considering many migrations “risky operations”. Some migrations could only run during off-peak hours or over weekends; others would consistently cause MySQL outages.
All existing online-schema-change tools utilize MySQL &lt;code class=&quot;highlighter-rouge&quot;&gt;triggers&lt;/code&gt; to perform the migration, and therein lie a few problems.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;whats-wrong-with-trigger-based-migrations&quot;&gt;What’s wrong with trigger-based migrations?&lt;/h3&gt;

&lt;p&gt;All online-schema-change tools operate in similar manner: they create a &lt;em&gt;ghost&lt;/em&gt; table, in the likeness of your original table, migrate that table while empty, slowly and incrementally copy data from your original table to the &lt;em&gt;ghost&lt;/em&gt; table, meanwhile propagating ongoing changes (any &lt;code class=&quot;highlighter-rouge&quot;&gt;INSERT&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;DELETE&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; applied to your table) to the &lt;em&gt;ghost&lt;/em&gt; table. When the tool is satisfied the tables are in sync, it replaces your original table with the &lt;em&gt;ghost&lt;/em&gt; table.&lt;/p&gt;

&lt;p&gt;Tools like &lt;code class=&quot;highlighter-rouge&quot;&gt;pt-online-schema-change&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;LHM&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;oak-online-alter-table&lt;/code&gt; use a synchronous approach, where each change to your table translates immediately, within the same transaction space, to a mirrored change on the &lt;em&gt;ghost&lt;/em&gt; table. The Facebook tool uses an asynchronous approach of writing changes to a changelog table, then iterating over that table and applying the changes onto the &lt;em&gt;ghost&lt;/em&gt; table. All of these tools use triggers to identify the ongoing changes to your table.&lt;/p&gt;

&lt;p&gt;Triggers are stored routines which are invoked on a per-row basis upon &lt;code class=&quot;highlighter-rouge&quot;&gt;INSERT&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;DELETE&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; on a table. A trigger may contain a set of queries, and these queries run in the same transaction space as the query that manipulates the table. This makes for atomicity of both the original operation on the table and the trigger-invoked operations.&lt;/p&gt;
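&lt;p&gt;For illustration only, a synchronous trigger-based tool installs triggers of roughly the following shape; the table and column names here are hypothetical, and the exact statements vary by tool:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Mirror every write on `my_table` onto the ghost table `_my_table_new`.
-- Each trigger runs inside the same transaction as the original statement.
CREATE TRIGGER my_table_ins AFTER INSERT ON my_table FOR EACH ROW
  REPLACE INTO _my_table_new (id, data) VALUES (NEW.id, NEW.data);
CREATE TRIGGER my_table_upd AFTER UPDATE ON my_table FOR EACH ROW
  REPLACE INTO _my_table_new (id, data) VALUES (NEW.id, NEW.data);
CREATE TRIGGER my_table_del AFTER DELETE ON my_table FOR EACH ROW
  DELETE FROM _my_table_new WHERE id = OLD.id;
&lt;/code&gt;&lt;/pre&gt;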

&lt;p&gt;Trigger usage in general, and trigger-based migrations in particular, suffer from the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Triggers, being stored routines, are interpreted code. MySQL does not precompile them. Hooking onto your query’s transaction space, they add the overhead of a parser and interpreter to each query acting on your migrated table.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Locks: the triggers share the same transaction space as the original queries, and while those queries compete for locks on the table, the triggers independently compete for locks on another table. This is particularly acute with the synchronous approach. Lock contention is directly related to write concurrency on the master. We have experienced near-complete lockdowns in production, to the effect of rendering the table or the entire database inaccessible due to lock contention.
Another aspect of trigger locks is the metadata locks they require when created or destroyed. We’ve seen stalls of many seconds to a minute while attempting to remove triggers from a busy table at the end of a migration.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Non pausability: when load on the master turns high, you wish to throttle or suspend your pending migration. However a trigger-based solution cannot truly do so. While it may suspend the row-copy operation, it cannot suspend the triggers. Removal of the triggers results in data loss. Thus, the triggers must keep working throughout the migration. On busy servers, we have seen that even as the online operation throttles, the master is brought down by the load of the triggers.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Concurrent migrations: we or others may be interested in being able to run multiple concurrent migrations (on different tables). Given the above trigger overhead, we are not prepared to run multiple concurrent trigger-based migrations. We are unaware of anyone doing so in practice.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Testing: we might want to experiment with a migration, or evaluate its load. Trigger-based migrations can only simulate a migration on replicas via Statement Based Replication, and are far from representing a true master migration given that the workload on a replica is single-threaded (that is always the case on a per-table basis, regardless of any multi-threaded replication technology in use).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;gh-ost&quot;&gt;gh-ost&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; stands for GitHub’s Online Schema Transmogrifier/Transfigurator/Transformer/Thingy.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;gh-ost light logo&quot; src=&quot;/images/announcing-gh-ost/gh-ost-general-flow.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; is:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Triggerless&lt;/li&gt;
  &lt;li&gt;Lightweight&lt;/li&gt;
  &lt;li&gt;Pauseable&lt;/li&gt;
  &lt;li&gt;Dynamically controllable&lt;/li&gt;
  &lt;li&gt;Auditable&lt;/li&gt;
  &lt;li&gt;Testable&lt;/li&gt;
  &lt;li&gt;Trustable&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;triggerless&quot;&gt;Triggerless&lt;/h4&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; does not use triggers. It intercepts changes to table data by tailing the binary logs. It therefore works in an asynchronous approach, applying the changes to the &lt;em&gt;ghost&lt;/em&gt; table some time after they’ve been committed.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; expects binary logs in RBR (Row Based Replication) format; however that does not mean you cannot use it to migrate a master running with SBR (Statement Based Replication). In fact, we do just that. &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; is happy to read binary logs from a replica that translates SBR to RBR, and it is happy to reconfigure the replica to do that.&lt;/p&gt;

&lt;h4 id=&quot;lightweight&quot;&gt;Lightweight&lt;/h4&gt;

&lt;p&gt;By not using triggers, &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; decouples the migration workload from the general master workload. It is indifferent to the concurrency and contention of queries running on the migrated table. Changes applied by such queries are streamlined and serialized in the binary log, where &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; picks them up to apply to the &lt;em&gt;ghost&lt;/em&gt; table. In fact, &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; also serializes the row-copy writes along with the binary log event writes. Thus, the master only observes a single connection sequentially writing to the &lt;em&gt;ghost&lt;/em&gt; table. This is not very different from ETLs.&lt;/p&gt;

&lt;h4 id=&quot;pauseable&quot;&gt;Pauseable&lt;/h4&gt;

&lt;p&gt;Since all writes are controlled by &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt;, and since reading the binary logs is an asynchronous operation in the first place, &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; is able to suspend all writes to the master when throttling. Throttling implies no row-copy on the master &lt;em&gt;and&lt;/em&gt; no row updates. &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; does create an internal tracking table and keeps writing heartbeat events to that table even when throttled, in negligible volumes.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; takes throttling one step further and offers multiple controls over throttling:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Load: a familiar feature for users of &lt;code class=&quot;highlighter-rouge&quot;&gt;pt-online-schema-change&lt;/code&gt;, one may set thresholds on MySQL metrics, such as &lt;code class=&quot;highlighter-rouge&quot;&gt;Threads_running=30&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Replication lag: &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; has a built-in heartbeat mechanism which it utilizes to examine replication lag; you may specify control replicas, or &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; will implicitly use the replica you hook it to in the first place.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Query: you may provide a query that decides whether throttling should kick in. Consider &lt;code class=&quot;highlighter-rouge&quot;&gt;SELECT HOUR(NOW()) BETWEEN 8 and 17&lt;/code&gt;.&lt;/p&gt;

    &lt;p&gt;All the above metrics can be &lt;em&gt;dynamically changed&lt;/em&gt; even while the migration is executing.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Flag file: touch a file and &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; begins throttling. Remove the file and it resumes work.&lt;/li&gt;
  &lt;li&gt;User command: dynamically connect to &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; (see following) across the network and &lt;em&gt;instruct it&lt;/em&gt; to start throttling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;dynamically-controllable&quot;&gt;Dynamically controllable&lt;/h4&gt;

&lt;p&gt;With existing tools, when a migration generates high load, the DBA would reconfigure, say, a smaller &lt;code class=&quot;highlighter-rouge&quot;&gt;chunk-size&lt;/code&gt;, then terminate and re-run the migration from the start. We find this wasteful.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; listens for requests on a unix socket file and, if configured, on TCP. You may give &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; instructions even while a migration is running. For example, you may:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;echo throttle | socat - /tmp/gh-ost.sock&lt;/code&gt; to start throttling. Likewise you may &lt;code class=&quot;highlighter-rouge&quot;&gt;no-throttle&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Change execution parameters: &lt;code class=&quot;highlighter-rouge&quot;&gt;chunk-size=1500&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;max-lag-millis=2000&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;max-load=Threads_running=30&lt;/code&gt; are examples of instructions &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; accepts that change its behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;auditable&quot;&gt;Auditable&lt;/h4&gt;

&lt;p&gt;Likewise, the same interface can be used to ask &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; for its &lt;em&gt;status&lt;/em&gt;. &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; is happy to report current progress, major configuration parameters, the identity of the servers involved and more. As this information is accessible over the network, it gives great visibility into the ongoing operation that you would otherwise get only by using a shared screen or tailing log files.&lt;/p&gt;
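&lt;p&gt;For example, assuming the same socket file as in the throttling example above (the socket path is configurable), status can be queried with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;echo status | socat - /tmp/gh-ost.sock
&lt;/code&gt;&lt;/pre&gt;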

&lt;h4 id=&quot;testable&quot;&gt;Testable&lt;/h4&gt;

&lt;p&gt;Because the binary log content is decoupled from the master’s workload, applying a migration on a replica is more similar to a true master migration (though still not completely, and more work is on the roadmap).&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; comes with built-in support for testing via &lt;code class=&quot;highlighter-rouge&quot;&gt;--test-on-replica&lt;/code&gt;: it allows you to run a migration on a replica, such that at the end of the migration &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; would stop the replica, swap tables, reverse the swap, and leave you with both tables in place and in sync, replication stopped. This allows you to examine and compare the two tables at your leisure.&lt;/p&gt;

&lt;p&gt;This is how we test &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; in production at GitHub: we have multiple designated production replicas; they are not serving traffic but instead running a continuous, covering migration test on all tables. Each of our production tables, as small as empty and as large as many hundreds of GB, is migrated via a trivial statement that does not actually modify its structure (&lt;code class=&quot;highlighter-rouge&quot;&gt;engine=innodb&lt;/code&gt;). Each such migration ends with replication stopped. We take a complete checksum of the entire table data from both the original table and the &lt;em&gt;ghost&lt;/em&gt; table and expect them to be identical. We then resume replication and proceed to the next table. Every single one of our production tables is &lt;em&gt;known&lt;/em&gt; to have passed multiple successful migrations via &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt;, on replica.&lt;/p&gt;

&lt;h4 id=&quot;trustable&quot;&gt;Trustable&lt;/h4&gt;

&lt;p&gt;All the above, and more, are made to build trust in &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt;’s operation. After all, it is a new tool in a landscape that has used the same tools for years.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;We test &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; on replicas; we’ve completed thousands of successful migrations before trying it out on masters for the first time. So can you. Migrate your replicas, verify the data is intact. We want you to do that!&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;As you execute &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt;, and as you may suspect load on your master is increasing, go ahead and initiate throttling. Touch a file. &lt;code class=&quot;highlighter-rouge&quot;&gt;echo throttle&lt;/code&gt;. See how the load on your master is just back to normal. By just knowing you &lt;em&gt;can&lt;/em&gt; do that, you will gain a lot of peace of mind.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A migration begins and the ETA says it’s going to end at &lt;code class=&quot;highlighter-rouge&quot;&gt;2:00am&lt;/code&gt;? Are you concerned with the final cut-over, where the tables are swapped, and you want to stick around? You can instruct &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; to &lt;em&gt;postpone&lt;/em&gt; the cut-over using a flag file. &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; will complete the row-copy but will not flip the tables. Instead, it will keep applying ongoing changes, keeping the &lt;em&gt;ghost&lt;/em&gt; table in sync. As you come to the office the next day, remove the flag file or &lt;code class=&quot;highlighter-rouge&quot;&gt;echo unpostpone&lt;/code&gt; into &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt;, and the cut-over will be made. We don’t like our software to bind us into observing its behavior. It should instead liberate us to do things humans do.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Speaking of ETA, &lt;code class=&quot;highlighter-rouge&quot;&gt;--exact-rowcount&lt;/code&gt; will keep you smiling. Pay the initial price of a lengthy &lt;code class=&quot;highlighter-rouge&quot;&gt;SELECT COUNT(*)&lt;/code&gt; on your table. &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; will get an accurate estimate of the amount of work it needs to do. It will heuristically &lt;em&gt;update&lt;/em&gt; that estimation as migration proceeds. While ETA timing is always subject to change, progress percentage turns accurate. If, like us, you’ve been bitten by migrations stating &lt;code class=&quot;highlighter-rouge&quot;&gt;99%&lt;/code&gt; then stalling for an hour keeping you biting your fingernails, you’ll appreciate the change.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;gh-ost-operation-modes&quot;&gt;gh-ost operation modes&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; operates by connecting to potentially multiple servers, as well as connecting itself as a replica in order to stream binary log events directly from one of those servers. There are various operation modes, which depend on your setup, configuration, and where you want to run the migration.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;gh-ost operation modes&quot; src=&quot;/images/announcing-gh-ost/gh-ost-operation-modes.png&quot; /&gt;
&lt;/div&gt;

&lt;h5 id=&quot;a-connect-to-replica-migrate-on-master&quot;&gt;a. Connect to replica, migrate on master&lt;/h5&gt;

&lt;p&gt;This is the mode &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; expects by default. &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; will investigate the replica, crawl up to find the topology’s master, and connect to it as well. Migration will:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Read and write row-data on master&lt;/li&gt;
  &lt;li&gt;Read binary logs events on the replica, apply the changes onto the master&lt;/li&gt;
  &lt;li&gt;Investigate table format, columns &amp;amp; keys, count rows on the replica&lt;/li&gt;
  &lt;li&gt;Read internal changelog events (such as heartbeat) from the replica&lt;/li&gt;
  &lt;li&gt;Cut-over (switch tables) on the master&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your master works with SBR, this is the mode to work with. The replica must be configured with binary logs enabled (&lt;code class=&quot;highlighter-rouge&quot;&gt;log_bin&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;log_slave_updates&lt;/code&gt;) and should have &lt;code class=&quot;highlighter-rouge&quot;&gt;binlog_format=ROW&lt;/code&gt; (&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; can apply the latter for you).&lt;/p&gt;

&lt;p&gt;Even with RBR, however, we suggest this is the least master-intrusive operation mode.&lt;/p&gt;
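&lt;p&gt;A minimal invocation for this mode might look as follows; the host, schema, table and column names are hypothetical, and in practice you would add credentials and other flags. Omitting &lt;code class=&quot;highlighter-rouge&quot;&gt;--execute&lt;/code&gt; performs a dry run:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;gh-ost \
  --host=my-replica-host \
  --database=my_schema \
  --table=my_table \
  --alter=&quot;ADD COLUMN status INT UNSIGNED NOT NULL DEFAULT 0&quot; \
  --max-load=Threads_running=25 \
  --chunk-size=1000 \
  --verbose \
  --execute
&lt;/code&gt;&lt;/pre&gt;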

&lt;h5 id=&quot;b-connect-to-master&quot;&gt;b. Connect to master&lt;/h5&gt;

&lt;p&gt;If you don’t have replicas, or do not wish to use them, you are still able to operate directly on the master. &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; will do all operations directly on the master. You may still ask it to be considerate of replication lag.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Your master must produce binary logs in RBR format.&lt;/li&gt;
  &lt;li&gt;You must approve this mode via &lt;code class=&quot;highlighter-rouge&quot;&gt;--allow-on-master&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;c-migratetest-on-replica&quot;&gt;c. Migrate/test on replica&lt;/h5&gt;

&lt;p&gt;This will perform a migration on the replica. &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; will briefly connect to the master but will thereafter perform all operations on the replica without modifying anything on the master.
Throughout the operation, &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; will throttle such that the replica is up to date.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;--migrate-on-replica&lt;/code&gt; indicates to &lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; that it must migrate the table directly on the replica. It will perform the cut-over phase even while replication is running.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;--test-on-replica&lt;/code&gt; indicates the migration is for purpose of testing only. Before cut-over takes place, replication is stopped. Tables are swapped and then swapped back: your original table returns to its original place.
Both tables are left with replication stopped. You may examine the two and compare data.&lt;/li&gt;
&lt;/ul&gt;
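&lt;p&gt;A &lt;code class=&quot;highlighter-rouge&quot;&gt;--test-on-replica&lt;/code&gt; run can be sketched as follows; the replica host name is hypothetical, and the trivial &lt;code class=&quot;highlighter-rouge&quot;&gt;engine=innodb&lt;/code&gt; alter matches the continuous test flow described earlier:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;gh-ost \
  --host=my-test-replica \
  --database=my_schema \
  --table=my_table \
  --alter=&quot;engine=innodb&quot; \
  --test-on-replica \
  --execute
&lt;/code&gt;&lt;/pre&gt;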

&lt;h3 id=&quot;gh-ost-at-github&quot;&gt;gh-ost at GitHub&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; is now powering all of our production migrations. We’re running it daily, as engineering requests come, sometimes multiple times a day. With its auditing and control capabilities, we will be integrating it into our chatops. Our engineers will have clear insight into migration progress and will be able to control its behavior. Metrics and events are being collected and will provide clear visibility into migration operations in production.&lt;/p&gt;

&lt;h3 id=&quot;open-source&quot;&gt;Open source&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; is &lt;a href=&quot;https://github.com/github/gh-ost&quot;&gt;released&lt;/a&gt; with &lt;span class=&quot;octicon octicon-heart&quot;&gt;&lt;/span&gt; to the open source community &lt;a href=&quot;https://github.com/github/gh-ost/blob/master/LICENSE&quot;&gt;under the &lt;code class=&quot;highlighter-rouge&quot;&gt;MIT&lt;/code&gt; license&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While we find it to be stable, we have improvements we want to make. We release it at this time as we wish to welcome community participation and contributions. From time to time we may publish suggestions for community contributions.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; is actively maintained. We encourage you to try it out and test it; we’ve made great efforts to make it trustworthy.&lt;/p&gt;

&lt;h3 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;gh-ost&lt;/code&gt; is designed, developed, reviewed and tested by the database infrastructure engineering team at GitHub:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/jonahberquist&quot;&gt;@jonahberquist&lt;/a&gt;, &lt;a href=&quot;https://github.com/ggunson&quot;&gt;@ggunson&lt;/a&gt;, &lt;a href=&quot;https://github.com/tomkrouper&quot;&gt;@tomkrouper&lt;/a&gt;, &lt;a href=&quot;https://github.com/shlomi-noach&quot;&gt;@shlomi-noach&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We would like to acknowledge the engineers at GitHub who have provided valuable information and advice. Thank you to our friends from the MySQL community who have reviewed and commented on this project during its pre-production stages.&lt;/p&gt;</content><author><name>{&quot;username&quot;=&gt;&quot;shlomi-noach&quot;, &quot;fullname&quot;=&gt;&quot;Shlomi Noach&quot;, &quot;twitter&quot;=&gt;&quot;ShlomiNoach&quot;, &quot;role&quot;=&gt;&quot;Senior Infrastructure Engineer&quot;, &quot;links&quot;=&gt;[{&quot;name&quot;=&gt;&quot;Website&quot;, &quot;url&quot;=&gt;&quot;http://openark.org&quot;}, {&quot;name&quot;=&gt;&quot;GitHub Profile&quot;, &quot;url&quot;=&gt;&quot;https://github.com/shlomi-noach&quot;}, {&quot;name&quot;=&gt;&quot;Twitter Profile&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/ShlomiNoach&quot;}]}</name></author><summary type="html">Today we are announcing the open source release of gh-ost: GitHub’s triggerless online schema migration tool for MySQL.</summary></entry><entry><title type="html">SYN Flood Mitigation with synsanity</title><link href="http://githubengineering.com/syn-flood-mitigation-with-synsanity/" rel="alternate" type="text/html" title="SYN Flood Mitigation with synsanity" /><published>2016-07-12T00:00:00+00:00</published><updated>2016-07-12T00:00:00+00:00</updated><id>http://githubengineering.com/syn-flood-mitigation-with-synsanity</id><content type="html" xml:base="http://githubengineering.com/syn-flood-mitigation-with-synsanity/">&lt;p&gt;GitHub hosts a wide range of user content, and like all large websites this often causes us to become a target of denial of service attacks. Around a year ago, GitHub was on the receiving end of a large, unusual and very well publicised attack involving both application level and volumetric attacks against our infrastructure.&lt;/p&gt;

&lt;p&gt;Our users rely on us to be highly available, and we take this seriously. Although the attackers are doing the wrong thing, there’s no use blaming them for their attacks being successful. Our commitment is to own our own availability: we have a responsibility to mitigate these sorts of attacks to the maximum extent technically possible.&lt;/p&gt;

&lt;p&gt;In an effort to reduce the impact of these attacks, we began work on a series of additional mitigation strategies and systems to better prepare us for a future attack of a similar nature. Today we’re sharing our mitigation for one of the attacks we received: synsanity, a SYN flood DDoS mitigation module for Linux 3.x.&lt;/p&gt;

&lt;h2 id=&quot;what-is-a-syn-flood-anyway&quot;&gt;What is a SYN flood anyway?&lt;/h2&gt;

&lt;p&gt;SYN floods are one of the oldest and most common attacks, so common that the Linux kernel includes some built in support for mitigating them. When a client connects to a server using TCP, it uses the &lt;a href=&quot;https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Connection_establishment&quot;&gt;three-way handshake&lt;/a&gt; to synchronise:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;TCP Three-way Handshake&quot; src=&quot;/images/syn-flood-mitigation-with-synsanity/tcp-3whs.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;A SYN packet is essentially the client telling the server “I’d like to connect”. During this handshake, both client and server generate random Initial Sequence Numbers (ISNs), which are used to synchronise the TCP connection between the two parties. These sequence numbers let TCP keep track of which messages have been sent and acknowledged by the other party.&lt;/p&gt;

&lt;p&gt;A SYN flood abuses this handshake by only going part way through the handshake. Rather than progressing through the normal sequence, an attacker floods the target server with as many SYN packets as they can muster, from as many different hosts as they can, and spoofing the origin IP as much as they can.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;SYN Flood&quot; src=&quot;/images/syn-flood-mitigation-with-synsanity/syn-flood.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;The host receiving the SYN flood must respond to each and every packet with a SYN-ACK, but since the source IP was likely spoofed, those SYN-ACKs go nowhere (or worse, come back as rejected). These packets are almost indistinguishable from real SYN packets from real clients, which makes it hard or impossible to filter out the bad ones on the server. Even external DDoS scrubbing services can only guess whether a packet is legitimate or part of a flood, making it difficult to mitigate an attack without impacting legitimate traffic.&lt;/p&gt;

&lt;p&gt;To make matters worse, when the server is handling normal connections and receives the ACK from a real client, it still needs to know that it came from a SYN packet it sent, so it must also keep a list of connections (in state &lt;code class=&quot;highlighter-rouge&quot;&gt;SYN_RECV&lt;/code&gt;) for which a SYN has been received and an ACK has not yet been received.&lt;/p&gt;

&lt;p&gt;During a SYN flood, this behaviour is undesirable. If the queue of connections in &lt;code class=&quot;highlighter-rouge&quot;&gt;SYN_RECV&lt;/code&gt; has no size limit, memory will get exhausted pretty quickly. If it does have a size limit, as is the case in Linux, then there’s no more space to store state and the connections will simply fail as the packets are dropped.&lt;/p&gt;
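&lt;p&gt;On Linux, both the queue limit and the fallback behaviour are visible via &lt;code class=&quot;highlighter-rouge&quot;&gt;sysctl&lt;/code&gt;; a quick way to inspect them (the values you see will vary by distribution and tuning):&lt;/p&gt;

```shell
# Size limit of the queue of connections in SYN_RECV state:
sysctl net.ipv4.tcp_max_syn_backlog

# Whether the kernel falls back to SYN cookies when that queue overflows
# (1 is the common default):
sysctl net.ipv4.tcp_syncookies

# Count connections currently stuck in SYN_RECV, e.g. during a flood:
ss -n state syn-recv | wc -l
```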

&lt;h2 id=&quot;syn-cookies&quot;&gt;SYN cookies&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/SYN_cookies&quot;&gt;SYN cookies&lt;/a&gt; are a clever way of avoiding the storage of TCP connection state during the initial handshake, deferring that storage until a valid ACK has been received. The server crafts the Initial Sequence Number (ISN) of its SYN-ACK packet as a cryptographic hash of details about the initial SYN packet and its TCP options. When the ACK is received (with an acknowledgment number 1 larger than the ISN), the server can validate that it generated the SYN-ACK packet for which an ACK is now being received. The server stores no state for the connection until the ACK (containing the validated SYN cookie) arrives, and only at that point is state regenerated and stored.&lt;/p&gt;
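&lt;p&gt;The scheme can be sketched in a few lines of Python. This is a simplified illustration of the idea only, not the kernel’s actual algorithm (which additionally encodes the MSS and a slowly-incrementing timestamp into the cookie):&lt;/p&gt;

```python
import hashlib
import os

# Simplified illustration of the SYN cookie idea -- NOT the kernel's actual
# algorithm. The server encodes connection details into the ISN it sends,
# so it can later verify the ACK without having stored any state.

SECRET = os.urandom(16)  # known only to the server

def make_syn_cookie(src_ip, src_port, dst_ip, dst_port, client_isn):
    """Craft the server's ISN for the SYN-ACK as a keyed hash of the 4-tuple."""
    data = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}:{client_isn}".encode()
    digest = hashlib.sha256(SECRET + data).digest()
    return int.from_bytes(digest[:4], "big")  # a 32-bit TCP sequence number

def ack_is_valid(src_ip, src_port, dst_ip, dst_port, client_isn, ack_number):
    """The ACK must acknowledge our ISN + 1; recompute the cookie to check."""
    expected = (make_syn_cookie(src_ip, src_port, dst_ip, dst_port, client_isn) + 1) % 2**32
    return ack_number == expected

# The server sends a SYN-ACK carrying this ISN, storing nothing:
cookie = make_syn_cookie("198.51.100.7", 54321, "203.0.113.1", 443, 1000)
# A legitimate client ACKs with cookie + 1, which validates:
assert ack_is_valid("198.51.100.7", 54321, "203.0.113.1", 443, 1000, (cookie + 1) % 2**32)
# An ACK with any other acknowledgment number does not:
assert not ack_is_valid("198.51.100.7", 54321, "203.0.113.1", 443, 1000, (cookie + 2) % 2**32)
```

&lt;p&gt;Because only the server holds the secret, only the server can mint a valid cookie, which is what lets it discard all per-connection state until a verifiable ACK shows up.&lt;/p&gt;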

&lt;p&gt;Since this hash is calculated with a secret that only the server knows, it doesn’t significantly weaken the sequence number selection and it’s still difficult for someone to forge an ACK (or other packet) for a different connection without having seen the SYN-ACK from the real server.&lt;/p&gt;

&lt;p&gt;SYN cookies have been around for a while, and they have fairly minimal impact on the reliability and spoof-protection of TCP. Rather than enabling them constantly, the Linux kernel by default automatically enables SYN cookies only when the SYN receive queue is full. This means that under normal circumstances when no SYN flood is occurring, you get no impact at all, but during a SYN flood, you accept the minimal impact of SYN cookies (in return for not dropping connections). The extra CPU cost of creating SYN cookies is offset by the fact that you no longer have a limited resource, and in practice this is an excellent trade-off.&lt;/p&gt;

&lt;p&gt;In Linux 3.x, SYN cookies are generated under a machine-wide lock on the LISTEN socket that the packet was destined for. This implementation causes all SYN cookies to be generated serially across all cores, defeating the benefits of a multi-processor system. To make matters worse, all cores spin waiting for the lock to become available. This was fine back in the days when an average attacker could only send a few Mbit/s of SYN packets your way, mostly because networks were much slower. These days, however, with servers attached to transit providers over multiple 10Gb/s+ links the whole way down the line, it’s now possible to completely saturate CPU resources.&lt;/p&gt;

&lt;p&gt;While Linux 4.x has a patch to send SYN cookies under a per-CPU-core socket lock, which does fix the problem, we wanted a solution that allowed us to use an existing, maintained kernel with upstream security patches. We didn’t want to roll and maintain an entire custom kernel, plus all related future security patches, just to mitigate this form of attack. Backporting the socket lock change to Linux 3.x posed a similar maintenance burden that we wanted to avoid.&lt;/p&gt;

&lt;h2 id=&quot;synproxy&quot;&gt;SYNPROXY&lt;/h2&gt;

&lt;p&gt;One solution to get the best of both worlds was the SYNPROXY iptables module. It sits in &lt;a href=&quot;http://www.netfilter.org/&quot;&gt;netfilter&lt;/a&gt; in the kernel, before the Linux TCP stack, and as the name suggests, proxies all connections while generating SYN cookies. When a SYN packet comes in, it responds with a SYN-ACK and throws away all state. On receipt of a valid ACK packet matching the SYN cookie, it then sends a SYN downstream and completes the usual TCP handshake. For every subsequent packet in each direction, it modifies the sequence numbers so that it is transparent to both sides.&lt;/p&gt;
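&lt;p&gt;For reference, a representative SYNPROXY ruleset looks something like the following (the port and TCP options here are illustrative, drawn from the module’s common usage):&lt;/p&gt;

```shell
# Exempt incoming SYNs on port 443 from connection tracking, so the kernel
# stores no conntrack state for half-open handshakes:
iptables -t raw -A PREROUTING -p tcp --dport 443 --syn -j CT --notrack

# Hand untracked/invalid TCP packets to SYNPROXY, which answers with
# SYN-cookie SYN-ACKs and only opens a real backend connection on a valid ACK:
iptables -A INPUT -p tcp --dport 443 -m conntrack --ctstate INVALID,UNTRACKED \
  -j SYNPROXY --sack-perm --timestamp --wscale 7 --mss 1460

# Drop whatever SYNPROXY did not validate:
iptables -A INPUT -p tcp --dport 443 -m conntrack --ctstate INVALID -j DROP
```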

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;SYNPROXY packet flow&quot; src=&quot;/images/syn-flood-mitigation-with-synsanity/synproxy.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;This is quite an intrusive way of solving the problem since it touches every packet during the entire connection, but it does successfully mitigate SYN floods. Unfortunately we found that in practice, under our load and with the volume of malformed packets we receive, it quickly broke down and caused a kernel panic. Additionally, it had to be enabled all the time, since there was no simple way to activate it only when under attack. This meant that we would have to accept the minimal impact of SYN cookies constantly, and at our scale this would still likely cause issues for some of our users.&lt;/p&gt;

&lt;p&gt;We decided that it was more complicated than it needed to be for our use case, and we wanted a simpler solution that would only touch the packets that needed to be touched to mitigate a SYN flood. We also decided that a mitigation should only cause potential (even if minimal) impact during mitigation, and not under normal operation.&lt;/p&gt;

&lt;h2 id=&quot;synsanity&quot;&gt;synsanity&lt;/h2&gt;

&lt;p&gt;Enter synsanity, our solution to mitigate SYN floods on Linux 3.x. synsanity is inspired by SYNPROXY, in that it is an iptables module that sits between the Linux TCP stack and the network card. The major difference is that rather than touching all packets, synsanity simply generates a SYN cookie exactly as the Linux kernel would if the SYN queue were full; once it validates the ACK packet, it allows it through to the standard Linux SYN cookie code, which creates and completes the connection. After this point, synsanity doesn’t touch any further packets in the TCP connection.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;synsanity packet flow&quot; src=&quot;/images/syn-flood-mitigation-with-synsanity/synsanity.png&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Just as Linux only enables SYN cookies when the SYN queue overflows, we only enable synsanity when the SYN queue overflows. We match the core Linux code exactly, except that we do it in an iptables module, outside the LISTEN lock. Since an iptables module can be compiled and maintained outside the Linux kernel source tree, we don’t need to use a custom Linux kernel, and can instead just maintain and deploy a single module to our servers.&lt;/p&gt;

&lt;p&gt;synsanity has allowed us to mitigate multiple attacks that would have previously caused a partial or complete service outage, both long running attacks and large volume attacks.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;img alt=&quot;synsanity syncookie graph&quot; src=&quot;/images/syn-flood-mitigation-with-synsanity/graph-300kpps-syn-flood.png&quot; /&gt;
synsanity sending SYN cookies during a 300kpps SYN flood
&lt;/div&gt;

&lt;h2 id=&quot;open-source&quot;&gt;Open Source&lt;/h2&gt;

&lt;p&gt;We believe that if you need to hide your mitigation to keep it secure, it’s not designed well enough. The best and most secure tools are shared, open and subject to community scrutiny, so today we’re open sourcing &lt;a href=&quot;https://github.com/github/synsanity&quot;&gt;synsanity&lt;/a&gt; so that everyone can benefit from this work.&lt;/p&gt;</content><author><name>{&quot;username&quot;=&gt;&quot;theojulienne&quot;, &quot;fullname&quot;=&gt;&quot;Theo Julienne&quot;, &quot;role&quot;=&gt;&quot;Infrastructure Engineering Manager&quot;, &quot;twitter&quot;=&gt;&quot;theojulienne&quot;, &quot;links&quot;=&gt;[{&quot;name&quot;=&gt;&quot;GitHub Profile&quot;, &quot;url&quot;=&gt;&quot;https://github.com/theojulienne&quot;}]}</name></author><summary type="html">GitHub hosts a wide range of user content, and like all large websites this often causes us to become a target of denial of service attacks. Around a year ago, GitHub was on the receiving end of a large, unusual and very well publicised attack involving both application level and volumetric attacks against our infrastructure.</summary></entry></feed>
