We combine them with logic in our API that uses each data set under specific conditions.
Because batch computations are repeatable and more fault-tolerant than speed computations, our APIs always favor batch-produced data. So, for example, if our API receives a request for a thirty-day, time-series DAU graph, it will first request the full range from the batch-serving Cassandra cluster. If this is a historical query, all of the data will be satisfied there. However, in the more likely case that the query includes the current day, the query will be satisfied mostly by batch-produced data, and just the most recent day or two will be satisfied by speed data.
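To make this concrete, here is a minimal sketch of that query-splitting logic. It is not our production code; `query_batch`, `query_speed`, and the batch watermark (the most recent day the batch layer has fully processed) are illustrative assumptions:

```python
from datetime import date, timedelta

def query_time_series(metric, start_day, end_day, batch_watermark,
                      query_batch, query_speed):
    """Serve a time-series query, preferring batch-produced data.

    batch_watermark is the most recent day fully processed by the
    batch layer; anything after it is served from the speed layer.
    query_batch and query_speed are assumed to return mappings of
    day -> value for the requested range.
    """
    results = {}
    # Everything up to and including the watermark comes from batch.
    batch_end = min(end_day, batch_watermark)
    if start_day <= batch_end:
        results.update(query_batch(metric, start_day, batch_end))
    # Any remaining recent days come from the speed layer.
    speed_start = max(start_day, batch_watermark + timedelta(days=1))
    if speed_start <= end_day:
        results.update(query_speed(metric, speed_start, end_day))
    return results
```

For a thirty-day DAU query ending today, this yields one batch query covering all but the most recent day or two, and one speed query covering the remainder.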
Handling of failure scenarios
Let’s go over a few different failure scenarios and see how this architecture allows us to gracefully degrade instead of going down or losing data when faced with them.
We already discussed the on-device retry-after-back-off strategy. The retry ensures that data eventually gets to our servers despite client-side network unavailability or brief outages of our back-end servers. A randomized back-off ensures that devices don’t overwhelm (DDoS) our servers after a brief network outage in a single region or a brief period of unavailability of our back-end servers.
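The idea behind randomized back-off fits in a few lines. This is a sketch of the technique rather than our actual client code; `send`, `base`, `cap`, and `max_attempts` are hypothetical names:

```python
import random
import time

def send_with_backoff(send, payload, base=2.0, cap=600.0, max_attempts=10):
    """Retry with randomized exponential back-off ("full jitter").

    Sleeping a random amount up to an exponentially growing (but
    capped) bound spreads retries out over time, so a fleet of
    devices recovering from the same outage doesn't hammer the
    servers in lockstep.
    """
    for attempt in range(max_attempts):
        if send(payload):  # hypothetical send returning True on success
            return True
        delay = random.uniform(0, min(cap, base * 2 ** attempt))
        time.sleep(delay)
    return False  # give up for now; the payload stays on device storage
```

Because each device draws an independent random delay, retries after a region-wide outage are smeared over minutes rather than arriving in one synchronized wave.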
What happens if our speed (real-time) processing layer goes down? Our on-call engineers will get paged and address the problem. Since the input to the speed processing layer is a durable Kafka cluster, no data will have been lost, and once the speed layer is back up and functioning, it will catch up on the data that it should have processed during its downtime. Since the speed layer is completely decoupled from the batch layer, batch layer processing will go on unimpacted. Thus the only impact is a delay in real-time updates to data points for the duration of the speed layer outage.
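The catch-up behavior comes from Kafka's consumer offsets: because progress is checkpointed durably, a restarted consumer resumes from its last committed position and works through the backlog. A stream-processing framework manages this bookkeeping for you; the minimal sketch below uses the kafka-python client, with the topic name, group id, broker address, and `process` function all as illustrative assumptions:

```python
from kafka import KafkaConsumer  # kafka-python client, used for illustration

def process(value):
    """Placeholder for the speed layer's real-time computation."""
    ...

# Offsets committed to Kafka survive a crash, so a restarted speed
# layer resumes exactly where it stopped and drains the backlog that
# accumulated during the outage; no events are lost or skipped.
consumer = KafkaConsumer(
    'events',                        # hypothetical topic of received events
    group_id='speed-layer',          # progress is tracked per consumer group
    bootstrap_servers='kafka:9092',  # hypothetical broker address
    enable_auto_commit=False,        # commit only after successful processing
    auto_offset_reset='earliest',
)

for record in consumer:
    process(record.value)
    consumer.commit()                # durable checkpoint of progress
```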
What happens if there are issues or severe delays in the batch layer? Our APIs will seamlessly query for more data from the speed layer. A time-series query that may have previously received one day of data from the speed layer will now query it for two or three days of data. Since the speed layer is completely decoupled from the batch layer, speed layer processing will go on unimpacted. At the same time, our on-call engineers will get paged and address the batch layer issues. Once the batch layer is back up, it will catch up on delayed data processing, and our APIs will once again seamlessly utilize the batch-produced data that is now available.
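In terms of the earlier query-splitting sketch, no special fallback code is needed: a stale batch watermark simply shifts more of the requested range onto the speed layer. A toy calculation, with dates chosen purely for illustration:

```python
from datetime import date, timedelta

def speed_days(end_day, batch_watermark):
    """Trailing days that must be served from the speed layer."""
    return max(0, (end_day - batch_watermark).days)

today = date(2015, 2, 17)                            # illustrative date
print(speed_days(today, today - timedelta(days=1)))  # healthy batch layer: 1
print(speed_days(today, today - timedelta(days=3)))  # delayed batch layer: 3
```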
Our back-end architecture consists of four major components: event reception, event archival, speed computation, and batch computation. Durable queues between each of these components ensure that an outage in one component does not spill over to the others and that we can later recover from the outage. Query logic in our APIs allows us to gracefully degrade when one of the computation layers is delayed or down, and then seamlessly recover when it comes back up.
Our goal for Answers is to create a dashboard that makes understanding your user base dead simple so you can spend your time building amazing experiences, not digging through data. Learn more about Answers here and get started today.
Big thanks to the Answers team for all their efforts in making this architecture a reality. Also to Nathan Marz for his Big Data book.
Contributors
Andrew Jorgensen, Brian Swift, Brian Hatfield, Michael Furtak, Mark Pirri, Cory Dolphin, Jamie Rothfeder, Jeff Seibert, Justin Starry, Kevin Robinson, Kristen Johnson, Marc Richards, Patrick McGee, Rich Paret, Wayne Chang.