
Monday, October 29, 2012

Post-mortem of the October 22, 2012 AWS degradation

By Jeremy Edberg & Ariel Tseitlin

On Monday, October 22nd, Amazon experienced a service degradation.  It was highly publicized because it took down many popular websites for multiple hours.  Netflix, however, while not completely unscathed, handled the outage with very little customer impact.  We did some things well and could have done some things better, and we'd like to share the timeline of the outage from our perspective and some of the best practices we used to minimize customer impact.

Event Timeline

On Monday, just after 8:30am, we noticed that a couple of large websites that are hosted on Amazon were having problems and displaying errors.  We took a survey of our own monitoring system and found no impact to our systems.  At 10:40am, Amazon updated their status board showing degradation in the EBS service.  Since Netflix focuses on making sure services can handle individual instance failure and since we avoid using EBS for data persistence, we still did not see any impact to our service.  

At around 11am, some Netflix customers started to have intermittent problems.  Since much of our client software is designed for resilience against intermittent server problems, most customers did not notice.  At 11:15am, the problem became significant enough that we opened an internal alert and began investigating the cause of the problem.  At the time, the issue was exhibiting itself as a network issue, not an EBS issue, which caused some initial confusion.  We should have opened an alert earlier, which would have helped us narrow down the issue faster and let us remediate sooner.

When we were able to narrow down the network issue to a single zone, Amazon was also able to confirm that the degradation was limited to a single Availability Zone.  Once we learned the impact was isolated to one AZ, we began evacuating the affected zone.

Because of previous single-zone outages, one of the drills we run regularly is a zone evacuation drill.  Between that practice and the lessons from those outages, the decision to evacuate the troubled zone was an easy one: we expected it to be as quick and painless as it had been during past drills.  So that is what we did.

In the past we identified zone evacuation as a good way of solving problems isolated to a single zone, so we made it easy to do in Asgard with a few clicks per application.  That preparation came in handy on Monday, when we were able to evacuate the troubled zone in just 20 minutes and completely restore service to all customers.

Building for High Availability

We’ve developed a few patterns for improving the availability of our service.  Past outages, and a mindset of designing in resiliency from the start, have taught us several best practices for building highly available systems.

Redundancy

One of the most important things we do is build all of our software to operate across three Availability Zones.  Right along with that, we make each app resilient to a single instance failing.  These two things together are what made zone evacuation easier for us.  We stopped sending traffic to the affected zone and everything kept running.  In some cases we needed to actually remove the instances from the zone, but this too was done with just a few clicks to reconfigure the auto scaling group.
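
As a rough illustration of what evacuating a zone amounts to under the hood (we drove this through Asgard, not raw SDK calls), here is a minimal sketch using the AWS SDK for Python; the group name and zone are hypothetical:

```python
# Illustrative sketch only: drop one Availability Zone from an auto scaling
# group so capacity is served from the remaining healthy zones.
# The group name and zone below are made up for the example.
import boto3

asg = boto3.client("autoscaling", region_name="us-east-1")

def evacuate_zone(asg_name, bad_zone):
    """Remove a single AZ from an auto scaling group's zone list."""
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"][0]
    remaining = [z for z in group["AvailabilityZones"] if z != bad_zone]
    asg.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        AvailabilityZones=remaining,
    )

# evacuate_zone("api-prod-v042", "us-east-1a")   # hypothetical group name
```

In practice the first step was simply to stop routing traffic to the troubled zone; reconfiguring the group was only needed where instances had to be removed outright.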

We apply the same three Availability Zone redundancy model to our Cassandra clusters.  We configure all our clusters to use a replication factor of three, with each replica located in a different Availability Zone.  This allowed Cassandra to handle the outage remarkably well.  When a single zone became unavailable, we didn't need to do anything.  Cassandra routed requests around the unavailable zone and when it recovered, the ring was repaired.
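
For illustration, the kind of keyspace definition that produces this placement looks roughly like the following. This is a sketch, not our production schema: the keyspace name and contact point are made up, and the datacenter name assumes the EC2-aware snitch, which treats the region as the datacenter and each Availability Zone as a rack.

```python
# Minimal sketch: with an EC2-aware snitch, NetworkTopologyStrategy with a
# replication factor of three places one replica in each Availability Zone.
# Requires the cassandra-driver package; names below are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.10"])      # any node in the ring
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS viewing_history
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us-east': 3            -- three replicas, one per Availability Zone
    }
""")
```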

Simian Army

Everyone has the best intentions when building software.  Good developers and architects think about error handling, corner cases, and building resilient systems.  However, thinking about them isn’t enough.  To ensure resiliency on an ongoing basis, you need to always test your system’s capabilities and its ability to handle rare events.  That’s why we built the Simian Army:  Chaos Monkey to test resilience to instance failure, Latency Monkey to test resilience to network and service degradation, and Chaos Gorilla to test resilience to a zone outage.  A future improvement we want to make is expanding Chaos Gorilla to make zone evacuation a one-click operation, making the decision even easier.  Once we build up our muscles further, we want to introduce Chaos Kong to test resilience to a complete regional outage.
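
The Simian Army itself is more sophisticated than this, but the core idea behind Chaos Monkey can be sketched in a few lines. This is a simplified, hypothetical version, not the real implementation, and the group name is made up:

```python
# Sketch of the Chaos Monkey idea: terminate a random instance in a target
# auto scaling group so teams must prove they survive instance loss.
import random
import boto3

asg = boto3.client("autoscaling", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

def unleash_monkey(asg_name):
    """Pick one instance at random from the group and terminate it."""
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"][0]
    instances = [i["InstanceId"] for i in group["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim

# unleash_monkey("api-prod-v042")   # hypothetical group name
```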

Easy Tooling & Automation

The last thing that made zone evacuation an easy decision is our cloud management tool, known as Asgard.  With just a couple of clicks, service owners can stop traffic to instances or delete them as necessary.

Embracing Mistakes

Every time we have an outage, we make sure that we have an incident review.  The purpose of these reviews is not to place blame, but to learn what we can do better.  After each incident we put together a detailed timeline and then ask ourselves, “What could we have done better?  How could we lessen the impact next time? How could we have detected the problem sooner?”  We then take those answers and try to solve classes of problems instead of just the most recent problem.  This is how we develop our best practices.

Conclusion

We weathered this last AWS outage quite well and learned a few more lessons to improve on. With each outage, we look for opportunities to improve both the way our system is built and the way we detect and react to failure.  While we feel we’ve built a highly available and reliable service, there’s always room to grow and improve.  

If you like thinking about high availability and how to build more resilient systems, we have many openings throughout the company, including a few Site Reliability Engineering positions.

Friday, July 6, 2012

Lessons Netflix Learned From The AWS Storm

by Greg Orzell & Ariel Tseitlin

Overview

On Friday, June 29th, we experienced one of our most significant outages in over a year. It started at about 8 PM Pacific Time and lasted for about three hours, affecting Netflix members in the Americas. We’ve written frequently about our resiliency efforts and our experience with the Amazon cloud. In the past, we’ve been able to withstand Amazon Web Services (AWS) availability zone outages with minimal impact. We wanted to take this opportunity to share our findings about why this particular zone outage had such an impact.

For background, you can read about Amazon’s root-cause analysis of their outage here: http://aws.amazon.com/message/67457/.  The short version is that one of Amazon’s Availability Zones (AZs) failed on Friday evening due to a power outage that was caused by a severe storm.  Power was restored 20 minutes later. However, the Elastic Load Balancing (ELB) service suffered from capacity problems and an API backlog, which slowed recovery.

Our own root-cause analysis uncovered some interesting findings, including an edge case in our internal mid-tier load-balancing service. That edge case caused unhealthy instances to fail to deregister from the load balancer, which black-holed a large amount of traffic into the unavailable zone. In addition, network calls to instances in the unavailable zone hung, rather than quickly failing with a no-route-to-host error.

As part of this outage we have identified a number of things that both we and Amazon can do better, and we are working with them on improvements.

Middle-tier Load Balancing

In our middle-tier load balancing, we had a cascading failure that was caused by a feature we had implemented to account for other types of failures. The service that keeps track of the state of the world has a fail-safe mode in which it will not remove unhealthy instances when a significant portion of them appears to fail simultaneously. This was done to deal with network partition events and was intended to be a short-term freeze until someone could investigate the large-scale issue. Unfortunately, getting out of this state proved both cumbersome and time consuming, causing services to continue trying to use servers that were no longer alive due to the power outage.
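
The fail-safe behavior described above can be sketched roughly as follows. This is an illustrative reconstruction of the logic, not our actual code, and the threshold value is made up:

```python
# Sketch of the fail-safe: refuse to evict instances when too many look
# unhealthy at once, on the assumption that a mass failure is more likely a
# network partition than real instance death. Threshold is illustrative.
FAILSAFE_THRESHOLD = 0.15   # evict only if fewer than 15% look unhealthy

def evict_unhealthy(registry, unhealthy):
    """registry: all known instances; unhealthy: instances failing health checks."""
    if not registry:
        return []
    failure_ratio = len(unhealthy) / len(registry)
    if failure_ratio >= FAILSAFE_THRESHOLD:
        # Freeze: keep everything registered and have a human investigate.
        # During this outage the freeze kept dead instances in rotation,
        # black-holing traffic into the unavailable zone.
        return []
    for instance in unhealthy:
        registry.remove(instance)
    return unhealthy
```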

Gridlock

Clients trying to connect to servers that were no longer available led to a second-order issue.  All of the client threads were taken up by attempted connections, leaving very few threads to process requests. This essentially caused gridlock inside most of our services as they tried to traverse our middle tier. We are working to make our systems resilient to these kinds of edge cases. We are still investigating why these connections timed out during connect, rather than failing fast once there was no route to the unavailable hosts.
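
A minimal sketch of the fail-fast behavior we want from clients, using only the Python standard library; the host, port, timeout value, and request path are hypothetical:

```python
# Bound how long a connect attempt can hold a client thread. During the
# outage, connects to the dead zone hung instead of returning quickly,
# tying up threads; a short connect timeout limits the damage.
import socket

CONNECT_TIMEOUT_SECONDS = 0.5   # illustrative value

def call_middle_tier(host, port):
    try:
        with socket.create_connection((host, port),
                                      timeout=CONNECT_TIMEOUT_SECONDS) as sock:
            sock.sendall(b"GET /healthcheck HTTP/1.0\r\n\r\n")  # hypothetical request
            return sock.recv(4096)
    except (socket.timeout, OSError):
        # Fail fast and let the caller retry against a healthy zone instead
        # of holding the thread for the OS-level TCP timeout.
        return None
```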

Summary

Netflix made the decision to move from the data center to the cloud several years ago [1].  While it’s easy and common to blame the cloud for outages because it’s outside of our control, we found that our overall availability over the past several years has steadily improved. When we dig into the root-causes of our biggest outages, we find that we can typically put in resiliency patterns to mitigate service disruption.

There were aspects of our resiliency architecture that worked well:
  • Regional isolation contained the problem to users being served out of the US-EAST region.  Our European members were unaffected.
  • Cassandra, our distributed cloud persistence store, which spans all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability.
  • Chaos Gorilla, the Simian Army member tasked with simulating the loss of an availability zone, was built for exactly this purpose.  This outage highlighted the need for additional tools and use cases for both Chaos Gorilla and other parts of the Simian Army.

The state of the cloud will continue to mature and improve over time.  We’re working closely with Amazon on ways that they can improve their systems, focusing our efforts on eliminating single points of failure that can cause region-wide outages and isolating the failures of individual zones.

We take our availability very seriously and strive to provide an uninterrupted service to all our members. We’re still bullish on the cloud and continue to work hard to insulate our members from service disruptions in our infrastructure.

We’re continuing to build up our Cloud Operations and Reliability Engineering team, which works on exactly the types of problems identified above, and we’re working with each service team on resiliency.  Take a look at jobs.netflix.com for more details and apply directly or contact @atseitlin if you’re interested.


[1] http://techblog.netflix.com/2010/12/four-reasons-we-choose-amazons-cloud-as.html