The outage of the Amazon S3 service in Northern Virginia on the 28th of February 2017 was a warning shot to all of us involved in the delivery of software services of every description. It made front-page news and highlighted both the scrutiny that public cloud providers are under and how heavily they are relied upon.
Amazon’s summary of the event can be read here, and I’d encourage you to give it a scan; it should only take a couple of minutes.
Here’s a brief summary of the key points in case you want to refer to them whilst reading the rest of this:
- The root cause was incorrect command input from one of their engineers.
- They were trying to take offline some production server instances involved in the billing process and ended up removing a larger number of instances than intended.
- Two other Amazon Web Services (AWS) subsystems rely on these server instances providing a certain level of capacity. The input error caused the capacity to drop below this level.
- These two subsystems needed a full restart, and it took a couple of hours for them to fully recover.
My intention with this blog post is to relate some of the issues Amazon encountered to general approaches for mitigating that class of risk. I’m not trying to give the impression that I’m in any position to be giving Amazon advice; they have great people who know what they’re doing. But topical issues, especially those that make mainstream news, are a great opportunity for us to discuss problems at a high level, with the real-life issue providing some context. So that’s what we’re going for here.
I’d like to compliment Amazon on their transparency around this situation. They haven’t tried to cover up what happened or what the root cause was, and they should be commended for that. There’s a reasonable amount of technical detail in their summary that they’re not obliged to provide, but the fact that they have provided it means that we, the software community at large, have a chance to learn from their hard lesson too.
The dangers of manually interfering with production systems
There is always an element of risk when a human manually interferes with a production system, though that is easier to say with hindsight. With every undocumented manual tweak made by a sysadmin, you lose more collective knowledge about exactly what state the system is in. You also lose replayability. How many of us managing production environments would know the precise steps to take to restore our service if a bomb went off in our datacenter? The honest answer is usually: very few of us.
One way to reduce human error in managing systems is to use configuration management tools such as Chef, Puppet, Ansible and PowerShell DSC. These tools promote keeping your system configuration as code that can be reviewed and checked into source control as a source of truth, the sort of due diligence that a manual shell command never goes through. If you’re not familiar with them, the code you write to enforce desired state on a system is not the imperative code a developer might write to implement a new service feature; it is declarative, with the run-time logic handled by an interpreter in response to state changes. It’s simpler, designed to be accessible to anybody with a scripting background, and a great skill for sysadmins to pick up to future-proof their careers.
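To make the declarative idea concrete, here’s a minimal Python sketch (not any particular tool’s real syntax): desired state is plain data, and a generic reconciler works out which actions are needed. The service names and states are invented for illustration.

```python
# Desired state expressed as data, not as a sequence of shell commands.
desired = {"nginx": "running", "ntpd": "running", "telnetd": "stopped"}

def reconcile(desired, actual):
    """Return the actions needed to move `actual` towards `desired`."""
    actions = []
    for service, state in desired.items():
        if actual.get(service) != state:
            verb = "start" if state == "running" else "stop"
            actions.append((verb, service))
    return actions

# What the interpreter discovered on the machine just now.
actual = {"nginx": "stopped", "ntpd": "running", "telnetd": "running"}
print(reconcile(desired, actual))  # [('start', 'nginx'), ('stop', 'telnetd')]
```

Because the desired state lives in source control, re-running the reconciler is repeatable and reviewable, which is exactly what an ad-hoc shell session lacks.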
Now at this point many of you will be thinking: “That’s all well and good Kirk, but it’s not relevant. The issue here wasn’t that the server instances were misconfigured, it’s that there weren’t enough of them.” – and you would be right. Bear with me.
You can take the same idea of having a system enforce a particular instance’s state and apply it a level up: have a system enforce state on a cluster of instances and orchestrate them all towards a desired state. This is what a Kubernetes controller does for a cluster of container instances. If your desired state includes a minimum number of server instances, then as instances are taken out of production the controller can automatically spin up new instances to replace them, with no direct human involvement. It’s higher-order, but it’s the same idea.
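The control loop described above can be sketched in a few lines of Python. This is a toy model in the spirit of a Kubernetes ReplicaSet controller; the `Cluster` class and its methods are invented for illustration, standing in for real provisioning calls.

```python
class Cluster:
    def __init__(self, instances, minimum):
        self.instances = set(instances)
        self.minimum = minimum          # desired-state floor
        self._next_id = 0

    def spawn(self):
        # Stand-in for provisioning a real server instance.
        self._next_id += 1
        name = f"replacement-{self._next_id}"
        self.instances.add(name)
        return name

    def reconcile(self):
        """One pass of the control loop: replace lost capacity."""
        while len(self.instances) < self.minimum:
            self.spawn()

cluster = Cluster(["a", "b", "c", "d"], minimum=3)
cluster.instances -= {"b", "c", "d"}   # an operator removes too many instances
cluster.reconcile()                    # the controller restores the floor
print(len(cluster.instances))          # 3
```

The key point is that no human has to notice the shortfall; the controller converges the cluster back to its desired state on every pass.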
From their summary, Amazon’s solution was to change their system to enforce a minimum number of these server instances and to scale them down more gradually when they are to be taken offline. This was the right thing for them to do, and it indicates that they have the systems in place to deliver automated desired-state configuration, even if the particular constraint they accidentally broke wasn’t originally enforced by that system. Better yet, they could make the system self-healing, having it create and configure new instances to replace those taken out of production for debugging purposes.
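The two safeguards Amazon describes – refusing any removal that would take capacity below its minimum, and draining instances gradually rather than all at once – might look something like this sketch. The function name, parameters and batch size are all illustrative assumptions, not Amazon’s actual tooling.

```python
def plan_removal(current, to_remove, minimum, batch_size=1):
    """Split a removal request into safe batches, or reject it outright."""
    if current - to_remove < minimum:
        # Safeguard 1: never let a command drop capacity below the floor.
        raise ValueError(
            f"refusing: removing {to_remove} of {current} instances "
            f"would drop below the minimum of {minimum}"
        )
    # Safeguard 2: remove capacity slowly, in small batches, so an
    # operator has time to notice and abort if something looks wrong.
    batches = []
    while to_remove > 0:
        step = min(batch_size, to_remove)
        batches.append(step)
        to_remove -= step
    return batches

print(plan_removal(current=10, to_remove=4, minimum=5, batch_size=2))  # [2, 2]
```

With a guard like this in front of the removal tool, the original fat-fingered command would have been rejected before it touched a single instance.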
Design and test for failure
The blame for any commercial damage caused by the outage cannot be placed entirely at Amazon’s door, as the issue only affected one region, Northern Virginia (US-EAST-1). There are numerous strategies and tools that provide redundancy against a large cloud outage, including cross-region replication of data to another AWS region. So if customers were affected by the outage, it is because their own disaster recovery plans were also inadequate.
Moving workloads to the public cloud does not mean that high availability and disaster recovery are outsourced along with the infrastructure. Software should be designed so that there are no single points of failure, and robustness to outages of varying severity should be proactively tested.
One service famously hosted in the Northern Virginia data center is Netflix, who suffered no noticeable disruption during the outage whatsoever. This is because Netflix are world leaders in designing for failure and they actively test how their systems respond to outages with automation tools. Better yet, they’ve even open-sourced them, you can read about Netflix’s Simian Army here. Their flagship tool is called Chaos Monkey and its job is to randomly kill their production virtual machine and container instances to prove that their systems can recover from outages without customers noticing. It’s the software version of letting a raging ape run riot in your datacenter! The most appropriate tool in the Simian Army in the case of the S3 outage is Chaos Gorilla, which simulates an outage of an entire Amazon data center to test the sort of automated failover that protected Netflix when it happened for real.
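A toy version of the Chaos Monkey idea fits in a few lines: repeatedly kill a random production instance and check that the self-healing machinery keeps the fleet at its capacity target. The fleet model and `replace` callback below are invented for illustration; the real tool terminates actual AWS instances.

```python
import random

def unleash_monkey(fleet, rounds, replace, rng=random):
    """Randomly kill instances and rely on the controller to replace them."""
    for _ in range(rounds):
        victim = rng.choice(sorted(fleet))
        fleet.discard(victim)          # simulate an instance dying abruptly
        fleet.add(replace(victim))     # self-healing: a replacement spins up

fleet = {"web-1", "web-2", "web-3"}
unleash_monkey(fleet, rounds=5, replace=lambda v: v + "-r")
print(len(fleet))  # capacity held at 3 despite five random kills
```

If an assertion on the fleet size ever failed during such a run, you’d have found a recovery gap in a controlled test rather than during a real outage, which is the whole point of the exercise.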
Similarly, if Amazon had proactively tested the failure of their index and placement subsystems, they might have been able to take some preemptive action and refactor them to improve their recovery times. They may do this testing already, but in their summary they say that these subsystems had not been restarted for several years, and the work required to restart them safely surprised them. So it’s probably safe to assume that they don’t do this sort of testing on these particular subsystems.
Embrace the hard lessons
Amazon has been given a hard time in some sections of the media for the outage, but no service in the world has a 100% service-level agreement, and it’s not as if their competitors never have outages of their own; I can assure you they do! Bugs happen, people make mistakes, and not everything that could have been done to mitigate the risks makes it to the top of the backlog.
Amazon will have performed a post-mortem internally and identified deliverables that will ultimately improve the service AWS customers receive in the long run, and we’ve got to view that as a good thing. That’s how I think the software community needs to view situations like this when we find ourselves in them. We should expect failure, design under the assumption that it will happen and, when it does, try to extract the maximum value from the situation by identifying weaknesses and vulnerabilities honestly, openly and without blame.