This is the second article in a short series that explores what went wrong in the recent AWS outage. Read the first article.
For more than 5 hours on Feb. 28, 2017, the US-EAST-1 Region of the Amazon Web Services (AWS) ecosystem experienced serious availability issues. That’s the day that Amazon’s Simple Storage Service (S3) suffered a catastrophic technical malfunction. The internet ground into slow motion or, worse, became completely unresponsive for some customers of the cloud giant.
Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 management console, new instance launches of Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda, were also impacted while the S3 APIs were unavailable. Amazon later admitted the outage was down to human error, or at least that’s my interpretation of their note.
Effectively, Amazon’s conclusion is that the S3 outage occurred when an employee making a controlled and authorized change to delete servers deleted too many, and possibly the wrong ones. These included servers that were driving some other critical but unlinked tasks in their index system.
It’s possible that Amazon’s system index had grown to such a colossal size that nobody completely understood all the integrated functions relying on it, and consequently, nobody in the organization could predict the seismic effect of restarting it. So even Amazon, a leader in this space, can experience issues with control and information at times. This is not a risk unique to them. Despite the best of procedures and controls, as in any business, human intervention can still produce unexpected and catastrophic results.
For all the robotic cleverness that automation promises for managing modern cloud environments, public or private, a lack of visibility and functional clarity, combined with manual tasks, leaves those environments exposed to human error that can take down even the most robust of cloud services. There are, arguably, few processes that can be completely automated. Automation works on the input given to it, input that often comes from engineers or programmers. So long as this is the case, the risk of things going wrong will always be there. We should accept that this is our reality and, with awareness of it, plan to limit impacts and recover when things do go wrong.
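One way to limit the impact of a mistyped input is to have the tooling check the request before acting on it. The following is only a minimal sketch of that idea, not a description of Amazon's tooling: `RemovalPolicy`, `plan_removal`, and both thresholds are hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request would cut too deep into a fleet."""


@dataclass(frozen=True)
class RemovalPolicy:
    # Hypothetical limits; real values would come from capacity planning.
    min_remaining: int               # servers that must stay in service
    max_fraction_per_change: float   # largest share removable in one change


def plan_removal(fleet: list[str], requested: list[str],
                 policy: RemovalPolicy) -> list[str]:
    """Return the servers that may be removed, or raise if the request is unsafe."""
    fleet_set = set(fleet)

    # Reject servers that are not part of the fleet the operator meant to touch.
    unknown = sorted(s for s in set(requested) if s not in fleet_set)
    if unknown:
        raise UnsafeRemovalError(f"not in this fleet: {unknown}")

    to_remove = sorted(set(requested))

    # Never let the fleet drop below its minimum required capacity.
    remaining = len(fleet_set) - len(to_remove)
    if remaining < policy.min_remaining:
        raise UnsafeRemovalError(
            f"only {remaining} servers would remain; "
            f"minimum is {policy.min_remaining}"
        )

    # Cap how much can go in a single change, so a typo cannot remove most of it.
    if len(to_remove) > policy.max_fraction_per_change * len(fleet_set):
        raise UnsafeRemovalError(
            f"{len(to_remove)} servers exceeds the per-change limit"
        )

    return to_remove
```

With a ten-server fleet, a floor of six and a cap of 20 percent per change, a request for two servers passes while a request for five is refused before anything is touched.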
The hit was far-reaching and widespread, from global household brands like Apple — whose iCloud service was affected — to developer sites like Docker to email providers such as Yahoo, and on down the line to individual netizens searching for their favorite photos of the kids. Some of the worst affected were companies in the IoT space whose smart home customers were left with homes that weren’t so smart, and possibly dark and cold too, as smart thermostats and light bulbs failed to respond to input. Reminiscent of the tale of the unsinkable Titanic, the unbreakable cloud was, to everyone’s shock and amazement, broken.
What other ticking time bombs exist in the cloud, undetected, waiting to wreak havoc on unsuspecting consumers going about their business? The fact is, even the most advanced environments go wrong, and some environments can rely on technology to a point where staff lose real control and understanding. This is as much of a risk to businesses as the actual outage itself. This, however, is where the unstoppable march of technology and cloud adoption is leading us. Consequently, some companies may come to see the cloud as infallible and immune to failure, holding a sort of religious faith that nothing can or will go wrong. Faith has replaced tried-and-trusted confidence, which is not a comforting attribute for corporate IT managers undertaking their risk assessments.
As economics drives clients to seek savings and bring agility to corporate IT, it potentially dictates a one-way path to the cloud. What then can we advise our customers to do to reduce risk and bring back confidence? Things will break; even the most reliable services will experience an event of some kind. The key is in preparation and planning, to ensure the next cloud outage that hits is only a hiccup and not an incapacitating heart attack. The steps to a plan are simple:
- Understand your (clients’) environment end-to-end.
- Know the impacts of a loss of a cloud service, application or data on business as usual (BAU).
- Be familiar with the disaster recovery design options available.
- Don’t put all your eggs in the same basket.
- Treat operating in the cloud as a living, evolving activity, not a singular event built around the initial move.
- Have a plan? Does it work? How do you know? The only way to be sure is to test and practice it.
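The "eggs in one basket" and "test and practice" steps above can be sketched together as a small failover drill. This is a hedged illustration under stated assumptions: the stores are in-memory stand-ins, and `StoreUnavailable`, `make_store`, and `fetch_with_failover` are hypothetical names, not part of any AWS SDK.

```python
from __future__ import annotations

from typing import Callable, Sequence


class StoreUnavailable(Exception):
    """Raised when a storage location cannot serve a request."""


# A fetch function takes an object key and returns its bytes.
Fetch = Callable[[str], bytes]


def make_store(data: dict[str, bytes], healthy: bool = True) -> Fetch:
    """Build an in-memory stand-in for one storage location (illustrative only)."""
    def fetch(key: str) -> bytes:
        if not healthy:
            raise StoreUnavailable("simulated outage")
        return data[key]
    return fetch


def fetch_with_failover(key: str,
                        stores: Sequence[tuple[str, Fetch]]) -> tuple[str, bytes]:
    """Try each store in order; return the name of the one that answered and the data."""
    errors = []
    for name, fetch in stores:
        try:
            return name, fetch(key)
        except StoreUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise StoreUnavailable("all stores failed: " + "; ".join(errors))


# Drill: take the primary offline on purpose and confirm the copy still serves.
photos = {"kids.jpg": b"\xff\xd8"}
stores = [
    ("primary", make_store(photos, healthy=False)),
    ("secondary", make_store(photos)),
]
served_by, payload = fetch_with_failover("kids.jpg", stores)
```

Running the drill on a schedule, rather than once, is what turns the plan from an assumption into something you know works.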
In the last part of this series, you can learn more on the options for reducing risk from cloud outages.
Lee Benning is the Chief Cloud Architect for Leidos UK/Europe.