AWS Blames Unplanned System Restarts for S3 Outage

Amazon Web Services (AWS) said the outage that took its services down for hours this week was caused by an inadvertent restart of key Amazon S3 subsystems.
AWS released its postmortem report Thursday explaining the chain of events behind the outage, which kept S3 and many other AWS applications down for nearly four hours on Tuesday.
The outage affected the Amazon S3 storage service in AWS’ Northern Virginia region, which houses a large portion of the vendor’s cloud infrastructure. It occurred between 9:30 AM PST and just before 2 PM PST.
AWS said its technicians were investigating slowness in the S3 billing system on the morning of the outage when one of them entered a command incorrectly, which forced an unplanned restart.
The Amazon Simple Storage Service (S3) team was working to fix an issue that caused the S3 billing system to progress more slowly than expected. At 9:37 AM PST, an S3 team member who was authorized to do so executed a command intended to remove a few servers from one of the S3 subsystems used in the S3 billing process. The command was entered incorrectly and removed a larger number of servers than intended. The servers that were accidentally removed supported two other S3 subsystems. One of them, the index subsystem, manages metadata and location information for all S3 objects in the region, and it is required to service all GET, LIST and PUT requests. The other, the placement subsystem, manages the allocation of new storage and is used to allocate storage for new objects during PUT requests. It depends on the index subsystem functioning properly in order to operate. Because a significant amount of their capacity had been removed, each of these subsystems had to be restarted. While they were being restarted, S3 was unable to service requests.
AWS stated that Amazon S3 subsystems are designed to support “the removal or failure of significant capacity”, but that the subsystems in the larger Amazon S3 regions had not been restarted in many years.
Northern Virginia is AWS’ densest region, measured by how much of its cloud infrastructure is located there. AWS doesn’t give official numbers, but a 2012 study by Accenture Technology Labs found that Northern Virginia accounted for 70% of AWS’ total server racks. It is also AWS’ oldest region, established in 2006.
The company stated that S3 has seen tremendous growth over the past several years. As a result, restarting these services and running the safety checks needed to validate the integrity of the metadata took longer than expected.
AWS outlined the steps it is taking to prevent similar outages in the future, such as changing its processes so that technicians cannot remove too much server capacity too quickly and force a restart.
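The kind of safeguard described above can be illustrated with a minimal sketch. This is not AWS’ tooling: the `Subsystem` type, the `min_required` floor, the `max_per_step` limit and the `plan_removal` function are all hypothetical names chosen for illustration. The sketch assumes a removal request should be rejected outright if it would leave a subsystem below a minimum server count, and otherwise throttled to a small batch per step.

```python
from dataclasses import dataclass


class CapacityRemovalError(Exception):
    """Raised when a request would remove too much capacity (hypothetical)."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # assumed floor; below it the subsystem cannot keep serving


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int) -> int:
    """Return how many servers may be removed in the current step.

    Rejects the whole request if it would take the subsystem below its
    minimum, and otherwise caps the batch size so capacity leaves slowly.
    """
    if requested <= 0 or max_per_step <= 0:
        raise ValueError("requested and max_per_step must be positive")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityRemovalError(
            f"removing {requested} servers would leave {subsystem.name} with "
            f"{remaining}, below its minimum of {subsystem.min_required}"
        )

    # Throttle: remove at most max_per_step servers now, the rest later.
    return min(requested, max_per_step)


if __name__ == "__main__":
    index = Subsystem(name="index", active_servers=100, min_required=80)
    print(plan_removal(index, requested=5, max_per_step=2))
    try:
        plan_removal(index, requested=50, max_per_step=2)
    except CapacityRemovalError as err:
        print(f"rejected: {err}")
```

Under these assumptions, a mistyped input that asks for far too many servers fails before anything is removed, and a valid request still leaves in small batches.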
AWS is also making it a priority to refactor the S3 service into smaller, more manageable “cells”. The company explained that dividing services into cells allows technicians to test for potential problems quickly and minimize downtime. It said S3 had already undergone some refactoring and promised to divide the service further in the wake of the outage.
AWS also addressed one major casualty of the outage: its Service Health Dashboard. Customers use the Dashboard to check the status of individual AWS applications. Unfortunately, the Dashboard was rendered useless for most of the outage, incorrectly showing impacted services as “operating normally.”
“From the start of this event to 11:37 AM PST, we were not able to update the status of individual services on the AWS Service Health Data