Amazon Blames S3 Outage on Human Error

Posted March 03, 2017

Though the majority of sites affected have since gone back online, some appear to still be facing issues.

Amazon Web Services (AWS) has explained the hours-long service disruption that caused many websites and Internet-connected services to go offline earlier this week. The problems made some websites and apps completely unavailable, while others showed broken links and images, leaving users and companies around the globe frustrated or confused. The problem originated in Amazon's S3 cloud storage service. Here's what happened, according to Amazon, at 9:37 a.m. PST.

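An S3 team member was attempting to execute a command that would remove a small number of servers from one of the S3 subsystems used by the billing system. That command was part of an established Amazon playbook, but, according to Amazon, one of its inputs was entered incorrectly and a larger set of servers was removed than intended.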

Because many S3 servers depend on others to work properly, the mistake caused a cascade of outages. The servers that were removed supported two other S3 subsystems. One of them "manages the metadata and location information of all S3 objects in the region", Amazon said. The other manages where new objects are stored.

AWS says that its system is designed to allow the removal of large portions of its components "with little or no customer impact". AWS says it was able to restore full S3 service and operations by 1:54 p.m. PST, more than four hours after the problem began. During the outage, S3 couldn't handle requests for objects; it was effectively turned off for the websites that depended on it.

At one point, the "dashboard" where Amazon tells its users which of its services are currently operational wasn't working because of the S3 issue.

Amazon said that it designed the system to be able to work even when "significant capacity" was removed or failed.

In short, an employee entered what they thought was a routine command to remove servers from an S3 subsystem. Removing so much server capacity required a full restart of the affected subsystems, which took longer than expected, AWS said. So even though the subsystems are designed to keep working with minimal customer impact when capacity fails, recovery took more time than it should have. AWS said it has changed the tool involved so that it removes capacity more slowly and will not remove capacity if doing so would take a subsystem below its minimum required level. Amazon is also auditing other operational tools to ensure that they have similar safety checks and will "make changes to improve the recovery time of key S3 subsystems".

In a published postmortem on the incident, AWS said that "we want to apologize for the impact this event caused for our customers".