Amazon Blames S3 Outage on Human Error

Posted March 03, 2017

Though the majority of sites affected have since gone back online, some appear to still be facing issues.

Amazon Web Services (AWS) has explained the hours-long service disruption that knocked many websites and Internet-connected services offline earlier this week. Some websites and apps became completely unavailable, while others showed broken links and images, leaving users and companies around the globe frustrated or confused. According to Amazon, the problem began at 9:37 a.m. PST in its S3 cloud storage service. Here's what happened.

An S3 team member was attempting to execute a command that would remove a small set of servers for one of the S3 subsystems used by the billing system. That command was part of an established Amazon playbook, but one of its inputs was entered incorrectly, and a larger set of servers was removed than intended.

Because many S3 servers depend on others to work properly, the mistake caused a cascade of outages. The servers removed in error supported two other S3 subsystems. One of them "manages the metadata and location information of all S3 objects in the region", Amazon said. The other manages where new items are stored.

AWS says that its system is designed to allow the removal of large chunks of its components "with little or no customer impact". During the outage, however, S3 could not handle requests for objects; it was effectively turned off for the websites that depended on it. AWS says it was able to restore full S3 service and operations by 1:54 p.m. PST, more than four hours after the problem began.

At one point, the "dashboard" where Amazon tells its users which of its services are currently operational was itself not working because of the S3 issue.

Amazon said that it designed the system to keep working even when "significant capacity" is removed or fails.

The trigger, again, was an employee entering what they thought was a routine command to remove servers from an S3 subsystem. Although the subsystems are designed to keep working with minimal customer impact when capacity fails, removing so much server capacity required a full system restart, which took longer than expected, AWS said. In response, AWS said it has changed the tool so that it removes capacity more slowly and will not take a subsystem below its minimum required capacity. Amazon is also auditing other operational tools to ensure that they have similar safety checks and will "make changes to improve the recovery time of key S3 subsystems".
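
For readers wondering what a safety check of this kind might look like, here is a minimal sketch in Python. It is not AWS's tooling: the names `Subsystem`, `remove_capacity`, `min_required` and `max_per_step` are hypothetical, and the numbers are made up. It only illustrates the general idea of refusing a removal request that is too large for one step or that would leave a subsystem with too few servers.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would breach a safety limit."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor the subsystem needs to keep serving requests


def remove_capacity(subsystem: Subsystem, count: int, max_per_step: int = 5) -> int:
    """Remove `count` servers if both checks pass and return how many remain.

    `max_per_step` is an assumed per-command limit that forces large
    removals to happen slowly, in several small steps.
    """
    if count <= 0:
        raise ValueError("count must be a positive number of servers")
    if count > max_per_step:
        raise CapacityError(
            f"refusing to remove {count} servers at once (limit {max_per_step})"
        )
    remaining = subsystem.active_servers - count
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"{subsystem.name} would drop to {remaining} servers, "
            f"below its minimum of {subsystem.min_required}"
        )
    # Both checks passed, so the removal is applied.
    subsystem.active_servers = remaining
    return remaining


if __name__ == "__main__":
    index = Subsystem("index", active_servers=100, min_required=98)
    remove_capacity(index, 2)       # allowed: leaves exactly the minimum
    try:
        remove_capacity(index, 1)   # rejected: would fall below the minimum
    except CapacityError as err:
        print(f"blocked: {err}")
```

With a guard like this, a mistyped input that asks for far too many servers fails immediately instead of taking capacity offline.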

In its published postmortem on the incident, AWS said: "we want to apologize for the impact this event caused for our customers".