One Amazon employee took down almost half the internet with a single typo. Here’s what happened, and how to protect your business from costly mistakes across all kinds of systems and media.
Amazon’s Typo Takedown As It Happened
On February 28, 2017, Amazon Web Services’s S3 experienced a major service disruption. S3 hosts hundreds of thousands of websites and apps. The interruption came suddenly and lasted more than four hours. A couple of days later, Amazon announced what caused the outage.
At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
That’s right. Amazon’s typo broke the internet. The employee was trying to remove servers that supported one part of the S3 ecosystem, but the typo also took down the index subsystem, which manages the location and metadata of S3 objects in the region. The disruption then cascaded to the placement subsystem and several other critical subsystems.
Amazon’s Typo Aftermath
Restoring service required restarting the affected servers. However, Amazon was caught off guard by how long the procedure would take.
We have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.
S3 finally recovered at 1:54 PM PST, hours after the incident began. Because Amazon Web Services owns about 40% of the cloud services market, all kinds of e‑commerce businesses and apps ground to a halt during that time. One estimate suggests that the typo cost companies in the S&P 500 index about $150 million.
Amazon’s typo-driven outage is a cautionary tale for all kinds of professionals. Typos are not just a writers’ problem. Software engineers, social media users, marketers, civil engineers, and many others are all prone to them. Everyone needs a strategy to prevent typos. Here’s what we suggest:
1. Check Your Work Before It Matters
Just before any kind of written communication goes live, proofread it. Better yet, have someone else proofread it.
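For operational commands like the one in Amazon's account, the same "check before it goes live" idea can be built into the tool itself. The following is a minimal sketch, not a description of Amazon's tooling: the function `plan_removal`, the `max_percent` and `min_remaining` limits, and the server names are all hypothetical. It validates a removal request against the known fleet and rejects it if it would remove more than a small share of servers.

```python
from __future__ import annotations

from dataclasses import dataclass


class RemovalRejected(Exception):
    """Raised when a removal request fails the pre-flight check."""


@dataclass(frozen=True)
class RemovalPlan:
    to_remove: list[str]  # validated, de-duplicated server names
    remaining: int        # servers left in the fleet afterwards


def plan_removal(
    fleet: list[str],
    requested: list[str],
    max_percent: int = 5,    # hypothetical cap on the share removed at once
    min_remaining: int = 1,  # hypothetical floor on surviving capacity
) -> RemovalPlan:
    """Check a removal request before anything is actually removed."""
    fleet_set = set(fleet)
    wanted = sorted(set(requested))

    # A mistyped name should stop the command, not be silently ignored.
    unknown = [name for name in wanted if name not in fleet_set]
    if unknown:
        raise RemovalRejected(f"unknown servers: {unknown}")

    # A mistyped range or pattern often shows up as an oversized request.
    limit = len(fleet_set) * max_percent // 100
    if len(wanted) > limit:
        raise RemovalRejected(
            f"request removes {len(wanted)} servers; limit is {limit}"
        )

    remaining = len(fleet_set) - len(wanted)
    if remaining < min_remaining:
        raise RemovalRejected(f"only {remaining} servers would remain")

    return RemovalPlan(to_remove=wanted, remaining=remaining)
```

A caller would pass the current fleet and the operator's input, show the returned plan to a second person, and only then carry out the removal. A request covering half the fleet raises `RemovalRejected` instead of running.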
2. Use Redundant Redundancy
Amazon’s typo outage could have been much worse without redundant systems. Even if a critical typo goes live, having a backup plan can limit the consequences of an error, or even save your organization.
3. Practice Prevents Surprises
Amazon was surprised by how long the system restart took because that procedure hadn’t been completed in many years. Despite our best efforts (see point one), errors do happen sometimes. Imagine the worst-case scenario, and find ways to test your organization’s readiness.
4. Find Some Help
Who is the expert helping you prevent typos? If you don’t have one, we may be able to help.