Explained: How AWS outage knocked out thousands of websites and smart devices

On Monday, while India and several South and Southeast Asian countries were in festive fervour, Amazon Web Services (AWS), the largest cloud provider, suffered an outage that took thousands of websites and services offline around the world. The outage lasted a few hours but exposed how a tangle of dependencies and poor fault tolerance can weigh heavily on the day-to-day functioning of the world.

What happened?

While Amazon is yet to report the complete technical details of the outage, it appears to have originated with a DNS (Domain Name System) issue affecting AWS’s DynamoDB service endpoints. DNS is the system that converts web addresses into IP addresses, allowing browsers to connect to the correct server and load a webpage. A DNS issue arises when a domain name cannot be translated into an IP address, so the connection to the required website fails. In AWS’s case, the DNS issue stemmed from an erroneous update.

The problem arose at one of AWS’ oldest and largest data centers, in Virginia (US-East-1 region), after a technical update to the API of DynamoDB, a cloud database service that stores user information and other important data for online platforms. An API, or Application Programming Interface, is a set of rules that allows different software applications to communicate with each other. Because of the DNS issue, apps could not find the IP address for DynamoDB’s API and were unable to connect.
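To make the failure mode concrete, here is a minimal Python sketch of what an app does before it can call an API: it first asks DNS for the IP address, and only then connects. The hostname `database.example.com`, the helper names, and the two stand-in lookup functions are assumptions made for illustration; they are not taken from AWS or from any affected app.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the IP addresses for hostname, or [] if DNS resolution fails."""
    try:
        results = lookup(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be translated into an address.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address.
    return sorted({info[4][0] for info in results})


def call_api(hostname, lookup=socket.getaddrinfo):
    """Describe what a client would do: without an IP, it never gets to connect."""
    addresses = resolve(hostname, lookup)
    if not addresses:
        return "unreachable: no IP address for " + hostname
    return "would connect to " + addresses[0]


# Stand-in lookups (hypothetical) so the sketch runs without a network.
def failing_lookup(host, port, **kwargs):
    raise socket.gaierror(-2, "Name or service not known")


def working_lookup(host, port, **kwargs):
    # 192.0.2.10 is an address reserved for documentation.
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.10", port))]


print(call_api("database.example.com", working_lookup))
print(call_api("database.example.com", failing_lookup))
```

In the failing case, the database behind the name may be perfectly healthy; the client simply has no address to send its request to.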

Although AWS fixed the DNS issue in a few hours, the problem had already spread. EC2, which runs virtual servers, stopped working because it depends on DynamoDB. Soon after, the system that checks the health of network load balancers also failed. This brought down several other key services like Lambda, CloudWatch, and SQS, along with over 75 others that needed network connectivity. As servers couldn’t talk to each other and new ones couldn’t start, AWS had to slow down EC2 launches and Lambda functions to prevent a total collapse. It took more than 12 hours to restore everything, as AWS engineers worked through a huge backlog of stuck requests.
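The chain reaction described above can be sketched as a walk over a dependency graph. The graph below is a heavily simplified assumption based only on the sequence reported in this article; it is not AWS's actual architecture, and the function name `impacted_by` is hypothetical.

```python
from collections import deque

# service -> services it depends on (simplified and partly assumed)
DEPENDS_ON = {
    "DynamoDB": ["DNS"],
    "EC2": ["DynamoDB"],
    "NLB health checks": ["EC2"],
    "Lambda": ["NLB health checks"],
    "CloudWatch": ["NLB health checks"],
    "SQS": ["NLB health checks"],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the graph: dependency -> services that rely on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed component.
    impacted = []
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                impacted.append(service)
                queue.append(service)
    return impacted


print(impacted_by("DNS", DEPENDS_ON))
```

Even in this toy graph, a single failure at the bottom reaches every service above it, which is why fixing the original DNS fault did not immediately restore everything else.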

What services were impacted?

Network performance monitor Ookla found that the services of at least a thousand companies worldwide were inaccessible. Popular apps like Reddit, Snapchat, and Duolingo all faced disruptions. Several major platforms, including Perplexity, cryptocurrency exchange Coinbase, and trading app Robinhood, reported service issues. Amazon’s own services, such as its shopping site, Prime Video, and Alexa, were also impacted.

In a rather dystopian turn of events, smart home appliances stopped working as a result of the AWS outage. For instance, Eight Sleep’s smart mattress system, which relies heavily on cloud connectivity (via AWS) for key functions like temperature regulation and adjusting bed incline, faced disruption. Users complained online of malfunctions such as beds being stuck in unwanted positions, overheating, or not responding to temperature changes.

Similarly, smart device control platform SwitchBot was not able to function properly. “We had a temporary service disruption due to an AWS US East outage. As of Oct 20, 2:30 AM PDT, services in the US & Asia have mostly resumed, though some users may still see intermittent issues,” a tweet from the company’s official X handle said.

What did Amazon say?

AWS said that between 11:49 PM on October 19 and 2:24 AM on October 20 (PDT), it faced higher error rates across several services in its US-EAST-1 region. The problem also affected Amazon.com, its subsidiaries, and AWS Support systems. The company identified the cause as a DNS issue affecting DynamoDB service endpoints and fixed it in about two hours. Once DynamoDB was restored, most services began to recover, though some internal systems were still slow. 

“To facilitate full recovery, we temporarily throttled some impaired operations such as EC2 instance launches. By 12:28 PM PDT, many AWS customers and AWS services were seeing significant recovery. We continued to reduce throttling of EC2 new instance launch operations while we worked to mitigate the remaining impact. By 3:01 PM PDT, all AWS services returned to normal operations,” the company spokesperson said in a statement.
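Throttling of the kind described in the statement is often implemented with a token bucket, which caps how many operations are admitted per second and can be loosened gradually as systems recover. The sketch below shows that general technique only; AWS has not said which mechanism it used, and the class name and numbers here are assumptions.

```python
class TokenBucket:
    """Admit at most `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens that can accumulate
        self.tokens = capacity    # start with a full bucket
        self.last = 0             # time of the previous check, in seconds

    def allow(self, now):
        """Return True if an operation arriving at time `now` may proceed."""
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, without exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Heavily throttled: one launch per second, bursts of two.
bucket = TokenBucket(rate=1, capacity=2)
print([bucket.allow(now=0) for _ in range(3)])  # third request is rejected

# As recovery progresses, the operator raises the rate to admit more launches.
bucket.rate = 5
print(bucket.allow(now=1))
```

Rejected requests are not lost in a real system; callers typically retry later, which is one reason a backlog can take hours to drain after the underlying fault is fixed.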

Blue screen of death disruption in 2024

The AWS outage is reminiscent of the Microsoft-CrowdStrike disruption in 2024, mainly due to the scale of impact. Multiple sectors faced disruptions in July due to a technical failure involving Microsoft and cybersecurity firm CrowdStrike. The outage affected businesses not only in India but also in Australia, Germany, the United States, and the UK. Reports indicated that millions of Microsoft Windows users encountered the "Blue Screen of Death" (BSOD), which can cause abrupt system restarts and potential loss of unsaved data.

Microsoft separately attributed an Azure disruption around the same time to a "configuration change" in its Azure backend, which disrupted connections between storage and compute resources and affected Microsoft 365 services. Several Indian airlines, including Air India, Indigo, Akasa Air, and SpiceJet, reported delays due to the outage. Users also struggled to access Microsoft apps like Microsoft 365, Microsoft Teams, and Microsoft Azure.

