
Automated scaling activity caused Tuesday’s AWS East failure, says Amazon

Amazon has provided additional details about last week’s AWS outage, which left many cloud-based businesses and services unreachable for hours on Tuesday. Services hosted in, or dependent on resources in, some data centers within AWS’s us-east-1 region became unresponsive following an internal networking failure.

During the outage, AWS us-east-1 customers couldn’t access services including EC2, Connect, DynamoDB, Glue, Athena, Timestream and Chime. Popular streaming services affected by the outage included Disney Plus and Netflix. The AWS outage stranded users of dating app Tinder, cryptocurrency exchange Coinbase and payment app Venmo. Players couldn’t launch popular video games like PUBG and League of Legends. Some Amazon couriers and package delivery drivers were also unable to do their jobs.

AWS Regions are physical locations around the globe where the company clusters data centers and connects them to its wide area network (WAN). Because of its location and the diversity of services it hosts, many AWS customers rely on the us-east-1 region.

The us-east-1 region is located in Northern Virginia’s “Data Center Alley.” By some estimates, seventy percent of the world’s Internet traffic passes through the area. As a result, us-east-1 is hugely popular with AWS customers. It is also AWS’s most diverse region, with more Availability Zones and Local Zones than any other. Availability Zones are logical collections of data centers within a region. Local Zones contain edge computing resources.
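For readers who want to see that region-and-zone hierarchy for themselves, the short Python sketch below uses boto3, the AWS SDK for Python, to list the Availability Zones and Local Zones that make up us-east-1. It assumes configured AWS credentials and is purely illustrative; it is not part of AWS’s outage report.

```python
# Illustrative sketch: list the zones that make up the us-east-1 region.
# Requires boto3 and AWS credentials configured in the environment.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# AllAvailabilityZones=True also returns Local Zones and Wavelength Zones,
# including ones the account has not opted into yet.
response = ec2.describe_availability_zones(AllAvailabilityZones=True)

for zone in response["AvailabilityZones"]:
    # ZoneType distinguishes standard Availability Zones ("availability-zone")
    # from edge locations ("local-zone", "wavelength-zone").
    print(f'{zone["ZoneName"]:<22} {zone["ZoneType"]:<18} {zone["State"]}')
```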

According to AWS, the problematic automated scaling processes triggered a surge of connection activity that overwhelmed the networking devices linking its internal network to the main AWS network.

“At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network,” the company noted.

“Previously unobserved behavior”

“These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks,” said AWS.

Complicating mitigation efforts, AWS operators were effectively flying blind, because the event also degraded the internal monitoring they normally rely on, according to AWS’s notes.

“Operators instead relied on logs to understand what was happening and initially identified elevated internal DNS errors,” said AWS. It would take further action and several more hours before things returned to normal. Why?

“First, the impact on internal monitoring limited our ability to understand the problem. Second, our internal deployment systems, which run in our internal network, were impacted, which further slowed our remediation efforts,” said AWS.

AWS said it will not resume the automated scaling activity that caused the us-east-1 outage until it has tested and deployed remediations.

“Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event,” said AWS. 
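The back-off behavior AWS refers to is the general pattern of exponential backoff with jitter: when a request fails, the client waits progressively longer, with a random element, before retrying, so that retries do not pile up and deepen the congestion. The Python sketch below illustrates the idea in the abstract; it is not AWS’s internal client code, and the function and parameter names are invented for illustration.

```python
# Minimal sketch of client-side exponential backoff with full jitter.
# Illustrative only; not AWS's internal networking client.
import random
import time


def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry `request` with capped exponential backoff and full jitter.

    Spacing retries out (and randomizing the delay) keeps a burst of
    failures from turning into the kind of self-reinforcing connection
    surge described in the outage report.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Exponential delay, capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```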

The automated scaling activity triggered a “previously unobserved behavior,” for which AWS engineers are currently developing a fix. The company expects to deploy it within the next two weeks.

AWS said it’s reworking its Service Health Dashboard to provide more accurate and timely information. AWS also made additional network configuration changes to protect impacted devices even if a similar event happens again.

“These remediations give us confidence that we will not see a recurrence of this issue,” said the company.
