AWS Outage Exposes Achilles Heel: Central Control Plane

Analysis: Amazon’s US-EAST-1 region outage caused widespread chaos, taking websites and services offline even in Europe and raising some difficult questions. After all, cloud operations are supposed to have some built-in resiliency, right?

The problems began just after midnight US Pacific Time today when Amazon Web Services (AWS) noticed increased error rates and latencies for multiple services running within its home US-EAST-1 region.

Within a couple of hours, Amazon’s techies had identified DNS as a potential root cause of the issue – specifically the resolution of the DynamoDB API endpoint in US-EAST-1 – and were working on a fix.
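For a sense of what that failure mode looks like from a client's perspective, here is a minimal sketch in Python, assuming only the standard library and the publicly documented regional endpoint name dynamodb.us-east-1.amazonaws.com. When the name stops resolving, the error surfaces before any API request is ever made.

```python
# Minimal sketch: check whether the US-EAST-1 DynamoDB API endpoint resolves.
# Assumes only the Python standard library; the hostname is the documented
# public regional endpoint for DynamoDB in US-EAST-1.
import socket

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    addresses = {info[4][0] for info in socket.getaddrinfo(ENDPOINT, 443)}
    print(f"{ENDPOINT} resolves to: {', '.join(sorted(addresses))}")
except socket.gaierror as err:
    # During a DNS failure of this kind, clients land here: the name simply
    # does not resolve, so SDK calls never reach DynamoDB at all.
    print(f"DNS resolution failed for {ENDPOINT}: {err}")
```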

However, the issue was also affecting other AWS services, including global services and features that rely on endpoints operating from AWS’ original region, such as IAM (Identity and Access Management) updates and DynamoDB global tables.

While Amazon worked to fully resolve the problem, the issue was already causing widespread chaos to websites and online services beyond the Northern Virginia locale of US-EAST-1, and even outside of America’s borders.

As The Register reported earlier, Amazon.com itself was down for a time, while the company’s Alexa smart speakers and Ring doorbells stopped working. But the effects were also felt by messaging apps such as Signal and WhatsApp, while in the UK, Lloyds Bank and even government services such as tax agency HMRC were impacted.

According to a BBC report, outage monitor Downdetector indicated there had been more than 6.5 million reports globally, with upwards of 1,000 companies affected.

How could this happen? Amazon has a global footprint, and its infrastructure is split into regions – physical locations each housing a cluster of datacenters. Each region consists of a minimum of three isolated and physically separate availability zones (AZs), each with independent power and connected via redundant, ultra-low-latency networks.

Customers are encouraged to design their applications and services to run in multiple AZs to avoid being taken down by a failure in one of them.
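As an illustration of that guidance, the sketch below enumerates a region's availability zones and spreads replicas across them round-robin – the pattern that protects against the loss of a single AZ, though not against a region-wide or control-plane failure. It assumes boto3 is installed and credentials are configured; the region and the replica names are illustrative.

```python
# Minimal sketch: list a region's Availability Zones and distribute
# hypothetical replicas across them. Assumes boto3 and AWS credentials;
# region name and replica names are illustrative only.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")
response = ec2.describe_availability_zones(
    Filters=[{"Name": "state", "Values": ["available"]}]
)
zones = [az["ZoneName"] for az in response["AvailabilityZones"]]
print(f"Available AZs in eu-west-2: {zones}")

# Round-robin placement: losing one zone leaves the remaining replicas
# serving traffic, which is the failure mode the AZ model protects against.
replicas = [f"app-replica-{i}" for i in range(6)]  # hypothetical workload names
placement = {name: zones[i % len(zones)] for i, name in enumerate(replicas)}
for name, zone in placement.items():
    print(f"{name} -> {zone}")
```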

Sadly, it seems that the entire edifice has an Achilles heel that can cause problems regardless of how much redundancy you design into your cloud-based operations, at least according to the experts we asked.

“The issue with AWS is that US East is the home of the common control plane for all of AWS locations except the federal government and European Sovereign Cloud. There was an issue some years ago when the problem was related to management of S3 policies that was felt globally,” Omdia Chief Analyst Roy Illsley told us.

He explained that US-EAST-1 can cause global issues because many users and services default to using it since it was the first AWS region, even if they are in a different part of the world.

Certain “global” AWS services and features run from US-EAST-1 and depend on its endpoints, including DynamoDB Global Tables and the Amazon CloudFront content delivery network (CDN), Illsley added.

Sid Nag, president and chief research officer for Tekonyx, agreed.

“Although the impacted region is in the AWS US East region, many global services (including those used in Europe) depend on infrastructure or control-plane / cross-region features located in US-EAST-1. This means that even if the European region was unaffected in terms of its own availability zones, dependencies could still cause knock-on impact,” he said.

“Some AWS features (for example global account-management, IAM, some control APIs, or even replication endpoints) are served from US-EAST-1, even if you’re running workloads in Europe. If those services go down or become very slow, even European workloads may be impacted,” he added.
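One way to see that dependency from the SDK side is the sketch below, which assumes boto3 is installed; it only inspects the endpoint each client would call and sends no requests. IAM and CloudFront resolve to a single partition-wide endpoint regardless of the region you configure, while DynamoDB gets a per-region one.

```python
# Minimal sketch: show that some "global" AWS services use a single
# partition-wide endpoint rather than a per-region one. Assumes boto3 is
# installed; no credentials are needed because no request is sent.
import boto3

REGION = "eu-west-2"  # illustrative European region

for service in ("iam", "cloudfront", "dynamodb"):
    client = boto3.client(service, region_name=REGION)
    print(f"{service:>10} -> {client.meta.endpoint_url}")

# Typical output (subject to SDK defaults and endpoint changes):
#        iam -> https://iam.amazonaws.com                    (global endpoint)
# cloudfront -> https://cloudfront.amazonaws.com             (global endpoint)
#   dynamodb -> https://dynamodb.eu-west-2.amazonaws.com     (regional endpoint)
```

A workload that never leaves Europe can still stall if one of those global endpoints, or the control plane behind it, has a bad day.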

Any organization whose resiliency plans extend to duplicating resources across two or more different cloud platforms will no doubt be feeling smug right now, but that level of redundancy costs money, and don’t the cloud providers keep telling us how reliable they are?

The upshot of this is that many firms will likely be taking another look at the assumptions underpinning their cloud strategy.

“Today’s massive AWS outage is a visceral reminder of the risks of over-reliance on two dominant cloud providers, an outage most of us will have felt in some way,” said Nicky Stewart, Senior Advisor at the Open Cloud Coalition.

Cloud services in the UK are largely dominated by AWS and Microsoft’s Azure, with Google Cloud coming a distant third.

“It’s too soon to gauge the economic fallout, but for context, last year’s global CrowdStrike outage was estimated to have cost the UK economy between £1.7 and £2.3 billion ($2.3 and $3.1 billion). Incidents like this make clear the need for a more open, competitive and interoperable cloud market; one where no single provider can bring so much of our digital world to a standstill,” she added.

“The AWS outage is yet another reminder of the weakness of centralised systems. When a key component of internet infrastructure depends on a single US cloud provider, a single fault can bring global services to their knees – from banks to social media, and of course the likes of Signal, Slack and Zoom,” said Amandine Le Pape, Co-Founder of Element, which provides sovereign and resilient communications for governments.

But there could also be compensation claims in the offing, especially where financial transactions may have failed or missed deadlines because of the incident.

“An outage such as this can certainly open the provider and its users to risk of loss, especially businesses that rely on its infrastructure to operate critical services,” said Henna Elahi, Senior Associate at Grosvenor Law.

Elahi added that it would, of course, depend on factors such as the terms of service and any service level agreements between the business and AWS, the specific cause of the outage, and its severity and length.

“The impacts on Lloyds Bank, for example, could have very serious implications for the end user. Key payments and transfers that are being made may fail and this could lead to far reaching issues for a user such as causing breaches of contracts, failure to complete purchases and failure to provide security information. This may very well lead to customer complaints and attempts to recover any loss caused by the outage from the business,” she said.

At 15:13 UTC today, AWS updated its Health Dashboard:

“We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations.”

Thirty minutes later, it added:

“We have taken additional mitigation steps to aid the recovery of the underlying internal subsystem responsible for monitoring the health of our network load balancers and are now seeing connectivity and API recovery for AWS services. We have also identified and are applying next steps to mitigate throttling of new EC2 instance launches.” ®


Original Source


Support Our Work

A considerable amount of time and effort goes into maintaining this website, creating backend automation and creating new features and content for you to make actionable intelligence decisions. Everyone that supports the site helps enable new functionality.

If you like the site, please support us on Patreon or Buy Me A Coffee using the buttons below.

AI APIs OSINT driven New features