Beyond high-profile global consumer and consumer-enterprise disruptions, the AWS and Vodafone outages this month show how Industry 4.0 can fail without proper cloud and network redundancy.
Fallible cloud – even highly redundant hyperscalers like AWS can fail, revealing hidden single points of failure that ripple through global industries.
OT resilience – industrial operations require data to stay on-site; cloud-edge systems can still fail, highlighting the need for independent edge architectures.
Layer zero – edge networks, network redundancy, and network diversity are as critical as servers to ensure continuity when public clouds go down.
It has taken a couple of days, but, then, there is a lot to unpick from the AWS outage that tore through the global economy this week. Layer in the Vodafone outage in the UK a week ago – plus the Nexperia shutdown in the Netherlands, if we are to consider the physical lines of business in Industry 4.0, as well as the digital ones – and we have a total industrial cluster-f@ck, and a stark warning for enterprises, industries, and governments about inherent points of failure in world-conquering digital infrastructure monopolies. It is also about private 5G, of course. (It’s not, really, but we can make it so.) Anyway, lots to consider.
The AWS outage on Monday (October 20) stemmed from a back-end error in its domain name system (DNS) at its ‘US-East-1’ data centre region in Virginia; the Vodafone outage last Monday (October 13) was a software issue with one of its network vendors. Neither was a cyber attack; both were resolved the same day. But between them, they killed digital services for countless enterprises: the DNS error at AWS caused failures at 150-odd major internet platforms, as reported, including at banks Lloyds and Halifax (via cloud dependencies) on the other side of the Atlantic; the issue at Vodafone downed broadband and mobile comms for “hundreds of thousands” of customers.
The cost of the AWS fiasco, in particular, sounds dramatic: estimates range from around $75 million per hour in direct (collective) losses to hundreds of billions for the entire global ripple-effect. Point is, this hide-your-face narrative about ‘single points of failure’ in the all-digital economy is up for discussion, again – as it was, most memorably, after the CrowdStrike outage in July last year, which took millions of Windows devices offline and disrupted airlines, hospitals, and retailers worldwide (to the tune of $5.4 billion in damages). Interestingly, the Nexperia incident, while different, adds another angle on the fragility of interconnected business in a global-capitalist economy.
It is an aside, but a telling one: on Monday last week (October 13), the same day Vodafone went down, the Dutch government took control of local chipmaker Nexperia under the Goods Availability Act, on national-security grounds related to the supply of critical goods and the firm’s ownership by China-based Wingtech. On Tuesday this week (October 21), China imposed export restrictions that further disrupted the flow of Nexperia components to Europe – into automakers like BMW and Volkswagen, hitting production schedules in their factories. And so, it is another closely tangled mess, wound up in concentrated points of failure, physical or digital, in globalised supply chains.
But back to AWS: roughly 70 percent of the global cloud market runs through AWS, Azure (Microsoft), or GCP (Google). Many enterprises still rely on single regions or single providers. Leonard Lee, founder at NextCurve, reflected: “We need to remember that AWS cloud is not a monolith. It is highly redundant, resilient, highly performant, and available by design. Customers will likely be working with AWS to figure out how to make their deployments more durable.” This may be so, but even well-designed systems can expose enterprises to single points of failure, especially when dependencies, hidden or obvious, span multiple geographies and functions.
Indeed, Lee’s response to the DNS diagnosis is telling. “I struggle with this notion, given the scale and scope of the outage,” he said. So, given this hyperscaler sophistication and availability-by-design, and the out-of-the-blue chaos caused by a simple DNS error, how can a UK firm (a bank, say; the people’s cash register, ironically) be taken offline by a data-centre outage in the US? The answer lies in those hidden dependencies: critical workloads, third-party services, and APIs may all converge on a single point of failure, somewhere in Virginia. Even hybrid cloud strategies only work if multi-region redundancy and failover processes are actively implemented.
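For illustration only, here is a minimal sketch of the failover discipline that implies – active health checks against per-region endpoints, with an explicit preference order. The URLs and timeout are hypothetical placeholders, not AWS APIs or any vendor’s actual configuration.

```python
# Sketch: probe regional endpoints in preference order and route to the first
# healthy one; if every region fails, degrade gracefully rather than crash.
import urllib.request
import urllib.error

REGION_ENDPOINTS = [
    "https://api.eu-west.example.com/health",   # preferred region (hypothetical)
    "https://api.us-east.example.com/health",   # secondary region (hypothetical)
]

def first_healthy_endpoint(endpoints, timeout=2.0):
    """Return the first endpoint that answers its health check, or None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # region unreachable or unhealthy; try the next one
    return None  # every region failed: switch to degraded mode, alert operators

if __name__ == "__main__":
    target = first_healthy_endpoint(REGION_ENDPOINTS)
    print(f"routing traffic to: {target or 'no healthy region - degraded mode'}")
```

The point of the sketch is simply that failover has to be an active, tested process – not an assumption inherited from the provider’s own redundancy.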
Otherwise, the cloud’s ‘resilience-by-design’ shtick will not fully protect enterprise operations – and failures will compound into economic disruption and systemic risk. Dean Bubley, founder at Disruptive Analysis, zooms out, and sums up: “We are entering a dangerous period in terms of geopolitics, hybrid warfare, and cybersecurity. Yet so much of our essential network and cloud infrastructure appears to have single points of logical failure, even if there’s physical resilience and redundancy. Often a single misconfiguration can take multiple systems offline. There’s no point having backup data centres or network paths, if they all use the same peering point or network identity,” he said.
Such technical outages are symptoms of a wider fragility: concentrated control and dependency in interconnected digital ecosystems, which expose national economies to systemic failures. Bubley reflected: “We have to worry about over-centralisation of control of [digital] ecosystems, and the commercial and financial dependence between major firms. There’s been debate about the circularity of investments between OpenAI, Nvidia, Oracle, others. But the same is true of a lot of connectivity businesses – including with infra-sharing, as well as cloud. And Europe should be wary of replicating its own local circularity [in the name of ‘sovereignty’], just without the same capital and scale.”
The received wisdom on withstanding such outages says enterprises should spread their bets, of course, in multi-cloud and hybrid-cloud setups – distributing data and applications across more than one cloud provider, and combining on-prem infrastructure with the big public cloud engines. The lesson from the AWS and Vodafone outages isn’t just to add more backup systems – it is to build an architecture that expects things to fail, and keeps critical functions running regardless. So why haven’t enterprises done this already? And why won’t they have done it by the time of the next big digital-infrastructure fail? Because surely by now they know the rules of the game.
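As a rough sketch of that ‘expect failure’ mindset, assuming hypothetical `fetch_from_cloud` and `read_local_cache` functions: the critical path retries the cloud dependency briefly, then degrades to a local answer rather than stopping.

```python
# Sketch: wrap a cloud dependency so a failure serves a fallback (e.g. a
# last-known-good value from a local store) instead of halting operations.
import time

def call_with_fallback(primary, fallback, retries=2, backoff=0.5):
    """Try the primary (cloud) dependency a few times; on failure, serve the fallback."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            if attempt < retries:
                time.sleep(backoff * (2 ** attempt))  # brief exponential backoff
    return fallback()  # degraded but alive: local cache, default value, etc.

# Hypothetical usage:
# price = call_with_fallback(fetch_from_cloud, read_local_cache)
```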
Truth is that most enterprises just can’t apply them – technically, economically, or organisationally. There is a convenience trap, too, just like with buying from Amazon Prime: cloud and network ecosystems are really good. Big cloud providers – major telcos too, to an extent – offer global reach, elastic scaling, and managed-everything at a fraction of the cost of doing it in-house. So most enterprises – even critical ones – accept some kind of dependency trade-off just for convenience. Because building and maintaining multi-cloud, multi-network resilience is expensive and complex, especially for legacy environments.
Until recently, regulators didn’t treat hyperscaler or telco dependency as systemic risk. Now, frameworks like the Digital Operational Resilience Act (DORA; for financial entities in the EU), the Network and Information Security Directive 2 (NIS2; for operators of essential services and critical infrastructure in energy, transport, health, digital infrastructure, and manufacturing), and the UK’s Operational Resilience regime (also for financial services firms) are forcing firms to show they can withstand third-party failures. But the rules are still catching up, particularly for hyperscalers, which remain largely unregulated as “critical” entities – and enforcement varies across regions and industries.
John Strand, founder at Strand Consult, has an excellent – and also angry – analysis of this (worth seeking out). He writes: “The AWS outage might seem a small price to pay for the high quality and value it provides. After all, the disruption was unintentional – a backend mistake – and AWS delivers many benefits through its scale and efficiency. But smaller enterprises, especially telecom providers, face far stricter regulatory standards…. It is difficult to fathom why AWS, with a market cap in the trillions of dollars, gets a pass… AWS consistently lobbies against financial contributions that could support more accessible and resilient access networks.”
The last point refers to its campaign – in concert with other behind-the-scenes cloud engines and ‘over-the-top’ (OTT) content providers – against “fair share” or network usage fee proposals, mainly in Europe, to make big tech and cloud firms contribute to the cost of the telecom and broadband infrastructure they rely on. It is a gnarly issue, but Strand’s argument is a tough one to dismiss. “AWS has funded reports claiming that requiring it to contribute financially to such programmes would devastate economic growth, often citing doomsday scenarios. Network usage fees are what customers pay to AWS to use its networks and services – and somehow it’s wrong for competitors to charge these.”
Outages will happen, of course. But any argument about how palatable it is for enterprises to tolerate the odd fail – fail smart, recover fast, keep the core alive – shifts in critical Industry 4.0 settings, where downtime is business-critical, sometimes life-critical, and away from the fluffier enterprise disciplines caught in the AWS fall-out (Snapchat, Roblox, Pokémon Go; Ring, Slack, Zoom; plus the high-street banks we discussed). OT systems cannot tolerate the same downtime as IT workloads; operational continuity matters more than contractual compensation. A four-nines (99.99 percent) cloud-level uptime SLA might sound safe, but it implies almost an hour of downtime per year – out of the blue.
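A quick back-of-envelope check on that figure – simple arithmetic, no vendor data involved:

```python
# Allowed downtime per year for common availability targets.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

for availability in (0.999, 0.9999, 0.99999):
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.3%} uptime -> ~{downtime:.1f} minutes of downtime per year")

# 99.900% uptime -> ~526.0 minutes (~8.8 hours)
# 99.990% uptime -> ~52.6 minutes  (the 'almost an hour' above)
# 99.999% uptime -> ~5.3 minutes
```

And, crucially, an SLA says nothing about when that hour lands, or whether it arrives all at once.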
Which is why the industrial edge, between enterprise-managed on-site data centres and regional hyperscaler ‘outposts’, matters, of course. Lee says: “Cloud players have had challenges with the different varieties of edges. This incident only serves to support the argument for OT isolation from the public cloud for industrial computing and data. Most of these industrial environments are going through organic cloud modernization. The present is the edge for Industry 4.0.” A source adds further nuance, making explicit the architectural distinction between dependent and independent edge models – and thereby exposing why some organisations remain vulnerable:
“Mission-critical industrial operations require OT data to be processed on site, and remain on site, in order to meet security and sovereignty requirements, low latency for process automation, and also to lower external dependencies in order to meet industrial reliability and availability requirements. There are many different edge-plus-cloud approaches. The ones the cloud companies tend to use are where the edge is a constantly synced image of the cloud – and so you are in trouble rapidly as things get desynced (in a few minutes to a few hours), and they do not ride out cloud or transmission problems. When the edge is independent, it is more reliable in case of cloud failure.”
It subverts the misconception that the ‘edge’ brings resiliency by itself. Many cloud-linked ‘edge’ systems are really cloud extensions, not autonomous systems; if the edge depends on continuous synchronisation with the cloud, it still fails when the cloud fails – just with a delay. So it is not about backup or recovery, but about continuity without external dependencies. In Industry 4.0, the system must keep functioning even when disconnected. Which means the control logic, analytics, and decision-making have to stay on site – at the far edge. In Industry 4.0, the cloud is a coordination or analytics layer, not a runtime dependency.
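A minimal sketch of what ‘independent edge’ means in code, with placeholder sensor, control, and sync functions (all hypothetical): the control decision is made locally every cycle, and a dead cloud link only delays analytics, never actuation.

```python
# Sketch: the control loop runs entirely on local state; cloud sync is a
# best-effort side task that can fail without stopping the process.
import time

telemetry_buffer = []  # local store-and-forward buffer

def read_sensor():
    # Placeholder for a local OT input (hypothetical values)
    return {"ts": time.time(), "temp_c": 72.4}

def apply_control(sample):
    # Local control decision; runs whether or not the cloud is reachable
    return "cooling_on" if sample["temp_c"] > 70.0 else "cooling_off"

def try_sync_to_cloud(batch):
    """Best-effort upload; returns False when the WAN or cloud is down."""
    return False  # assume disconnected in this sketch

def control_cycle():
    sample = read_sensor()
    action = apply_control(sample)            # decided at the far edge
    telemetry_buffer.append((sample, action))
    if try_sync_to_cloud(telemetry_buffer):
        telemetry_buffer.clear()              # drained only when links return
    # Either way, the plant keeps operating on local logic

for _ in range(3):
    control_cycle()
    time.sleep(0.1)
```

Contrast this with the synced-image model described above, where the local system is only ever a mirror of cloud state, and loses validity minutes or hours after the link drops.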
It also suggests a hidden weakness in edge ‘as-a-service’ models by pointing out that cloud vendors’ edge implementations often rely on a near-constant sync cycle, which is fragile in disconnection scenarios. A cloud edge is still a cloud dependency, after all. As an adjunct, but as promised, the private 5G movement is, in some ways, a parallel and complementary response to this same edge/cloud fragility in Industry 4.0 – to impose order and control over OT data, so the plant stays connected, and the data stays active, even if the public cloud or network goes dark.
Will Townsend, vice president and principal analyst at Moor Insights & Strategy, remarks: “[The outage] provides a strong argument for ensuring that organizations that manage mission-critical systems and infrastructure have reliable secondary connectivity such as cellular redundancy and link diversity.” Which sounds deceptively simple – but the point is that resilience is not just about servers and software; it is about the connectivity itself. The enterprises impacted by the Vodafone outage could have said the same; it is not always about where the workloads run, but about the paths in between. If your control paths are hitched to a single network provider, then all the redundancy higher up the stack doesn’t matter.
Point is that proper resiliency starts at the bottom layer (‘Layer 0’), with connectivity diversity; it also, implicitly, makes the case for the private/edge network movement. Private cellular networks are, by design, a form of link diversity: they allow on-site devices and systems to stay connected even when external links fail; they provide an independent path for critical data and control traffic; and they can carry fallback traffic for machine comms, robotics systems, camera vision, and industrial IoT – if they are not the primary conduit, and the main enterprise network drops. Enterprises that are thinking about private 5G for more than just latency likely have their edge/cloud resiliency cracked – or in mind, anyway.
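For illustration, a minimal ‘Layer 0’ watchdog sketch – probing two notional WAN paths and selecting whichever answers. The probe targets are documentation-range IP addresses, not real ones, and a production version would bind probes to specific interfaces or routing tables rather than just opening sockets.

```python
# Sketch: check the primary broadband path, fail over to a secondary
# (e.g. private cellular) path, and fall back to local-only operation.
import socket

LINKS = {
    "primary_broadband": ("198.51.100.1", 443),   # hypothetical probe target
    "private_cellular":  ("203.0.113.1", 443),    # hypothetical probe target
}

def link_is_up(host, port, timeout=2.0):
    """Crude reachability probe: can we open a TCP connection via this path?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def select_active_link():
    for name, target in LINKS.items():
        if link_is_up(*target):
            return name
    return None  # both paths down: local/edge systems must carry on alone

print("active link:", select_active_link() or "none - running on local edge only")
```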