Between the global shift to remote workforces and the requirements for 100 percent uptime for healthcare and emergency services, the COVID-19 pandemic is putting unprecedented strain on our networks. While no one could have predicted this pandemic or its impact on every facet of our lives, we can confidently forecast a continued need for massive bandwidth as the need to exchange information in real time keeps growing. If there is one word that describes the internet and today’s networks, it just might be “scale.”

So, scale is essentially what we have been solving for in network design. And to be more precise, scale in an efficient, economic way. Scale can be thought of in multiple dimensions: the scale to handle additional numbers of logical endpoints, such as users or thermostats; scale in terms of geographic reach; the ability to scale to add new services, from voice to virtual reality. The list goes on.

But we can no longer solve only for scale. New applications with diverse requirements have placed growing demands on the network. And new, lightning-fast competitors have entered the telecommunications ecosystem, taking customer wallet share from traditional network operators and introducing not only new ways to design and operate networks, but also completely new ways to run a business. For example, AWS now offers more than 212 different services.

The challenge for an incumbent network service provider (SP) is no longer simply how to scale the network. No doubt most network domains are marching toward 400GbE, but it’s not that simple. SPs must also solve for business agility to stay competitive. These two drivers, scale vs. agility, create conflict that requires some design tradeoffs.

Scale up

Until recently, the way in which SPs solved for scale in their networks, particularly the core, has been through “scale up” architectures. Scale up, or scaling vertically, in a general sense involves making an existing network node bigger, for example, adding capacity to existing modular routing platforms or replacing platforms entirely with bigger boxes. This “brute force” traditional approach is simple and effective, but it has limitations. After chassis capacity is exhausted, additional capacity must be added in large, coarse chunks in the form of a new chassis.

Scale up designs struggle with the ongoing mismatch between business requirements (e.g., dynamic, unpredictable bandwidth demand) and what the network can support. As the network grows, a network operator typically finds itself either under capacity or over capacity. When under capacity, SPs risk losing revenue, either by missing out on new customers or by losing existing customers they cannot adequately serve. Over-capacity, on the other hand, equals over-spending.
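The swing between under-capacity and over-capacity can be illustrated with a little arithmetic. The sketch below is a minimal illustration, not a planning tool: the step sizes (100 Tbps per chassis, 10 Tbps per smaller node) and the demand figures are hypothetical numbers chosen for round results, and the function names are invented for this example.

```python
import math

# Hypothetical capacity increments, chosen only for illustration.
CHASSIS_STEP_TBPS = 100     # one large chassis added at a time
SMALL_NODE_STEP_TBPS = 10   # one small node added at a time


def installed_capacity(demand_tbps, step_tbps):
    """Smallest multiple of step_tbps that covers demand_tbps."""
    return math.ceil(demand_tbps / step_tbps) * step_tbps


def stranded_capacity(demand_tbps, step_tbps):
    """Capacity paid for but not needed at this demand level."""
    return installed_capacity(demand_tbps, step_tbps) - demand_tbps


for demand in (60, 110, 205):  # hypothetical demand points in Tbps
    coarse = stranded_capacity(demand, CHASSIS_STEP_TBPS)
    fine = stranded_capacity(demand, SMALL_NODE_STEP_TBPS)
    print(f"demand {demand} Tbps: stranded {coarse} Tbps (coarse steps), "
          f"{fine} Tbps (fine steps)")
```

Under these assumptions, a demand of 110 Tbps forces a second chassis and leaves 90 Tbps idle, while the finer increment leaves none. The operator who delays that second chassis sits on the other side of the mismatch, 10 Tbps short.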

The large nodes of scale up designs also mean a large blast radius when things go wrong. Reliability therefore requires major redundancy. Multi-plane core scale up designs do increase availability, but at additional cost and complexity.
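To put a rough number on blast radius and the redundancy it demands, consider a simplified model in which traffic is spread evenly across identical nodes. This is an assumption made for illustration (real cores are rarely this uniform), and the function names are hypothetical.

```python
def capacity_lost_fraction(node_count):
    """Share of total capacity lost when one of node_count equal nodes fails."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return 1 / node_count


def overbuild_for_single_failure(node_count):
    """Installed capacity, as a multiple of peak demand, needed so that the
    surviving node_count - 1 nodes can still carry the full load."""
    if node_count < 2:
        raise ValueError("need at least 2 nodes to survive a failure")
    return node_count / (node_count - 1)


for n in (2, 4, 8):
    print(f"{n} nodes: one failure removes {capacity_lost_fraction(n):.1%} "
          f"of capacity; install {overbuild_for_single_failure(n):.2f}x demand")
```

In this model, a pair of large nodes loses half its capacity in a single failure and must be built to twice the demand, while eight smaller nodes lose one eighth and need roughly 1.14 times the demand.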

Scale out

As networks get bigger, they get more complex. Fundamentally, SPs manage this complexity by separating infrastructure into different domains: access, aggregation, core, and so on. Data centers (DCs), however, are flat and cannot be separated in a similar fashion. Instead, architects generally manage the complexity by building layers, such as leaves and spines. As DCs get bigger, designers typically add more layers and more hierarchy. But to achieve the massive scale necessary to handle the explosion in traffic over the last 10 years, large cloud providers have implemented scale out thinking as well.

Modern data center leaf-spine architectures with IP fabric physical underlays are the prototypical examples of scale out designs in networking. Operators add capacity by installing a new rack of servers connected to the spine with a top of rack switch. The migration to more capacity is relatively seamless. The cloud hyperscalers have perfected these designs as a response to their most important requirement: efficient scale to handle hyper-growth.

As these scale out DC fabric designs performed admirably, the next question became: can we apply the same principles to core and wide area networks (WANs)? In contrast to scale up, scale out essentially means adding more network nodes in parallel and linking these nodes together so that they can collectively do the work of a much larger scaled up node. Nodes in scale out architectures are typically smaller and lower performance than the nodes you would find in scaled up networks. But several 1RU boxes with 400GbE interfaces quickly add up to the equivalent of a single, large modular chassis.
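A back-of-the-envelope calculation shows how quickly small boxes add up. The port counts below are assumptions for illustration only: a 1RU node with 32 x 400GbE ports and a modular chassis with 8 line cards of 36 x 400GbE ports each. Neither figure comes from a specific product, and the function name is hypothetical.

```python
import math

PORT_GBPS = 400  # 400GbE interfaces


def small_nodes_needed(target_gbps, ports_per_node, interconnect_ports=0):
    """Number of small nodes whose external-facing ports match target_gbps.

    interconnect_ports is the number of ports per node reserved for linking
    the nodes together; those ports carry no external traffic. Any additional
    spine-like nodes that only interconnect the others are not counted here.
    """
    usable_ports = ports_per_node - interconnect_ports
    if usable_ports <= 0:
        raise ValueError("no ports left for external traffic")
    return math.ceil(target_gbps / (usable_ports * PORT_GBPS))


# Hypothetical chassis: 8 line cards x 36 ports x 400 Gbps = 115,200 Gbps.
chassis_gbps = 8 * 36 * PORT_GBPS

print(small_nodes_needed(chassis_gbps, ports_per_node=32))
print(small_nodes_needed(chassis_gbps, ports_per_node=32, interconnect_ports=16))
```

On raw port capacity, nine such boxes match the assumed chassis. Reserving half of each box's ports for interconnect doubles that to 18, which is one concrete form of the extra nodes and links that scale out designs trade for finer-grained growth.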

The shift to scale out architectures in the core also requires a change in mindset. “Just be like Google,” or like another cloud hyperscaler, is a naïve admonishment occasionally and casually directed toward SPs, and it is unfair. Traditional SPs have billions of dollars of investment in the ground, much of which was designed for a different era. But the truth is that SPs can learn a lot from the hyperscalers, including basic hyperscaler network design principles such as:

Superior scale
Better resilience; smaller blast radius
Open, interchangeable components
Uniform, automated operations

Scale onward

To be clear, most SPs still don’t need the full Clos-style, massive scale out architectures that hyperscalers operate – the additional nodes, additional layers, and additional links all add up to massive new complexity. The hyperscalers effectively tame this complexity only through automation and an army of talented software engineers. The cloud hyperscalers have solved the tradeoff between network efficiency and business agility, but traditional SPs have yet to build up the required capabilities.

A middle way is, perhaps, optimal for the SP core. Call it limited scale out: multi-device nodes, rather than just two nodes, for added redundancy, plus a limited number of leaves for fan out. It is a “compromised scale out” that works for most SPs today. Over time the SP can add more leaves, resulting in a smoother evolution and transition path, both architecturally and organizationally, from where it is today to where it wants to be.

Remember, scale up vs. scale out is not an either/or discussion. Scale out certainly is not always “better.” It’s all about the right architecture for the job. Look for an infrastructure vendor that covers all network domains, with a portfolio that covers scale up and scale out use cases and 400GbE speeds. And most importantly, look for an infrastructure vendor that has worked extensively with the cloud hyperscalers to help them build out their massively scaled out networks and massively successful businesses.

Up and out: The dynamics of scale in core networks (Reader Forum)