Scale-up, scale-out, scale-across take center stage at OFC

Kannan Raj, AI infrastructure architect at Oracle, pulled back the curtain on the three dimensions of connectivity that define AI-era networking, and on how optical innovation is enabling a globe-spanning AI network fabric

One of the tech industry’s biggest worries was on full display at OFC in Los Angeles last week. AI is everywhere, and it has the potential to break the current data center fabric. 

“Back when the IEEE specs were formed and written, they said we need links to have a 2.4e-4 [pre-]FEC error [rate]. That by no means is acceptable today,” said Kannan Raj, AI infrastructure architect at Oracle, during a panel discussion. “We cannot deliver a healthy fabric with that kind of a specification.” 
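To give a sense of what that error-rate threshold means in practice, here is a rough, illustrative calculation (the 800 Gb/s line rate is an assumption for illustration, not a figure from the panel):

```python
# Back-of-the-envelope sketch: raw bit errors per second on a single
# link running right at the 2.4e-4 pre-FEC bit error rate threshold.
line_rate_bps = 800e9   # assumed 800 Gb/s link (illustrative)
pre_fec_ber = 2.4e-4    # pre-FEC bit error rate threshold cited above

# Expected raw errors per second = bits sent per second * error probability
errors_per_second = line_rate_bps * pre_fec_ber
print(f"{errors_per_second:.2e} raw bit errors per second")  # 1.92e+08
```

Forward error correction is designed to absorb that raw error rate on a single link; the concern voiced at the panel is what happens when millions of such links must stay healthy simultaneously.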

It was made amply clear that AI is not just another new application; it is forcing data centers to scale up, out, and across. That message was delivered at the heart of OFC, serving as a reminder that the network is evolving from the inside out.

“We are dealing with millions of links, millions of units, components, hardware. I call it the tyranny of large numbers,” Raj said. “When you have large numbers in operation, things will fail. Things will break.” He noted that the mean time to failure becomes especially short at that scale.

Despite these failures, the fabric must remain functional, as any single failure event can disrupt large training workloads, sending them back to the previous checkpoint and causing a huge loss of time and resources. Needless to say, when model parameters run into the billions and trillions, these small wrinkles amount to massive shockwaves across the training cycle. 
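The “tyranny of large numbers” can be sketched with some simple arithmetic. All of the figures below are illustrative assumptions, not numbers from the article:

```python
# Illustrative sketch: how per-component reliability collapses at fleet
# scale, and what a single failure costs a checkpointed training job.
num_links = 1_000_000            # assumed fleet size
mtbf_hours_per_link = 5_000_000  # assumed per-link mean time between failures

# With independent failures, fleet-wide MTBF shrinks linearly with count.
fleet_mtbf_hours = mtbf_hours_per_link / num_links
print(f"fleet-wide MTBF: {fleet_mtbf_hours:.1f} hours")  # 5.0 hours

# Each failure rolls training back to the last checkpoint; on average,
# half a checkpoint interval of work is lost across every GPU.
checkpoint_interval_min = 30     # assumed checkpoint cadence
gpus = 100_000                   # assumed cluster size
avg_lost_gpu_hours = (checkpoint_interval_min / 2) / 60 * gpus
print(f"average GPU-hours lost per failure: {avg_lost_gpu_hours:,.0f}")
```

Even with very reliable individual components, a million-link fabric under these assumptions fails every few hours, and each failure wastes tens of thousands of GPU-hours, which is why the fabric itself must tolerate faults rather than merely avoid them.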

Kannan Raj speaking at the panel discussion, “State of the Industry: Now and in 2031” at OFC in Los Angeles

The industry witnessed a similar inflection point a while back, when applications like video streaming, content delivery, and cloud services rose to prominence, prompting companies to make incremental upgrades to their networks to support higher bandwidth and lower latency requirements. Once again, AI workloads are demanding a change; only this time, it requires a fundamental reset.

Scale up, out, and across

Hyperscalers, neoscalers, and service providers are all looking at three types of connectivity within and between data centers to make resources available to AI workloads: scale-up, scale-out, and scale-across. 

Scale-up refers to connecting a large number of GPUs within the same compute cluster. Adding resources to the existing system, or moving workloads to a beefier one, gives workloads the extra resources they need. Being inside the box, scale-up offers some unique advantages, including ultra-low latency and lossless connectivity. 

Scale-out is a step up from the scale-up model. Here, clusters are connected across multiple racks to achieve parallelism by grouping machines. A preferred option for many resource-hungry applications, it yields better performance by breaking the physical limitations of servers and chassis. On the flip side, communication is now more network-dependent, which means the fabric must suffer minimal degradation, as AI workloads are intolerant of performance dips. 

Scale-across pushes workloads beyond individual data centers. It involves interconnecting multiple data centers spanning geographies to create what are now called “AI factories.” This giga-scale cluster of data centers functions as a single logical unit, with thousands of interconnected GPUs communicating across campuses and regions. 

“Scale-up is highly localized,” Raj explained. “It’s synchronous, message passing type, low latency…Scale-out is basically within the pod. So it’s suitable for running inference. Scale-across can vary [depending] on who you talk to. It can be 10 kilometers to thousands of kilometers.”

These architectures define the core framework of networking in the AI era. 

Optical solutions and interconnects enabling modern network architectures

To make these new architectures possible, newer optical technologies and higher-caliber interconnects are required to provide high-density, low-latency, long-reach connectivity. This is where technologies like linear pluggable optics, coherent optics, co-packaged optics, and multi-rail amplifier huts enter the discussion. 

In high-capacity optical transport, 400G, 800G, 1.6T and beyond offer high spectral efficiency, support for longer distances, and low-power, space-efficient form factors. 

“Scale-up today is mostly…copper and it is getting to a point where there will be some hybrid optical-copper essentially, and then transitioning to optical also. But scale-out is either a DR [Direct Reach] link or an FR [Far Reach] link and so that is where the pluggables play a big part,” Raj said.

“The type of optics that gets used for scale-across can span a wide range,” he added. “It could be FR optics, it could be coherent light, or it could be ZR [Extended Reach] optics. So, scale-across has got more deployment considerations there.”

However, Raj pointed out that the distinction between the architectures is gradually fading, making resiliency the ultimate factor. “The distinction between a scale-up and scale-out, they are blurring quite a bit. And, again, in all of this, we want to make sure that there is resiliency…multiple racks constitute a scale-up network right now.”

Raj also discussed multi-planar network fabrics, which are becoming increasingly important for building ultra-large-scale AI clusters. This architecture merges multiple independent Clos fabrics into one logical network to support massive-scale AI clusters.

“Multi-planar is actually a flatter topology, larger radix. It’s still a two-tier Clos….Every GPU is basically communicating to every other GPU…it’s a way of dividing and conquering on resources that are available, but it’s basically allowing us to expand the network to make a larger domain here,” he noted.
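The relationship Raj describes between switch radix, a two-tier Clos, and multiple planes can be sketched with some simple sizing arithmetic. The radix and plane count below are illustrative assumptions:

```python
# Sketch (illustrative, not from the talk): endpoint capacity of a
# non-blocking two-tier (leaf-spine) Clos fabric, and the effect of
# running several independent planes of that fabric in parallel.
def two_tier_clos_endpoints(radix: int) -> int:
    """Each leaf splits its ports half down (to GPUs/NICs) and half up
    (to spines); a spine with `radix` ports can serve `radix` leaves,
    so the fabric supports radix^2 / 2 endpoints."""
    leaves = radix
    endpoints_per_leaf = radix // 2
    return leaves * endpoints_per_leaf

radix = 128  # assumed switch radix
plane_size = two_tier_clos_endpoints(radix)
print(f"one plane: {plane_size:,} endpoints")  # one plane: 8,192 endpoints

# Multi-planar: each GPU attaches one port to each of N independent
# planes, multiplying aggregate bandwidth and fault isolation without
# adding a third switching tier (which would add hops and latency).
planes = 4  # assumed plane count
print(f"{plane_size:,} endpoints x {planes} planes of bandwidth")
```

A larger radix thus flattens the topology quadratically, and parallel planes let the domain grow further while keeping each plane a simple two-tier Clos, which is the “divide and conquer” Raj alludes to.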

“RDMA [Remote Direct Memory Access] unlike TCP is very very unforgiving when it comes to the performance of the network. So we have to make sure that there is robust interoperability. We have a lot of issues related to link flaps and so on. There are deterministic link flaps [and] nondeterministic link flaps. So the industry has to pay attention to how to eliminate those kinds of problems,” he said.

ABOUT AUTHOR

Sulagna Saha
Sulagna Saha is a technology editor at RCR. She covers network test and validation, AI infrastructure assurance, fiber optics, non-terrestrial networks, and more on RCR Wireless News. Before joining RCR, she led coverage for Techstrong.ai and Techstrong.it at The Futurum Group, writing about AI, cloud and edge computing, cybersecurity, data storage, networking, and mobile and wireless. Her work has also appeared in Fierce Network, Security Boulevard, Cloud Native Now, DevOps.com and other leading tech publications. Based out of Cleveland, Sulagna holds a Master's degree in English.