Two modes of AI activity, training and inference, are pushing the limits of data center fabrics and driving the need for next-generation testing
“Connectivity is becoming the lynchpin for AI scaling,” said Stephen Douglas, Head of Market Strategy, Spirent, at the recent RCR AI Infrastructure Forum.
For years, the network fabrics inside data centers were built for relatively predictable traffic flows. Testing this infrastructure meant validating performance against those known patterns and loads. But as AI takes over, it’s rewriting the rules of testing.
Atypical behavior
“Traditionally, [data center networks] have been designed for high-performance compute architectures. You’re now seeing them evolving from that traditional three-tier fat tree topology to a more streamlined and efficient dedicated back-end architecture,” said Douglas.
The new two-tier spine-leaf architecture is flatter and more streamlined than the traditional three-tier topology, and therefore requires fewer hops, reducing latency. It delivers consistently high throughput and lossless communication, making it a better fit for AI.
“This is [required] due to the radically different traffic being generated by workloads,” Douglas said. “AI workloads generate highly parallel bidirectional and bandwidth-intensive flows with very very strict latency and synchronization requirements.”
In classical environments, traffic patterns are largely deterministic. Engineers can anticipate where congestion might occur and rightsize the network to avoid bottlenecks. By contrast, AI training and inference introduce dynamic, non-deterministic communication flows characterized by massive burstiness and latency sensitivity.
AI training is highly east-west intensive, owing to continuous, distributed server-to-server communication. Moving massive datasets across the fabric requires ultra-high throughput and zero packet loss.
“Even minor losses can disrupt synchronization and actually degrade the accuracy of the whole training process,” Douglas said.
Inference traffic is equally demanding, but in a slightly different way. It requires high connection rates and concurrency to support the millions of devices and applications querying the AI models in real time. “The transaction volumes and sizes vary widely depending on the complexity of each request, leading to bursty and unpredictable intensity spikes,” he noted.
These data exchanges between compute nodes often trigger problems such as poor GPU utilization, compromised training integrity, buffer overflows, and reduced throughput that slows response times.
A more rigorous testing approach
The Achilles’ heel of today’s data center architecture is connectivity. “Connectivity is becoming the lynchpin for AI scaling,” Douglas said.
To address the challenge, Douglas argues, testing must become a critical enabler of this fabric.
As AI clusters scale to hundreds and thousands of GPUs and specialized accelerators, testing the performance of the Ethernet fabric, its interconnects, and the behavior of RoCEv2 ensures that the fabric can feed data to the GPUs at high speed. Benchmarking also provides insight into metrics such as throughput, loss, congestion response, and how the fabric handles the microburst behavior of AI workloads.
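To make those metrics concrete, here is a minimal Python sketch of how throughput, loss, and flow-control activity for a single fabric port might be derived from two counter snapshots taken during a load test. The counter names and sampling mechanism are hypothetical; a real test would pull them from switch telemetry or a traffic generator’s API.

```python
# Minimal sketch: derive throughput and loss figures for one fabric port
# from two counter snapshots taken during a RoCEv2 load test.
# Counter names and the sampling mechanism are hypothetical.

from dataclasses import dataclass

@dataclass
class PortCounters:
    tx_bytes: int          # bytes transmitted on the port
    rx_bytes: int          # bytes received on the port
    rx_dropped: int        # frames dropped (e.g., buffer exhaustion)
    rx_frames: int         # total frames received
    pfc_pause_frames: int  # priority flow control pauses seen

def fabric_metrics(before: PortCounters, after: PortCounters, seconds: float) -> dict:
    """Compute headline metrics for one port over a test interval."""
    rx_frames = after.rx_frames - before.rx_frames
    dropped = after.rx_dropped - before.rx_dropped
    return {
        "throughput_gbps": (after.rx_bytes - before.rx_bytes) * 8 / seconds / 1e9,
        "loss_rate": dropped / max(rx_frames + dropped, 1),
        "pfc_pauses": after.pfc_pause_frames - before.pfc_pause_frames,
    }

# Example: a 10-second interval, roughly 360 GB received, no drops.
start = PortCounters(0, 0, 0, 0, 0)
end = PortCounters(0, 360_000_000_000, 0, 300_000_000, 12)
print(fabric_metrics(start, end, 10.0))
# -> {'throughput_gbps': 288.0, 'loss_rate': 0.0, 'pfc_pauses': 12}
```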
Performance testing of the collective communication libraries that implement the collective and point-to-point routines used in multi-GPU, multi-node training is also important, to ensure that scaling and convergence times are not adversely affected.
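As one illustration of that kind of benchmarking, the sketch below times an all_reduce collective with PyTorch, assuming an NCCL-backed cluster launched with torchrun; the tensor size and iteration count are illustrative rather than a standard benchmark.

```python
# Sketch of timing an all_reduce collective across GPUs, assuming a
# PyTorch + NCCL setup launched with torchrun, e.g.:
#   torchrun --nproc_per_node=8 bench.py
import os
import time
import torch
import torch.distributed as dist

def bench_all_reduce(num_elems: int = 64 * 1024 * 1024, iters: int = 20) -> None:
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    x = torch.ones(num_elems, dtype=torch.float32, device="cuda")

    # Warm up so NCCL channel setup is excluded from the measurement.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    if rank == 0:
        gb_moved = num_elems * 4 * iters / 1e9  # payload per rank, not bus traffic
        print(f"all_reduce avg: {elapsed / iters * 1e3:.2f} ms, "
              f"~{gb_moved / elapsed:.1f} GB/s payload rate")
    dist.destroy_process_group()

if __name__ == "__main__":
    bench_all_reduce()
```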
A big part of effective network management is congestion control, which keeps data flowing smoothly without overloading the network. Validating the fabric under heavy, bursty loads is critical to preventing buffer overruns.
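The toy simulation below is one way to see the problem: a fixed-size buffer drained at line rate absorbs a smooth load comfortably, yet drops packets under microbursts that carry the same average load. All numbers are illustrative, not derived from any real switch.

```python
# Toy simulation, not a test tool: a fixed-size switch buffer fed by bursty
# arrivals and drained at line rate. It shows why average load can look fine
# while microbursts still overflow the buffer.
import random

BUFFER_PKTS = 2000       # buffer capacity in packets
DRAIN_PER_TICK = 100     # packets the port can serialize per tick
TICKS = 10_000

def run(burst_prob: float, burst_size: int, base_load: int) -> float:
    random.seed(1)
    queue, dropped, offered = 0, 0, 0
    for _ in range(TICKS):
        arrivals = base_load + (burst_size if random.random() < burst_prob else 0)
        offered += arrivals
        queue += arrivals
        if queue > BUFFER_PKTS:          # overflow: excess packets are lost
            dropped += queue - BUFFER_PKTS
            queue = BUFFER_PKTS
        queue = max(0, queue - DRAIN_PER_TICK)
    return dropped / offered

# Same average offered load (~90 packets/tick), very different loss behavior.
print(f"smooth load : {run(0.0, 0, 90):.4%} loss")
print(f"bursty load : {run(0.01, 3000, 60):.4%} loss")
```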
Job completion time (JCT) and tail latency are other areas where testing is essential. “They reveal the real business impacts since overall progress in the training is gated by the slowest GPU worker in that sync cycle,” he said.
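A small made-up example shows why: in synchronous training, each step completes only when the slowest worker completes, so rare stragglers inflate both tail latency and overall job completion time. The worker counts and timings below are invented for illustration.

```python
# Illustration of how stragglers gate synchronous training: step time is the
# max, not the mean, across workers. All timings are made up.
import random
import statistics

random.seed(0)
WORKERS, STEPS = 512, 1000

step_times = []
for _ in range(STEPS):
    # Most workers take ~100 ms; a rare straggler takes much longer.
    per_worker = [random.gauss(100, 5) for _ in range(WORKERS)]
    if random.random() < 0.05:                 # occasional network hiccup
        per_worker[random.randrange(WORKERS)] += 400
    step_times.append(max(per_worker))         # sync step gated by the slowest

mean_worker_time = 100  # the "average" view of the cluster
p99 = statistics.quantiles(step_times, n=100)[98]
print(f"mean worker time ~{mean_worker_time} ms")
print(f"actual mean step time {statistics.mean(step_times):.1f} ms, p99 {p99:.1f} ms")
print(f"job completion time {sum(step_times) / 1000:.1f} s vs "
      f"{mean_worker_time * STEPS / 1000:.1f} s if no stragglers mattered")
```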
Lastly, encryption is a critical component of AI’s east-west traffic. Douglas recommends testing to ensure that cryptographic overhead does not reduce the effective bandwidth of the fabric or add to GPU training and inference times.
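As a rough illustration of quantifying that overhead, the sketch below times bulk AES-GCM encryption on the host using the third-party cryptography package. Real fabrics typically offload encryption to the NIC (for example via IPsec or MACsec), so this only shows the shape of such a measurement, not a production method.

```python
# Rough host-side sketch of measuring bulk-encryption cost with AES-GCM,
# using the third-party 'cryptography' package (pip install cryptography).
import os
import time
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

PAYLOAD_MB = 256
key = AESGCM.generate_key(bit_length=256)
aead = AESGCM(key)
payload = os.urandom(PAYLOAD_MB * 1024 * 1024)
nonce = os.urandom(12)

start = time.perf_counter()
ciphertext = aead.encrypt(nonce, payload, None)
elapsed = time.perf_counter() - start

gbps = PAYLOAD_MB * 8 / 1024 / elapsed
print(f"encrypted {PAYLOAD_MB} MB in {elapsed * 1e3:.1f} ms (~{gbps:.1f} Gb/s per core)")
```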
There was a time, not long ago, when hyperscaler data centers were the primary home of AI infrastructure buildouts. As AI spreads through industries, that world is fast changing, giving way to a market where smaller, specialized cloud providers, such as neoclouds and sovereign AI factories, play strong supporting roles in the AI supercycle. In this new reality, AI workloads live across many networks, making testing an imperative across infrastructures.
“One thing that is clear from the early implementations of AI architectures is that AI traffic is fundamentally different from conventional network traffic, and this is a big reason why testing is so critical,” Douglas said.
