Two modes of AI activity are pushing the limits of data center fabrics, driving the need for next-gen testing
“Connectivity is becoming the lynchpin for AI scaling,” said Stephen Douglas, Head of Market Strategy at Spirent, during a session at the recent RCR AI Infrastructure Forum.
For years, the network fabrics inside data centers were built for relatively predictable traffic flows. Testing this infrastructure meant validating performance against those known patterns and loads. But as AI takes over, it’s rewriting the rules of testing.
Atypical behavior
“Traditionally, [data center networks] have been designed for high-performance compute architectures. You’re now seeing them evolving from that traditional three-tier fat tree topology to a more streamlined and efficient dedicated back-end architecture,” Douglas noted.
The new two-tier spine-leaf architecture is flatter and more streamlined than the three-tier topology it replaces. It requires fewer hops, which reduces latency, and provides consistently high throughput and lossless communication, all in all a better fit for AI.
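The latency benefit of the flatter design can be shown with a toy hop-count model. This is an illustrative sketch only; the path lengths and per-switch latency below are assumptions, not figures from the article.

```python
# Toy model: fewer switch hops in a two-tier leaf-spine fabric
# versus a classic three-tier topology. All numbers are illustrative.

PER_HOP_LATENCY_US = 1.0  # assumed per-switch latency in microseconds

# Worst-case server-to-server switch hops:
# three-tier: access -> aggregation -> core -> aggregation -> access
THREE_TIER_HOPS = 5
# leaf-spine: leaf -> spine -> leaf
LEAF_SPINE_HOPS = 3

def path_latency(hops, per_hop_us=PER_HOP_LATENCY_US):
    """Latency contributed by switching alone, in microseconds."""
    return hops * per_hop_us

print(path_latency(THREE_TIER_HOPS))  # 5.0 us of switching latency
print(path_latency(LEAF_SPINE_HOPS))  # 3.0 us
```

Fewer traversals also means fewer queues where congestion can build, which is part of why the flatter fabric behaves more predictably under load.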
The architectural change accommodates AI workloads’ very specific behavior. “AI workloads generate highly parallel, bidirectional and bandwidth-intensive flows with very strict latency and synchronization requirements,” he said.
In classical environments, traffic patterns are largely deterministic. Engineers can anticipate where congestion could occur and rightsize the network accordingly to avoid bottlenecks. By contrast, AI training and inference introduce dynamic, non-deterministic communication flows characterized by massive burstiness and latency sensitivity.
The continuous, distributed server-to-server communication makes AI training highly east-west intensive. Transferring the massive datasets demands ultra-high throughput and zero packet loss.
“Even minor losses can disrupt synchronization and actually degrade the accuracy of the whole training process,” Douglas said.
Inference traffic is equally demanding, but in a slightly different way. It requires high connection rates and concurrency to support the millions of devices and applications querying the models in real time. “The transaction volumes and sizes vary widely depending on the complexity of each request, leading to bursty and unpredictable intensity spikes,” he noted.
These kinds of data exchanges between compute nodes often trigger issues such as inefficient GPU utilization, training integrity problems, buffer overflows, and reduced throughput that leads to slower response times.
Testing optimized for AI
The Achilles heel of data center architecture today is connectivity. If the network is not fast enough or stable enough to keep data moving steadily between GPUs in real-world conditions, the effects can ripple outward — from degraded user experience to sunk business investments that run into millions.
To address the challenge, data centers are investing in ultra-high-speed Ethernet technologies pushing toward 1.6T speeds and beyond, along with expensive software to orchestrate it all. But Douglas argues that testing is what unlocks the full performance of this fabric.
As AI clusters scale to hundreds or thousands of GPUs and specialized accelerators, testing the performance of the Ethernet fabric — its interconnects, congestion characteristics, and the behavior of RoCEv2 — becomes critical to service continuity. Stress-testing this fabric surfaces bottlenecks, vulnerabilities, and usage patterns that let engineers anticipate what could break the network. Benchmarking, in turn, provides visibility into metrics such as the fabric’s throughput, loss, congestion response, and handling of AI workloads’ microburst behavior.
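The benchmark metrics named above are ultimately derived from raw test-run counters. A minimal sketch of that arithmetic, with counter values that are purely illustrative assumptions:

```python
# Sketch: deriving basic fabric benchmark metrics from raw counters
# collected during a test run. All values below are illustrative.

def loss_rate(tx_packets, rx_packets):
    """Fraction of transmitted packets that never arrived."""
    return (tx_packets - rx_packets) / tx_packets

def throughput_gbps(rx_bytes, duration_s):
    """Achieved receive throughput in gigabits per second."""
    return rx_bytes * 8 / duration_s / 1e9

# Example run: 10 billion packets sent, 9.999 billion received,
# 1.5 TB delivered across the fabric over a 10-second window.
print(loss_rate(10_000_000_000, 9_999_000_000))  # 0.0001 -> 0.01% loss
print(throughput_gbps(1.5e12, 10))               # 1200.0 Gbps aggregate
```

Even a loss rate this small matters here, given the article’s point that minor losses can disrupt training synchronization.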
A big part of effective network management is congestion control, which ensures data flows smoothly without overloading the network. With AI traffic, bursts are as sudden as they are intense. Validating the network under heavy, bursty loads shows where buffer overruns would occur so they can be prevented, avoiding performance issues and, in some cases, downstream security risks.
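Why the same average load can be harmless or harmful depending on burstiness can be seen in a minimal queue simulation. This is a toy sketch, not a real traffic generator; buffer size, drain rate, and arrival patterns are assumptions.

```python
# Minimal sketch: a switch egress buffer fed by per-tick arrivals.
# All sizes and rates are illustrative assumptions.

def simulate_buffer(arrivals, drain_per_tick, buffer_capacity):
    """Return (max_occupancy, dropped) for a sequence of per-tick arrivals."""
    occupancy = 0
    max_occupancy = 0
    dropped = 0
    for arriving in arrivals:
        occupancy += arriving
        if occupancy > buffer_capacity:
            dropped += occupancy - buffer_capacity  # buffer overrun: data lost
            occupancy = buffer_capacity
        occupancy = max(0, occupancy - drain_per_tick)
        max_occupancy = max(max_occupancy, occupancy)
    return max_occupancy, dropped

# Same average load (100 units over 10 ticks), different burstiness:
smooth = [10] * 10
bursty = [0, 0, 0, 0, 100, 0, 0, 0, 0, 0]  # one microburst

print(simulate_buffer(smooth, drain_per_tick=10, buffer_capacity=50))  # (0, 0)
print(simulate_buffer(bursty, drain_per_tick=10, buffer_capacity=50))  # (40, 50)
```

The smooth pattern never queues at all, while the microburst overruns the buffer and drops half its data — which is exactly what validation under bursty load is meant to expose before production traffic does.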
Testing also extends to the collective communication libraries that underpin multi-GPU and multi-node training. These libraries implement all-reduce, all-to-all, and point-to-point communication routines that are unique to AI traffic. Any inefficiency here can directly affect scaling and convergence times.
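As a rough illustration of what an all-reduce routine actually computes — a deliberately naive sketch, not how NCCL or any production library implements it:

```python
# Simplified all-reduce: every worker ends up holding the element-wise
# sum of all workers' gradient vectors. Production libraries use ring
# or tree algorithms to do this with far less data movement.

def all_reduce_sum(worker_grads):
    """Naive all-reduce: gather all vectors, sum, give everyone the result."""
    total = [sum(vals) for vals in zip(*worker_grads)]
    return [list(total) for _ in worker_grads]

grads = [
    [1.0, 2.0],  # worker 0's local gradients
    [3.0, 4.0],  # worker 1
    [5.0, 6.0],  # worker 2
]
print(all_reduce_sum(grads))  # every worker holds [9.0, 12.0]
```

Because every worker must exchange data with every other on each step, any fabric inefficiency in these routines compounds across the whole cluster, which is why they are a testing target in their own right.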
Job completion time (JCT) and tail latency are other metrics where testing is essential; they have emerged as key indicators of system health. “They reveal the real business impacts, since overall progress in the training is gated by the slowest GPU worker in that sync cycle,” Douglas said.
Lastly, for AI’s east-west traffic, encryption is a critical component. Douglas recommends testing cryptographic performance to ensure that encryption overheads do not erode effective bandwidth or add to training and inference times.
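A back-of-envelope way to reason about that erosion: if inline encryption cannot keep pace with the link, the crypto engine, not the wire, gates the flow. The rates below are assumptions for illustration, not measured figures.

```python
# Toy model of how a finite crypto rate erodes effective bandwidth.
# Link and crypto throughput figures are illustrative assumptions.

def effective_bandwidth_gbps(line_rate_gbps, crypto_throughput_gbps):
    """The slower of the link and the inline crypto engine gates the flow."""
    return min(line_rate_gbps, crypto_throughput_gbps)

def overhead_fraction(line_rate_gbps, crypto_throughput_gbps):
    """Fraction of raw link bandwidth lost to the crypto bottleneck."""
    eff = effective_bandwidth_gbps(line_rate_gbps, crypto_throughput_gbps)
    return 1 - eff / line_rate_gbps

# A 400G link whose crypto engine sustains only 300 Gbps:
print(effective_bandwidth_gbps(400, 300))  # 300
print(overhead_fraction(400, 300))         # 0.25 -> 25% of bandwidth lost
```

Testing cryptographic throughput against the link rate, as Douglas suggests, is what reveals whether this hidden tax exists before it shows up as longer training and inference times.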
There was a time in the not-too-distant past when hyperscaler data centers were the primary bastion of AI infrastructure buildouts. As AI trickles through industries, that world is fast changing, giving way to a market where smaller, specialized cloud providers like neoclouds and sovereign AI factories play strong supporting roles in the AI supercycle. In this new reality, AI workloads live everywhere, making testing a cross-infrastructure imperative.
And as Douglas said, “One thing is clear from the early implementations of the AI architectures: AI traffic is fundamentally different from conventional network traffic, and this is a big reason why testing is so critical.”
