Why the bottleneck always moves, why economics is an equal partner, and why capital recovery in AI infrastructure now depends on workload liquidity.

Every infrastructure supercycle has two engines, not one. The technical engine is bottleneck migration. Workloads expose a constrained layer, capital chases it, balance restores, and the next layer becomes binding. The economic engine is an asset class. How the dominant unit of compute is financed, depreciated, utilized, and amortized determines who builds it, who rents it, and where margin accumulates. Most analyses focus on the first engine and treat the second as a postscript.

But the CPU era and the AI era diverge as much on the economic axis as on the technical one, and the two divergences reinforce each other. This article works through both.

The technical pattern is sequential bottleneck migration in the CPU cycle and concurrent co-design in the AI cycle. That’s one half of the argument. The economic pattern is CPU servers as a balanced capital and operating asset, and GPU servers as a capital-dominated asset whose useful life depends on a cascading workload mix. That’s the other half. The third element, which turns out to be where the two engines meet, is the serving and orchestration layer that determines whether the asset’s accounting depreciation actually converts into realized cash flow.

I’ll flag uncertainty where numbers are ranges rather than facts, and call out where the thesis depends on assumptions that may not hold.

Part 1 | The bottleneck migration law

Imagine a kitchen with one chef, one prep cook, and one dishwasher. Chef plates 60 dishes/hour. Prep cook preps 60/hour. Dishwasher cleans 60/hour. Throughput: 60.

Add a second chef. Throughput doesn’t double. It stays at 60. The prep cook is the bottleneck now. Add a second prep cook. Throughput briefly hits 120, and then the dishwasher chokes.

This is Amdahl’s Law (overall speedup is bounded by the part you did not accelerate) combined with queueing theory (when a stage saturates, work piles up in front of it). Together they yield the operative corollary: when you accelerate one stage faster than the others, you create a forcing function and a capital opportunity at the next stage downstream.

Throughput per dollar invested in the fast stage actually falls until the downstream stage catches up. The market has no choice. It must spend on the bottleneck. That’s not strategy. It’s arithmetic. This is the engine of every infrastructure supercycle.
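The kitchen arithmetic can be sketched in a few lines. This is a minimal model that assumes a strictly serial pipeline whose sustained throughput equals the capacity of its slowest stage; the function names are illustrative, not from any library.

```python
def pipeline_throughput(stages):
    """Sustained throughput of a serial pipeline: the slowest stage's capacity."""
    return min(stages.values())


def bottleneck(stages):
    """Name of the binding stage (first one listed wins a tie)."""
    return min(stages, key=stages.get)


# Dishes per hour each stage can handle, as in the kitchen example.
kitchen = {"chef": 60, "prep": 60, "dishwasher": 60}
base = pipeline_throughput(kitchen)

# Add a second chef: chef capacity doubles, output does not.
kitchen["chef"] = 120
after_chef = pipeline_throughput(kitchen)
dishes_per_chef = after_chef / 2  # output per unit of the accelerated stage

# Add a second prep cook: the dishwasher is now the binding stage.
kitchen["prep"] = 120
after_prep = pipeline_throughput(kitchen)

print(base, after_chef, dishes_per_chef, after_prep, bottleneck(kitchen))
```

In this model the brief spike to 120 in the story is not sustained output. It is work piling up in front of the dishwasher, which is the queueing half of the argument.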

There’s a corollary on the economic side that gets less attention. Whichever component dominates the cost stack of the dominant compute unit also dominates the business model around it. When capital is cheap and operating costs matter, the optimization target is utilization through time-sharing.

When capital is expensive and depreciation is uncertain, the optimization target is what we might call workload liquidity, which is the ability to continuously match changing workloads to changing hardware generations across training, prefill, decode, batch, and lower-priority inference. These produce entirely different industry structures.

Part 2 | The CPU-era supercycle ran sequentially, and rented well

Technical pattern: three waves in sequence

A 2005 production server typically had two single-core sockets at 3 GHz. Dual-socket was already standard. By 2009 the same 2U chassis carried two quad-core Nehalem sockets with integrated memory controllers, hyperthreading, and hardware virtualization. Per-server effective compute rose roughly 8 to 15 times. Consolidation via VMware ESX often pushed effective utility 20× or higher.

The 1 Gbps NIC (≈125 MB/s) and 7,200 RPM SATA drive (≈80 MB/s sequential) hadn’t moved meaningfully. Pre-2005 workloads tolerated this because they were largely latency-bound, traffic was serialized, and the asset cost was low enough that fractional utilization didn’t matter.

What changed wasn’t that the imbalance appeared. It’s that the imbalance became intolerable once consolidated workloads ran in parallel. Once virtualization put 10 to 20 VMs on a single box, east-west traffic exploded. A rack of 20 hosts × 10 VMs × 50 Mbps peer traffic ≈ 10 Gbps per rack, against 1 Gbps uplinks oversubscribed 4:1. Between 2009 and 2013, 10GbE became the top-of-rack standard. Arista went public on this wave.
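The rack-level figure is simple multiplication, sketched below with the paragraph's own inputs. The comparison against a single 1 Gbps link is a simplifying assumption for illustration and does not model the oversubscription ratio.

```python
# Inputs taken from the paragraph above.
hosts_per_rack = 20
vms_per_host = 10
peer_mbps_per_vm = 50

# Aggregate east-west demand, converted from Mbps to Gbps.
per_host_gbps = vms_per_host * peer_mbps_per_vm / 1000
rack_gbps = hosts_per_rack * per_host_gbps

# Assumption: compare demand against one 1 Gbps link.
link_gbps = 1.0
demand_to_link_ratio = rack_gbps / link_gbps

print(f"per host: {per_host_gbps} Gbps, per rack: {rack_gbps} Gbps, "
      f"ratio to one 1 Gbps link: {demand_to_link_ratio:.0f}x")
```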

With compute and network rebalanced, spinning disk became the visible bottleneck. Enterprise SSDs went from exotic to standard. Pure Storage IPO’d in 2015. NVMe became default by 2016.

Three waves, one engine, in sequence. Each layer was solved before the next became binding. The CPU cycle could be solved sequentially because the layers were loosely coupled. A faster NIC didn’t require redesigning the CPU. A faster SSD didn’t require redesigning the network.

Economic pattern: a balanced cost stack made cloud rental work

Here’s the part most technical analyses skip. The cloud business model of pay-as-you-go OpEx instead of CapEx wasn’t a marketing innovation. It was the economic consequence of one specific property of CPU servers. Capital cost and operating cost were roughly the same order of magnitude.

The TCO breakdown of a typical CPU server makes this concrete: about $301/month in capital costs versus $220/month in hosting. Capital is 58% of TCO, operating is 42%. They’re in the same range.

That balance has two consequences. First, fractional utilization still pencils out. If I rent a server’s capacity to 10 customers each using it 10% of the time on average, I can still charge each less than the full carrying cost and make a margin. Second, operating efficiency and capital efficiency mattered roughly equally. Hyperscalers competed on power costs, cooling design, network architecture, and software efficiency just as hard as they competed on financing terms.

The cloud model was, at its foundation, a financial-engineering response to an asset cheap enough to time-share.

AWS could buy CPU servers at $10K each, run them at 50% utilization across thousands of tenants, and earn returns through operating leverage. The asset was patient. Depreciation periods stretched to 5 or 6 years comfortably. Refresh cycles were forgiving.

The entire industry structure of hyperscaler intermediation, multi-tenant cloud, and OpEx-financed enterprise IT followed from the economics of a balanced cost stack.

Part 3 | The AI-era supercycle runs concurrently, and depends on workload liquidity

Technical pattern: co-design replaces sequence

The trigger is transformer-based workloads. The forcing function (Amdahl plus queueing) is the same. But two structural things changed that the CPU-era mental model gets wrong.

The layers are more tightly coupled than in the CPU era, and effective design requires coordinated control over them.

Consider the following.

– Silicon design depends on fabric topology, because collective communication latency determines effective FLOPS per chip.

– Fabric design depends on parallelism strategy, because tensor, pipeline, expert, and data parallelism each have different traffic patterns.

– Rack design depends on power and cooling envelope, because thermal density limits interconnect length, which determines achievable bandwidth.

– Orchestration design depends on hardware heterogeneity, because workloads route across TPUs, GPUs, and custom accelerators with different memory models.

This doesn’t mean only full-stack owners can compete. The NVIDIA + ODM + hyperscaler ecosystem is itself a co-design model, with reference architectures, validated configurations, and shared roadmaps doing some of the integration work that vertical owners do internally. The real distinction is degree of control over workload, scheduler, silicon roadmap, fabric, and demand aggregation.

Google (full vertical), Meta (custom silicon + own fabric), and AWS (Trainium + Nitro + own networking) sit at one end. CoreWeave and Crusoe (sophisticated operators on NVIDIA reference platforms) sit in the middle. Smaller buyers consuming hyperscaler SKUs sit at the other end. Position on this spectrum matters more than full-stack ownership.

The workload is splitting into two architectural archetypes. In the CPU era, web serving, databases, batch analytics, and video encoding all had different shapes. But x86 was flexible enough to absorb them all at acceptable efficiency. Volume mattered more than peak efficiency, so one architecture won. That logic is breaking, because the efficiency gap between general-purpose and specialized silicon has grown wide enough that “good enough at everything” loses to “purpose-built for one phase.”

The clearest evidence is the disaggregation of inference into prefill and decode phases. The academic literature has been clear for two years. Splitwise from Microsoft Research demonstrated 1.4× higher throughput at 20% lower cost via disaggregation, and DistServe, Helix, and AccelGen extended the framework. NVIDIA Dynamo made disaggregated inference a first-class framework primitive.

The most striking validation came at GTC 2026. NVIDIA had announced Rubin CPX in September 2025 as a purpose-built prefill accelerator (GDDR7 instead of HBM, compute-heavy). By GTC 2026 six months later, Rubin CPX was off the roadmap. It was replaced by something more significant: NVIDIA paid roughly $20 billion to license Groq’s SRAM-based LPU technology, and is shipping the Groq 3 LPX rack in Q3 2026 as the dedicated decode accelerator in the Vera Rubin platform.

Rubin GPUs handle prefill. Groq LPUs handle decode. Dynamo orchestrates between them. NVIDIA, which had every incentive to keep inference on its own GPUs, paid $20B to architect bifurcation into its flagship platform. This is the strongest possible validation that prefill and decode are different enough to demand different silicon.

The physics is well understood. Prefill is compute-bound and parallel. It does heavy matrix math at large batches, with bandwidth a secondary concern. It looks a lot like training. Decode is sequential and memory-bandwidth-bound. It generates one token at a time, must read the entire model and KV cache per token, and is often forced to batch size 1 for latency. On an H100, decode wastes most of the chip’s compute capacity waiting for HBM reads. Groq’s LPU addresses this by putting 512 MB of SRAM on-die delivering 150 TB/s of bandwidth, which is roughly 45× the H100’s 3.35 TB/s of HBM3 bandwidth.
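A back-of-envelope roofline check illustrates why decode leaves compute idle. The H100 figures (1,979 dense FP8 TFLOPS, 3.35 TB/s) come from this article. The 70-billion-parameter model, one byte per parameter, and the common rule of thumb of about 2 FLOPs per parameter per generated token are assumptions chosen for illustration, and the KV cache is ignored.

```python
def decode_step_times(params, bytes_per_param, peak_flops, mem_bw_bytes_per_s):
    """Time to stream the weights once vs. time to do the math, for one token."""
    mem_time = params * bytes_per_param / mem_bw_bytes_per_s
    compute_time = 2 * params / peak_flops  # ~2 FLOPs per parameter per token
    return mem_time, compute_time


PARAMS = 70e9          # hypothetical model size (assumption)
BYTES_PER_PARAM = 1    # FP8 weights (assumption)
H100_FLOPS = 1979e12   # dense FP8 FLOP/s
H100_BW = 3.35e12      # HBM3 bytes/s

mem_t, cmp_t = decode_step_times(PARAMS, BYTES_PER_PARAM, H100_FLOPS, H100_BW)

# At batch size 1 the slower of the two sets the per-token latency.
step_time = max(mem_t, cmp_t)
tokens_per_s = 1 / step_time
compute_busy = cmp_t / step_time  # fraction of the step the math units are needed

print(f"memory time {mem_t * 1e3:.1f} ms, compute time {cmp_t * 1e3:.3f} ms")
print(f"ceiling {tokens_per_s:.0f} tokens/s, compute busy {compute_busy:.2%}")
```

Under these assumptions the memory read takes a few hundred times longer than the arithmetic, so the step is bandwidth-bound by a wide margin. Larger batches amortize the weight read across requests, which is why the batch-size-1 case is the painful one.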

So the workload is separating into two architectural archetypes:

– Compute-dense: training and inference prefill. Heavy FLOPS, large batches, bandwidth-secondary.

– Memory-bandwidth-dense: inference decode, especially agentic and reasoning. Sequential, low batch size, latency-critical.

Fine-tuning and RL are workload variants, though RL has surfaced a third-order effect. Frontier labs are running out of CPUs for RL rollouts and are competing with hyperscalers for x86 inventory.

Memory wall, redux

NVIDIA H100 SXM5 delivers 1,979 dense TFLOPS in FP8 against 3.35 TB/s HBM3 bandwidth. Ironwood TPU delivers 4,614 dense FP8 TFLOPS against 7.4 TB/s HBM3e. Between these two parts, math throughput is 2.3× higher and memory bandwidth is 2.2× higher. The ratio of FLOPS to bandwidth held, which means the memory wall isn’t closing.
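The ratio claim can be checked directly from the four figures quoted above. Dividing TFLOPS by TB/s gives FLOPs available per byte of memory traffic, which is the number a memory-bound workload cares about. This is a sketch using only the article's figures.

```python
# (dense FP8 TFLOPS, memory bandwidth in TB/s), as quoted above.
chips = {
    "H100 SXM5": (1979, 3.35),
    "Ironwood TPU": (4614, 7.4),
}

# TFLOPS / (TB/s) = FLOPs available per byte moved.
flops_per_byte = {name: tflops / tbps for name, (tflops, tbps) in chips.items()}

flops_growth = chips["Ironwood TPU"][0] / chips["H100 SXM5"][0]
bw_growth = chips["Ironwood TPU"][1] / chips["H100 SXM5"][1]

for name, ratio in flops_per_byte.items():
    print(f"{name}: {ratio:.0f} FLOPs per byte")
print(f"compute grew {flops_growth:.1f}x, bandwidth grew {bw_growth:.1f}x")
```

Both parts land near 600 FLOPs per byte, so a workload that was starved for bandwidth on one is starved by about the same factor on the other.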

HBM has become a structural constraint, sold out through 2026 across SK Hynix, Micron, and Samsung. In current AI server BOMs, HBM is approaching 30 to 40% of total system cost. Groq’s SRAM-based approach is one response. HBM4 is another. CXL-attached memory is a third. None of them resolve the underlying tension between FLOPS scaling and bandwidth scaling on a single die.

Network at planet scale

The largest production AI fabrics have no precedent in the CPU era. Google’s internal Jupiter network runs at roughly 13 petabits per second. Its AI training fabric (Virgo) operates at 47 Pb/s linking ~134,000 TPUs nonblocking, with a roadmap toward 1 million chips in a single logical training job.

Compare that to a top-of-rack 10GbE switch in 2013 (~640 Gbps total switching capacity). Virgo’s 47 Pb/s is roughly 70,000 times that, a gap of about five orders of magnitude in 13 years. Network architecture isn’t a downstream consequence of the AI workload anymore. It’s a first-order capability.

Economic pattern: capital dominates, but operating reality is the multiplier

Now the second engine.

The TCO framework applied to GPU servers shows the inversion. Where CPU servers split roughly $301 capital / $220 hosting per month, GPU servers split roughly $7,025 capital / $1,871 hosting. Capital is 79% of TCO. The capital share has risen from 58% to 79%, which is a structural shift in what determines unit economics.
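The shift in capital share follows directly from the monthly figures quoted in this article. A small sketch, with an illustrative helper name:

```python
def capital_share(capital_per_month, hosting_per_month):
    """Capital as a fraction of monthly TCO (capital + hosting)."""
    return capital_per_month / (capital_per_month + hosting_per_month)


cpu_share = capital_share(301, 220)
gpu_share = capital_share(7025, 1871)

# Dollars of capital carried per dollar of hosting cost.
cpu_capital_per_hosting_dollar = 301 / 220
gpu_capital_per_hosting_dollar = 7025 / 1871

print(f"CPU server: {cpu_share:.0%} capital, "
      f"${cpu_capital_per_hosting_dollar:.2f} capital per hosting dollar")
print(f"GPU server: {gpu_share:.0%} capital, "
      f"${gpu_capital_per_hosting_dollar:.2f} capital per hosting dollar")
```

Expressed per hosting dollar, the capital burden rises from about $1.37 to about $3.75, which is another way of seeing why idle time became so much more expensive.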

That doesn’t mean operating efficiency stops mattering. It means operating efficiency is now the multiplier on capital efficiency rather than a co-equal lever. Power costs, cooling design, failure rates, network oversubscription, cluster availability, and utilization losses all directly determine whether the capital earns revenue at all. A poorly run AI factory loses money even when its capital structure looks pristine.

A well-run one converts capital advantages into compounding returns. So when I say capital is the primary economic variable, I mean it sets the ceiling. Operating reality determines how close to that ceiling you actually run.

The naive reading of capital dominance is that GPUs depreciate over 3 to 4 years and anything below high utilization is ruinous. That reading misses how diversified operators actually run their fleets.

The value cascade is a plausible and currently well-supported pattern, though not yet proven durable across all operators. The industry framing is a rough three-tier lifecycle:

– Years 1 to 2: Frontier training. Latest-generation silicon (B200, GB300, Rubin) supports peak-performance training of frontier models.

– Years 3 to 4: Production inference. As newer GPUs take the training slot, previous-generation silicon (H100, H200) cascades down to high-value real-time inference.

– Years 5 to 6: Batch and analytics. A100s and older H100s support batch processing, analytics workloads, and cost-sensitive inference.

The cascade is reflected in accounting. Microsoft moved server and network equipment useful life from 4 to 6 years effective FY2023. Meta extended certain servers and network assets to 5.5 years effective 2025, reducing expected 2025 depreciation by about $2.9B. Amazon’s 2025 annual report describes chips, servers, and networking gear as generally 5 to 6 year assets, with data centers as 30+ year assets.

These useful-life extensions collectively reduced hyperscalers’ 2024 depreciation expense by roughly $18B. Operational evidence supports the accounting: CoreWeave’s H100s from 2022 contract expirations rebooked at ~95% of original pricing in 2025. A100s, announced in 2020, remain in strong inference demand five years later.

But it’s important to be honest about what’s proven and what’s assumed. Three serious counterarguments exist. First, software efficiency gains (model architecture improvements, quantization, sparsity) could compress the cascade if algorithmic progress makes old silicon obsolete faster than demand at the prior-generation performance level grows. Second, if hyperscalers simultaneously rotate older chips into secondary markets at scale, prices could collapse. Third, specialized operators (Groq’s CEO has argued for 1-year depreciation) can’t cascade because their workloads don’t diversify across performance tiers.

So the cascade is real and observable in 2025 to 2026 pricing data, but it depends on continued inference demand growth at prior-generation performance levels and on software-hardware co-evolution staying gentle enough that old silicon retains useful work.

The cascade is also messier than the three-tier ladder suggests. GPUs are constrained by customer contracts, cluster topology, networking, physical location, power and cooling envelope, firmware stack, memory capacity, software support, security isolation, and tenant requirements. You cannot always move H100s from “frontier training” into “inference” just because the spreadsheet says so. Realized cascade value is the expected outcome across a diversified fleet, not a guaranteed line item.

This produces a more nuanced economic picture than “capital dominates, utilize-or-die”:

– Hyperscalers’ realized cost of capital is lower than 3-year depreciation suggests, because they capture cascade value across diverse workloads. This is a real but bounded structural advantage of vertical integration.

– Operators running narrower workload mixes face higher effective depreciation, because they can’t cascade as effectively and must sell into secondary markets when demand shifts.

– The build-vs-rent crossover depends on whether the buyer can run workloads that span generations and phases, not just on raw utilization.

Three economic positions emerge

The combination of capital-dominated TCO, compressed refresh cycles, and cascade-dependent life extension produces three distinct economic positions:

– Hyperscalers and frontier labs own the asset, run at high sustained utilization, build custom silicon to internalize NVIDIA margin, and integrate vertically to capture co-design and cascade advantages. Low cost of capital. Long realized depreciation. Their economics work because they can cascade across diverse internal workloads.

– Neocloud operators (CoreWeave, Lambda, Crusoe, Nebius, and others) own the asset using debt-heavy capital structures and sell GPU-hours into a volatile market. They differ meaningfully from each other on contract duration, customer concentration, debt terms, utilization guarantees, managed-service layer, geography, and power access. Their structural risk is duration mismatch: long-lived debt against short-lived GPU pricing assumptions. If H100 rental prices fall faster than their amortization schedules assume, the cascade thesis breaks for them specifically.

– Renters and small/mid players can’t hit utilization thresholds and can’t run mixed-generation fleets at scale, so renting beats owning. They buy across hyperscalers (high price, high reliability), neoclouds on-demand (medium price, medium reliability), and neocloud spot (low price, low reliability).

The rent-vs-buy math reflects this. An 8× H100 SXM server at ~$325K plus ~$75K/year operating, depreciated over 4 years, runs to ~$156K/year. That’s roughly $2.23/GPU-hour at 100% utilization, or $3.70 to $5.60/GPU-hour at 40 to 60% utilization. Against AWS P5 list (~$6.88/GPU-hr), buying wins easily. Against neocloud on-demand (~$2 to $3/GPU-hr), it loses at typical utilization.

If the buyer can cascade the asset across workload tiers and extend useful life from 4 to 6 years, the capital amortization drops from $81K to $54K per year. That’s a 33% reduction in the capital component. Total annual cost falls from $156K to $129K, which is a 17% reduction in total carrying cost. Smaller than the headline 33% sounds, but still meaningful. At 50% utilization, the all-in unit cost drops from ~$4.46/GPU-hr to ~$3.68/GPU-hr, which narrows the gap to typical neocloud on-demand pricing (~$2 to $3/GPU-hr) without closing it. Matching a $3/GPU-hr rental takes roughly 61% utilization on the 6-year schedule, versus roughly 74% on the 4-year schedule. Cascade capability shifts the buy decision in favor of buying, but only modestly. It’s not a binary switch.
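The rent-versus-buy arithmetic can be reproduced with a short sketch. It assumes straight-line depreciation, flat operating cost, 8,760 hours per year, and the article's inputs (~$325K server, ~$75K/year operating, 8 GPUs). The function names are illustrative.

```python
HOURS_PER_YEAR = 8760


def annual_cost(capex, opex_per_year, life_years):
    """Straight-line capital amortization plus yearly operating cost."""
    return capex / life_years + opex_per_year


def cost_per_gpu_hour(capex, opex_per_year, life_years, utilization, gpus=8):
    """All-in cost per utilized GPU-hour."""
    billable_hours = gpus * HOURS_PER_YEAR * utilization
    return annual_cost(capex, opex_per_year, life_years) / billable_hours


def breakeven_utilization(capex, opex_per_year, life_years, rental_price, gpus=8):
    """Utilization at which owning costs the same as renting at rental_price."""
    full_hours = gpus * HOURS_PER_YEAR
    return annual_cost(capex, opex_per_year, life_years) / (full_hours * rental_price)


CAPEX, OPEX = 325_000, 75_000

full_4y = cost_per_gpu_hour(CAPEX, OPEX, 4, 1.0)
half_4y = cost_per_gpu_hour(CAPEX, OPEX, 4, 0.5)
half_6y = cost_per_gpu_hour(CAPEX, OPEX, 6, 0.5)
be_4y = breakeven_utilization(CAPEX, OPEX, 4, rental_price=3.0)
be_6y = breakeven_utilization(CAPEX, OPEX, 6, rental_price=3.0)

print(f"4-year life: ${full_4y:.2f}/GPU-hr at 100%, ${half_4y:.2f} at 50%")
print(f"6-year life: ${half_6y:.2f}/GPU-hr at 50%")
print(f"break-even vs $3/GPU-hr rental: {be_4y:.0%} (4y), {be_6y:.0%} (6y)")
```

Varying `rental_price` and `life_years` shows how sensitive the crossover is to both the rental market and the depreciation assumption, which is the point of treating cascade capability as a shift rather than a switch.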

In the CPU era, cloud rental dominated because the asset was cheap enough to time-share. In the GPU era, the market is fragmenting along two axes: cost of capital, and workload liquidity. The buyers who win on both axes are the ones who can match changing workloads to changing hardware generations continuously.

Part 4 | The orchestration layer is where workload liquidity gets realized

There’s one more layer the bottleneck framework misses if you stop at silicon, memory, network, and power: the software layer that routes workloads across all of them. And while it doesn’t create the cascade by itself, it increases the probability that the cascade becomes cash flow rather than stranded capacity.

In the CPU era, the OS layer was Linux, then virtualization, then Kubernetes. These were abstractions that hid hardware differences beneath a uniform interface. The hardware was different enough that virtualization mattered, but similar enough that Kubernetes could treat any x86 box as interchangeable.

In the AI era, the hardware is radically heterogeneous. TPUs, GPUs across multiple generations, custom accelerators (Trainium, MTIA, LPU), different memory architectures, different fabric topologies, different precision formats. And the workload is dynamic. The serving layer has to decide, on every request, which generation handles prefill, which handles decode, which model variant runs where, what batch size to use, and how to meet the SLO for the specific request type.

Orchestration is becoming the operating system for AI infrastructure. And it’s also the financial mechanism that improves the yield on mixed-fleet economics, even if it doesn’t unlock that yield on its own.

The relevant technical concept is goodput. Not raw throughput in tokens per second, but tokens per second that meet the SLO. Academic research has been converging on this for two years:

– DistServe (OSDI 2024) formalized goodput-optimized serving via prefill/decode disaggregation.

– AccelGen (2025) demonstrated up to 13.71× higher goodput by SLO-aware scheduling across heterogeneous request types.

– Helix (ASPLOS 2025) explicitly addressed serving LLMs over heterogeneous GPUs and network via max-flow optimization.

– DOPD (2026) extends NVIDIA Dynamo with dynamic reconfiguration of prefill and decode allocation based on real-time load.

– LoRAServe showed 2× throughput and 50% fewer GPUs by workload-aware adapter placement.
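The difference between throughput and goodput, as defined above, can be shown with a toy calculation. The request records, the two latency SLOs (time to first token and time per output token), and the ten-second window are all hypothetical values chosen for illustration. None of this reproduces the measurement method of the papers listed.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tokens: int      # output tokens generated
    ttft_ms: float   # time to first token
    tpot_ms: float   # average time per output token


def throughput_and_goodput(requests, window_s, ttft_slo_ms, tpot_slo_ms):
    """Return (all tokens/s, tokens/s from requests that met both SLOs)."""
    total = sum(r.tokens for r in requests)
    good = sum(
        r.tokens
        for r in requests
        if r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms
    )
    return total / window_s, good / window_s


# Hypothetical requests completed in a 10-second window.
served = [
    Request(tokens=200, ttft_ms=180, tpot_ms=40),    # meets both SLOs
    Request(tokens=300, ttft_ms=900, tpot_ms=35),    # slow first token
    Request(tokens=500, ttft_ms=150, tpot_ms=45),    # meets both SLOs
    Request(tokens=1000, ttft_ms=210, tpot_ms=120),  # slow token stream
]

raw, good = throughput_and_goodput(served, window_s=10,
                                   ttft_slo_ms=400, tpot_slo_ms=50)
print(f"throughput {raw:.0f} tokens/s, goodput {good:.0f} tokens/s")
```

In this toy window most of the generated tokens came from requests that missed an SLO, so a fleet that looks busy on a throughput dashboard can still be earning little billable output.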

A queueing-theory study of mixed-fleet inference surfaced seven results that contradict naive intuition:

– The optimal split threshold is not readable off the latency distribution.

– A 30%-utilized fleet can fail its SLO.

– A slow GPU can beat a fast one on cost.

– GPU scaling is sub-linear.

– The sizing router should not be the production router.

– Mixed GPU types can fail even as they save money.

– In disaggregated serving, the cheaper GPU should handle prefill, not decode.

None of these results are visible if you treat the fleet as homogeneous. All of them shape the realized economics of a mixed-generation fleet running diverse workloads.

But the operational reality is messier than the academic literature suggests. Production inference is constrained by SLOs, model placement, KV cache pressure, cold starts, token variance, tenant isolation, quota management, and failure domains. The best router on paper may be defeated by where the GPUs physically sit and which customers are contractually entitled to them.

Older silicon can still be monetized through batch inference, fine-tuning, embeddings, smaller models, internal workloads, reserved customers, spot markets, or lower-tier cloud SKUs, with or without sophisticated orchestration. Sophisticated orchestration improves yield. It is not always the binary difference between cash flow and idle assets.

That said, the upper bound on what an operator can extract from a mixed-generation fleet is set by orchestration capability. Stack-owning hyperscalers have it natively (Pathways at Google, internal routing at Meta and AWS). Sophisticated neoclouds are building it (CoreWeave deploying Dynamo, multi-tenant routing across customer workloads). The recent NVIDIA-Groq integration means even single-vendor environments now require orchestration to manage heterogeneous silicon. The orchestration layer is where the realized advantage of vertical integration gets paid out, even though it isn’t the source of that advantage.

Part 5 | What this predicts

Five structural differences from the CPU cycle:

Layers are more tightly coupled, and require coordinated design. The CPU cycle was loosely coupled. The AI cycle is tightly coupled. Effective competition requires control over enough of the stack (workload, scheduler, silicon roadmap, fabric, demand aggregation) to drive coordinated design, whether through full vertical ownership or through ecosystem position.

Workloads have separated into architectural archetypes. Compute-dense (training and prefill) and memory-bandwidth-dense (decode, especially agentic) have different silicon, fabric, and economic requirements. NVIDIA’s $20B Groq deal in late 2025 made this irreversible at the platform level.

Capital structure and workload liquidity jointly define competitive position. When capital is 79% of TCO, cost of capital is the primary economic variable. But realized depreciation depends on the operator’s ability to keep the asset productive across workload tiers as newer silicon arrives. This favors diversified hyperscalers, challenges neoclouds (whose risk is duration mismatch between long-lived debt and volatile GPU pricing), and forces small/mid players into rental markets.

The orchestration layer is the new contested frontier. Heterogeneous serving across multi-generation, multi-vendor fleets is where workload liquidity gets realized. This is where standalone middleware (Pathways, Dynamo, Ray, vLLM, SGLang) is competing for the position Kubernetes held in the CPU era.

Power is the new bottom layer. 5-to-10-year build times for transmission, generation, and substations are far longer than chip, fabric, or cooling refresh cycles. This will be the longest-running constraint of the entire cycle. Frontier procurement is now multi-sourced across silicon stacks, geographies, and power markets simultaneously.

The bottom line

The supercycle has two engines. The technical engine is bottleneck migration. The economic engine is an asset class. In the CPU era, both pointed in the same direction. Loosely coupled layers and a balanced cost stack produced sequential investment waves and a cloud rental industry built on time-sharing cheap assets. In the AI era, both point in a new direction. Tightly coupled layers and a capital-dominated cost stack produce concurrent co-design and a fragmented market built around cost of capital and workload liquidity.

GPU economics are capital-dominated. But capital recovery depends less on simple utilization and more on workload liquidity, the ability to continuously match changing workloads to changing hardware generations across training, prefill, decode, batch, and lower-priority inference. The winners are operators with the demand mix, the silicon access, and the orchestration capability to keep their assets productive as the hardware below them cycles. Orchestration doesn’t create the cascade. It increases the probability that the cascade becomes cash flow rather than stranded capacity.

Stack owners and well-positioned ecosystem participants win the technical game because they can drive coordinated design. Diversified operators win the economic game because they have workload liquidity. And the orchestration layer determines how much of either advantage actually shows up in the income statement.

We’re not building a faster CPU cycle. We’re building a different kind of infrastructure, governed by different forces, financed in different ways, and consolidating around a different set of winners. The new investable layers (orchestration, GPU-collateralized debt, capacity-as-asset-class, power) are the ones the CPU era didn’t have language for. That’s the actual investment thesis underneath all the GPU spend.

Vish Nandlall is a leading technology strategist focused on the convergence of 6G, AI, and autonomous systems. As Founder and Lead Analyst of Vish Nandlall Consulting, he advises global operators, investors, and policymakers on the economics and architecture of next-generation networks. A former CTO at major telecom and cloud organizations, Vish has shaped industry roadmaps across 5G, edge computing, and AI infrastructure. His current work explores how intelligence, connectivity, and computation will fuse to define the 6G era.
