The Inference Architecture Shift
Training was a batch job. Inference is a 24/7 production workload — and it needs a completely different infrastructure.
What's Technically Happening
Inference is now the majority of AI compute. Industry estimates put inference at roughly two-thirds of total AI compute spend in 2026, up from half in 2025 and one-third in 2023. The shift is driven by product adoption: ChatGPT, Claude, Gemini, Copilot, and countless downstream applications each generate billions of inference tokens per day. Each query is small on its own, but in aggregate they form a persistent 24/7 load.
Training and inference have fundamentally different infrastructure requirements. Training is batch: you run a large cluster flat out for weeks, then stop. Inference is production: always on, latency-sensitive, geographically distributed, with tail latency as a hard business constraint. This shifts the bottleneck from raw compute (FLOPS) to memory bandwidth, memory capacity, and network latency.
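The memory-bandwidth bottleneck can be made concrete with back-of-envelope arithmetic: at batch size 1, generating each token requires streaming every model weight through the memory system once, so bandwidth, not FLOPS, caps decode speed. The sketch below uses illustrative figures (a 70B-parameter model in FP16 on a GPU with roughly H100-class HBM bandwidth); real serving changes the picture with batching, quantization, and KV-cache reads.

```python
# Back-of-envelope: why single-stream decode is memory-bandwidth-bound.
# All figures are illustrative assumptions, not vendor specifications.

def decode_tokens_per_sec(param_count: float, bytes_per_param: float,
                          mem_bandwidth_bytes: float) -> float:
    """Upper bound on batch-size-1 decode speed: each generated token
    must read all model weights from memory once."""
    model_bytes = param_count * bytes_per_param
    return mem_bandwidth_bytes / model_bytes

# Assumed shape: 70B parameters, FP16 (2 bytes/param), ~3.35 TB/s HBM.
tps = decode_tokens_per_sec(70e9, 2, 3.35e12)
print(f"decode upper bound: {tps:.1f} tokens/s per stream")
```

Under these assumptions the ceiling is roughly 24 tokens per second per stream, regardless of how many FLOPS the chip can do. That is why decode pools are provisioned for bandwidth and why batching many streams together is the main lever for throughput.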
The emerging architectural response is disaggregated serving. Instead of running the full inference pipeline on the same GPU pool, the workload is split: a "prefill" stage (processing the prompt, compute-heavy) runs on compute-dense GPUs, while a "decode" stage (generating output tokens one at a time, memory-bandwidth-heavy) runs on separate memory-dense GPUs. Empirical measurements show a 60–75% throughput improvement over colocated serving for long-context workloads (8K+ token prompts) at high concurrency.
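The prefill/decode split can be sketched as two worker pools that hand a KV cache across the fabric. This is a minimal control-flow illustration, not any real framework's API; the class and field names are invented for the example.

```python
from dataclasses import dataclass

# Sketch of disaggregated serving: prefill and decode run in separate
# pools and hand off the KV cache between them. Names are illustrative.

@dataclass
class KVCache:
    prompt_len: int   # tokens already processed
    blocks: list      # stand-in for per-layer attention state

class PrefillWorker:
    """Runs on compute-dense GPUs: one big parallel pass over the prompt."""
    def prefill(self, prompt_tokens: list) -> KVCache:
        # Real systems compute attention keys/values for every layer here.
        return KVCache(prompt_len=len(prompt_tokens),
                       blocks=[f"state-for-{t}" for t in prompt_tokens])

class DecodeWorker:
    """Runs on memory-dense GPUs: one token per step."""
    def decode(self, kv: KVCache, max_new_tokens: int) -> list:
        out = []
        for i in range(max_new_tokens):
            # Each step re-reads weights plus the growing KV cache:
            # memory-bandwidth-bound, not compute-bound.
            token = f"tok{kv.prompt_len + i}"
            kv.blocks.append(f"state-for-{token}")
            out.append(token)
        return out

# Handoff: prefill pool -> (KV cache over the fabric) -> decode pool.
kv = PrefillWorker().prefill(["the", "quick", "brown", "fox"])
completion = DecodeWorker().decode(kv, max_new_tokens=3)
print(completion)
```

The key operational point is the handoff: the KV cache produced by prefill must move to (or be shared with) the decode pool, which is exactly where the fabric and memory-pooling hardware discussed below earns its place.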
This architecture drives new demand for: CXL memory pooling (letting GPUs share a common memory fabric across the rack), KV-cache storage (the intermediate attention state that must be kept hot between tokens), high-performance persistent storage for model weights and prompt caches, and low-latency east-west networking between pools. Storage and networking spend in AI capex is now growing nearly as fast as compute spend, after years of lagging. Edge inference is also growing as inference latency becomes a user-facing metric; Deloitte describes an emerging three-tier architecture of cloud + core + edge.
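The scale of the KV-cache problem follows directly from model shape: for every token in flight, the server must hold keys and values for every layer. A rough sizing formula, using an assumed Llama-2-70B-like configuration (80 layers, 8 KV heads with grouped-query attention, head dimension 128, FP16):

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of key + value state held hot for one sequence.
    Factor of 2 covers K and V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# Assumed 70B-class shape: 80 layers, 8 KV heads (GQA), head_dim 128, FP16.
per_seq = kv_cache_bytes(seq_len=8192, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{per_seq / 2**30:.2f} GiB per 8K-token sequence")
```

Under these assumptions a single 8K-token conversation pins about 2.5 GiB of hot state; a few hundred concurrent conversations pin hundreds of GiB, which is more than any single GPU's memory. That arithmetic is the demand driver behind CXL pooling and fast tiered storage for caches.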
In Plain English
For the last three years, when people talked about "AI infrastructure," they mostly meant "training." Training is the phase where you build a model — you take a lot of GPUs, point them at a pile of data, and run them flat out for weeks or months. It's a big project with a clear end. When it's done, you have a model file.
But the model file isn't useful on its own. You have to run it — process queries from users, generate text, stream responses. That's inference, and it's the part where the model actually earns its keep. For most of the last decade, inference was a rounding error compared to training. That has flipped. Inference is now roughly two-thirds of all the AI compute in the world, and climbing. Every ChatGPT conversation, every Claude exchange, every Copilot completion, every image generated — that's inference, running 24 hours a day, 7 days a week, at global scale.
And here is the thing: inference has completely different requirements than training. Training is like running a movie shoot — expensive and intense, but when it wraps, everyone goes home. Inference is like running a restaurant — the kitchen has to be open all the time, every customer wants their food fast, and the bottleneck isn't how big your stove is, it's how quickly you can plate and deliver. In AI terms: inference is less bottlenecked by raw compute speed and more bottlenecked by memory bandwidth, memory capacity, and how fast data can move between chips.
So the infrastructure is being rebuilt for this new reality. One big idea is called "disaggregated serving." Instead of running the whole inference pipeline on one type of GPU, you split it. Compute-heavy GPUs handle the opening part of a conversation (where the model reads the prompt), and memory-heavy GPUs handle the generation part (where the model writes the response one token at a time). In measured long-context workloads, this alone delivers 60 to 75% better throughput. Another big idea is a technology called CXL that lets multiple GPUs share a common memory pool instead of each having its own isolated memory. And storage matters like never before, because the intermediate state of every inference query — called the KV cache — has to be kept somewhere fast.
The practical effect: storage and networking companies, which used to be afterthoughts in AI capex, are now growing nearly as fast as compute itself. The inference winners list is a different list than the training winners list.
Who Benefits Most
Beneficiaries are ranked by the directness of their exposure. Tickers that exist in our explorer link to the company brief.
Primary beneficiaries
Direct, first-order exposure. If the trend plays out, these are the names that capture the majority of the value.
Arista Networks. The networking fabric inside hyperscale AI clusters. Ultra Ethernet for scale-out inference traffic. Positioned as the default for non-NVIDIA fabrics.
Pure Storage. High-performance all-flash storage for model weights, KV caches, and persistent inference state. AI-specific product lines expanding.
NetApp. Enterprise storage increasingly feeding inference pipelines. AI storage segment is NetApp's fastest-growing product line.
Astera Labs. CXL switch chips, memory expansion, fabric interconnect for disaggregated inference. Pure-play beneficiary of the memory-pooling trend.
Secondary beneficiaries
Real exposure but competing with alternatives or dependent on adjacent calls.
Micron. Double-dip: HBM for inference GPUs AND memory for CXL pools and persistent caches.
Marvell. DPUs (data processing units), storage acceleration, and custom silicon for hyperscaler inference pools.
Credo. High-speed SerDes and active electrical cables (AECs) used in short-reach inference fabrics.
Western Digital. Bulk storage for model repositories and cold inference data.
Picks and shovels
Enabling suppliers whose revenue scales with the trend regardless of which frontline vendor wins.