Google opened Cloud Next 2026 in Las Vegas on April 22 by unveiling the eighth generation of its Tensor Processing Unit — for the first time split into two distinct chips. TPU 8t is the training chip, designed to scale up to 9,600 TPUs and two petabytes of shared high-bandwidth memory in a single superpod. TPU 8i is the inference chip, using a new Boardfly interconnect topology and 384 megabytes of on-chip SRAM per chip — triple the SRAM of the previous-generation Ironwood TPU. Google says both chips deliver roughly double the performance-per-watt of Ironwood, and that TPU 8i lands up to 80% better performance-per-dollar on reasoning and mixture-of-experts workloads. Both will ship later in 2026.
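To put those headline numbers in per-chip terms, here is a quick back-of-the-envelope sketch. It uses only the figures quoted above and assumes decimal units (1 PB = 1,000,000 GB). The function names are illustrative, and the 80% figure is Google's stated upper bound, so the cost result is a best case rather than a measurement.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the announcement.
# Assumption: decimal units (1 PB = 1,000,000 GB). Function names are illustrative.

def hbm_per_chip_gb(total_pb: float, chips: int) -> float:
    """Shared HBM divided evenly across every chip in the superpod."""
    return total_pb * 1_000_000 / chips


def implied_previous_sram_mb(current_mb: float, ratio: float) -> float:
    """SRAM of the prior generation implied by an 'N times larger' claim."""
    return current_mb / ratio


def cost_reduction(gain: float) -> float:
    """Fractional drop in cost (or energy) per unit of work for a given
    fractional gain in performance-per-dollar (or performance-per-watt)."""
    return 1 - 1 / (1 + gain)


if __name__ == "__main__":
    print(f"HBM per TPU 8t chip: {hbm_per_chip_gb(2, 9600):.0f} GB")
    print(f"Implied Ironwood SRAM: {implied_previous_sram_mb(384, 3):.0f} MB")
    print(f"Best-case cost cut at +80% perf/$: {cost_reduction(0.8):.0%}")
    print(f"Energy cut at 2x perf/W: {cost_reduction(1.0):.0%}")
```

Note that an 80% gain in performance-per-dollar does not mean an 80% cheaper bill: the same work costs 1/1.8 of what it did, a cut of roughly 44%.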
Splitting training and inference into separate chips is a meaningful architectural bet. Until now, most accelerators, including Nvidia's H100 and B200, have been sold as general-purpose AI chips that handle both. Google's thesis is that the agentic era shifts the economic center of gravity to inference: many concurrent agents, each making long sequences of small model calls, where latency and memory bandwidth matter more than raw training throughput. A chip tuned specifically for that workload can cut cost per token enough to change which applications are economical to run at scale.
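A rough sketch shows why memory bandwidth dominates this kind of workload. It uses a common simplification: when generating one token at a time for a single request, the chip must read every model weight from memory once per token, so bandwidth sets a ceiling on speed. The model size, precision, and bandwidth values below are hypothetical placeholders, not TPU 8i specifications. The estimate also ignores batching, KV-cache traffic, and compute time.

```python
# Simplified upper bound on single-stream decode speed.
# Assumption: each generated token requires reading all weights once, so
# throughput <= memory bandwidth / model size in bytes.
# All numbers below are hypothetical and do not describe any real chip.

def decode_tokens_per_second(params_billion: float,
                             bytes_per_param: float,
                             bandwidth_tb_s: float) -> float:
    """Bandwidth-bound ceiling on tokens per second at batch size 1."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token


if __name__ == "__main__":
    # Hypothetical 70B-parameter model stored at 1 byte per parameter.
    for bandwidth in (4, 8):  # hypothetical memory bandwidth in TB/s
        rate = decode_tokens_per_second(70, 1, bandwidth)
        print(f"{bandwidth} TB/s -> at most {rate:.1f} tokens/s")
```

Under this model, doubling memory bandwidth doubles the ceiling, while adding raw arithmetic throughput does nothing for it. That is the intuition behind tuning an inference chip for memory rather than peak compute.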
The announcement lands in a year when AI hardware spending has become the dominant line item for hyperscalers and Nvidia's margins have drawn investor scrutiny. Google is one of the very few companies with both the silicon design capability and the data-center scale to run its own chips in production. Among its peers, Amazon has Trainium and Inferentia, Microsoft has Maia, and Meta has MTIA. The TPU 8t/8i launch tightens the competitive picture: three of the four largest AI buyers now have credible in-house alternatives to Nvidia for at least part of their workload, which matters for pricing power across the whole industry.
For learners: custom silicon is no longer a curiosity; it is a strategic moat. If you are studying machine learning, understanding how memory hierarchy, interconnect topology, and precision formats shape what a chip can do well is increasingly valuable. The same model runs very differently on a TPU pod, a Hopper GPU, and an Apple Neural Engine, and the differences are not subtle. Chip-aware ML engineering is a career track that barely existed five years ago, and engineers with those skills are now in short supply.