Also holy cow that was 10 years ago already? Dang.
Amusing bit: The first TPU design was based on fully connected networks; the advent of CNNs forced some design rethinking, and then the advent of RNNs (and then transformers) did it yet again.
So maybe it's reasonable to say that this is the first TPU designed for inference in the world where you have both a matrix multiply unit and an embedding processor.
(Also, the first gen was purely a co-processor, whereas the later generations included their own network fabric, a trait shared by this most recent one. So it's not totally crazy to think of the first one as a very different beast.)
What were the use cases like back then?
[1]: https://cloud.google.com/blog/products/ai-machine-learning/g...
[2]: https://github.com/rikhuijzer/improv/blob/master/runs/2018-1...
The big ones were SmartASS (ads serving) and Sibyl (everything else serving). There was an internal debate over the value of GPUs, with a prominent engineer writing an influential doc that caused Google to continue with fat CPU nodes when it was clear that accelerators were a good alternative. This was around the time ImageNet blew up, and some engineers were stuffing multiple GPUs in their dev boxes to demonstrate training speeds on tasks like voice recognition.
Sibyl was a heavy user of embeddings before there was any real custom ASIC support for that and there was an add-on for TPUs called barnacore to give limited embedding support (embeddings are very useful for maximizing profit through ranking).
[1]: https://research.google/pubs/warehouse-scale-video-accelerat...
Another part that was left out was that Google did not make truly high speed (low-latency) networking and so many of their CPU jobs had to be engineered around slow networks to maintain high utilization and training speed. Google basically ended up internally relearning the lessons that HPC and supercomputing communities had already established over decades.
Ah, the days when you, as a tech company employee, could call a service "SmartASS" and get away with it...
I wasn't on Brain, but got obsessed with the Kremlinology of ML internally at Google because I wanted to know why leadership was so gung ho on it.
The general sense in the early days was these things can learn anything, and they'll replace fundamental units of computing. This thought process is best exhibited externally by ex. https://research.google/pubs/the-case-for-learned-index-stru...
It was also a different Google, the "3 different teams working on 3 different chips" bit reminds me of lore re: how many teams were working on Android wearables until upper management settled it.
FWIW it's a very, very different company now. Back then it was more entrepreneurial - a better version of the Wave era, where things launched themselves. An MBA would find the top-down company of 2025 even better; I find it less so. It's perfectly tuned to do what Apple or OpenAI did 6-12 months ago, but not to lead - almost certainly a better investment, but a worse version of an average workplace, because it hasn't developed antibodies against BSing. (disclaimer: worked on Android)
One was the transition to a mature product line. In the early days it was about how do we do cool new things that will delight users: Gmail, Google Maps (Where 2), YouTube. The focus was on user growth and adoption.
Then growth saturated and the focus turned to profitability: Getting more value out of existing users and defending the business. That shift causes you to think very differently, and it's not as fun.
The second was changing market conditions. The web grew up, tech grew up, and the investment needed to make a competitive product skyrocketed. Google needed more wood behind fewer arrows and that meant reining in all the small teams running around doing kooky things. Again not fun, but understandable.
Certainly, RNNs are much older than TPUs?!
Can anyone suggest a better (i.e. more accurate and neutral) title, devoid of marketing tropes?
> first designed specifically for inference. For more than a decade, TPUs have powered Google’s most demanding AI training and serving workloads...
What do they think serving is? I think this marketing copy was written by someone with no idea what they are talking about, and not reviewed by anyone who did.
Also funny enough it kinda looks like they've scrubbed all their references to v4i, where the i stands for inference. https://gwern.net/doc/ai/scaling/hardware/2021-jouppi.pdf
I know they aren't selling the TPU as boxed units, but still, even as hardware that backs GCP services and whatnot, it's interesting to see how it'll shake out!
Did it?
Both Mistral's Le Chat (running on Cerebras) and Google's Gemini (running on TPUs) clearly showed ages ago that Nvidia had no advantage at all in inference.
The hundreds of billions spent on hardware so far have focused on training, but inference is in the long run gonna get the lion's share of the work.
I'm not sure - might not the equilibrium state be that we are constantly fine-tuning models with the latest data (e.g. social media firehose)?
This benefits everyone, even if you don't use Google Cloud, because of the competition it introduces.
I don’t want to own open source software and I’d prefer if culture was derived from and contributed to the public domain.
I’m still fully on board with capitalism, but there are many instances where I’d prefer to replace physical acquisition with renting, or replacing corporate-made culture with public culture.
Whenever assessing the work involved in building an integration, always assume you’ll be doing it twice. If that sounds like too much work then you shouldn’t have outsourced to begin with.
Like Graviton at AWS, it's as much a negotiation tool as it is a technical solution, letting them push harder on pricing with NVIDIA because they have a backup option.
P.S. I found an on-the-record statement re Gemini 1.0 on TPU:
"We trained Gemini 1.0 at scale on our AI-optimized infrastructure using Google’s in-house designed Tensor Processing Units (TPUs) v4 and v5e. And we designed it to be our most reliable and scalable model to train, and our most efficient to serve."
Why?
For each new word a transformer generates it has to move the entire set of model weights from memory to compute units. For a 70 billion parameter model with 16-bit weights that requires moving approximately 140 gigabytes of data to generate just a single word.
GPUs have off-chip memory. That means a GPU has to push data across a chip-to-memory bridge for every single word it creates. This architectural choice is an advantage for graphics processing, where large amounts of data need to be stored but not necessarily accessed as rapidly for every single computation. It's a liability in inference, where quick and frequent data access is critical.
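To put rough numbers on that, here's a back-of-the-envelope sketch. The bandwidth figure is an assumption on my part (roughly an H100-class part), and it ignores batching and the KV cache:

```python
# Rough sketch of why single-stream decoding is memory-bandwidth bound.
params = 70e9            # 70B-parameter model
bytes_per_weight = 2     # 16-bit weights
bytes_per_token = params * bytes_per_weight   # ~140 GB streamed per generated token

hbm_bandwidth = 3.35e12  # ~3.35 TB/s, an assumed H100-class memory bandwidth
max_tokens_per_s = hbm_bandwidth / bytes_per_token
print(f"~{max_tokens_per_s:.0f} tokens/s upper bound at batch size 1")  # ~24 tokens/s
```

Batching amortizes the weight traffic across many requests, which is why providers push batch sizes up, but for a single latency-sensitive stream that bandwidth wall is what you hit first.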
Listening to Andrew Feldman of Cerebras [0] is what helped me grok the differences. Caveat, he is a founder/CEO of a company that sells hardware for AI inference, so the guy is talking his book.
[0] https://www.youtube.com/watch?v=MW9vwF7TUI8&list=PLnJFlI3aIN...
I wish I could say more about what AMD is doing in this space, but keep an eye on their MI4xx line.
From die shots and materials I’ve seen, it even looks like ~40% of the die might be allocated to memory [1]. Given that, I’m curious about your point on “not enough die for memory” — is it a matter of absolute capacity still being insufficient for current model sizes, or more about the area-bandwidth tradeoff being unbalanced for inference workloads? Or perhaps something else entirely?
I’d love to understand this design tension more deeply, especially from someone with a high-level view of real-world deployments. Thanks again.
[1] Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads — Fig. 5. Die photo of 14nm ASIC implementation of the Groq TSP. https://groq.com/wp-content/uploads/2024/02/2020-Isca.pdf
This. Additionally, models aren't getting smaller, they are getting bigger and to be useful to a wider range of users, they also need more context to go off of, which is even more memory.
Previously: https://news.ycombinator.com/item?id=42003823
It could be partially the DC, but look at the rack density... to get to an equal amount of GPU compute and memory, you need 10x the rack space...
https://www.linkedin.com/posts/andrewdfeldman_a-few-weeks-ag...
Previously: https://news.ycombinator.com/item?id=39966620
Now compare that to an NVL72 and the direction Dell/CoreWeave/Switch are going in with the EVO containment... far better. One can imagine that AMD might do something similar.
https://www.coreweave.com/blog/coreweave-pushes-boundaries-w...
What I’m still trying to understand is the economics.
From this benchmark: https://artificialanalysis.ai/models/llama-4-scout/providers...
Groq seems to offer nearly the lowest prices per million tokens and nearly the fastest end-to-end response times. That's surprising because, in my understanding, speed (latency) and cost are a trade-off.
So I'm wondering: why can't GPU-based providers offer cheaper but slower (higher-latency) APIs? Or do you think Groq/Cerebras are pricing much below cost (loss-leader style)?
That is curious. Things are moving so quickly right now. I typed out a few speculative sentences then went ahead and asked an LLM.
Looks like Cerebras is responding to the market and pivoting towards a perceived strength of their product combined with the growth in inference, especially with the advent of reasoning models.
https://www.datacenterknowledge.com/data-center-chips/ai-sta...
https://www.semafor.com/article/12/03/2024/amazon-announces-...
What even is an AI data center? are the GPU/TPU boxes in a different building than the others?
Google does many pieces of the data center better. Google TPUs use 3D torus networking and are liquid cooled.
> What even is an AI data center?
Being newer, AI installations have more variations/innovation than traditional data centers. Google's competitors have not yet adopted all of Google's advances.
> are the GPU/TPU boxes in a different building than the others?
Not that I've read. They are definitely bringing on new data centers, but I don't know if they are initially designed for pure-AI workloads.
And I'll echo, what even is an AI data center, because we're still none the wiser.
That said, the torus approach was a gamble that most workloads would be nearest-neighbor, and allreduce needs extra work to optimize.
An AI data center tends to have enormous power consumption and cooling capabilities, with less disk, and slightly different networking setups. But really it just means "this part of the warehouse has more ML chips than disks"
Thank you very much, that is the piece of the puzzle I was missing. Naively, it still seems (to me) far more hops for a 3d torus than a regular multi-level switch when you've got many thousands of nodes, but I can appreciate it could be much simpler routing. Although, I would guess in practice it requires something beyond the simplest routing solution to avoid congestion.
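For a rough feel, the hop counts on a 3D torus are easy to sketch; the dimensions below are hypothetical, chosen only so the node count matches a 9,216-chip pod, not Google's actual topology:

```python
# Toy hop-count estimate for a 3D torus (assumed shape, not the real one).
import random

def torus_hops(a, b, dims):
    # Shortest path on a torus: per-dimension distance with wrap-around, summed.
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

dims = (16, 24, 24)                      # 16 * 24 * 24 = 9,216 nodes (hypothetical)
random.seed(0)
pairs = [(tuple(random.randrange(d) for d in dims),
          tuple(random.randrange(d) for d in dims)) for _ in range(1000)]
avg = sum(torus_hops(a, b, dims) for a, b in pairs) / len(pairs)
worst = sum(d // 2 for d in dims)        # 8 + 12 + 12 = 32
print(f"avg ~{avg:.1f} hops, worst case {worst} hops")
```

So yes, far more hops than the handful of switch traversals in a folded-Clos fabric, but each hop is a short, directly attached link with simple dimension-ordered routing, and nearest-neighbor traffic patterns map onto it naturally - which is exactly the gamble mentioned above.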
A data center that runs significant AI training or inference loads. Non AI data centers are fairly commodity. Google's non-AI efficiency is not much better than Amazon or anyone else. Google is much more efficient at running AI workloads than anyone else.
I don't think this is true. Google has long been a leader in efficiency. Look at the power usage effectiveness (PUE). A decade ago Google announced average PUEs around 1.12 while the industry average was closer to 2.0. From what I can tell they reported a 1.1 average fleet wide last year. They've been more transparent about this than any of the other big players.
AWS is opaque by comparison, but they report 1.2 on average. So they're close now, but that's after a decade of trying to catch up to Google.
To suggest the rest of the industry is on the same level is not at all accurate.
https://en.wikipedia.org/wiki/Power_usage_effectiveness
(Amazon isn't even listed in the "Notably efficient companies" section on the Wikipedia page).
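Since PUE is just total facility power over IT power, the gap translates directly into overhead at a fixed compute load. A quick sketch with the figures above (the 100 MW IT load is a made-up round number):

```python
# PUE = total facility power / IT equipment power.
# IT load is hypothetical; PUE figures are from the comment above
# (Google ~1.1, AWS ~1.2, older industry average ~2.0).
it_load_mw = 100

for name, pue in [("Google", 1.1), ("AWS", 1.2), ("industry avg (older)", 2.0)]:
    total = it_load_mw * pue
    print(f"{name}: {total:.0f} MW total, {total - it_load_mw:.0f} MW overhead")
```

At 2.0 you burn as much power on cooling and distribution as on the servers themselves; at 1.1 the overhead is a tenth of that.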
We've seen the rise of OSS Kubernetes and eBPF networking since, and a lot more that I don't have on-stack rn.
I wouldn't be surprised if everyone else had significantly closed the hardware utilization gap.
No one else has access to anything similar, Amazon is just starting to scale their Trainium chip.
The end of Moore's law pretty much dictates specialization, it's just more apparent in fields without as much ossification first.
Why compare fp64 flops in the El Capitan supercomputer to fp8 flops in the TPU pod when you know full well these are not comparable?
[Edit: it turns out that El Capitan is actually faster when compared like for like, and the statement below underestimated how much slower fp64 is; my original comment in italics below is not accurate.] (The TPU would still be faster even allowing for the fact that fp64 is ~8x harder than fp8. Is it worthwhile to misleadingly claim it's 24x faster instead of honestly saying it's 3x faster? Really?)
It comes across as a bit cheap. Using misleading statements is a tactic for snake oil salesmen. This isn't snake oil so why lower yourself?
In other words El Capitan is between 2 and 4 times as fast as one of these pods, yet they claim the pod is 24x faster than El Capitan.
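Here's one way to reproduce that 2-4x figure. The conversion factor is the contested part; the width-squared multiplier-cost assumption below is mine and only a rough heuristic (the significand-squared argument further down gives an even bigger factor):

```python
# Back-of-the-envelope version of the like-for-like comparison above.
# Assumption (mine): multiplier cost scales roughly with operand width squared,
# so one 64-bit op ~ (64/8)^2 = 64 fp8-equivalent ops.
pod_fp8_eflops = 42.5         # Ironwood pod, fp8 (from the announcement)
el_capitan_fp64_eflops = 1.7  # El Capitan, fp64

naive_ratio = pod_fp8_eflops / el_capitan_fp64_eflops          # ~25x, the headline claim
el_capitan_fp8_equiv = el_capitan_fp64_eflops * (64 / 8) ** 2  # ~109 EF "fp8-equivalent"
print(f"naive: pod is {naive_ratio:.0f}x faster")
print(f"like for like: El Capitan is ~{el_capitan_fp8_equiv / pod_fp8_eflops:.1f}x the pod")
```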
And so unnecessary too- nobody shopping for AI inference server cares at all about its relative performance vs a fp64 machine. This language seems designed solely to wow tech-illiterate C-Suites.
My impression from this is that they are too scared to say that their TPU pod is equivalent to 60 GB200 NVL72 racks in terms of fp8 flops.
I can only assume that they need way more than 60 racks and they want to hide this fact.
Actually the cost is even much higher, because the cost ratio is not much less than the square of the ratio between the sizes of the significands, which in this case is 52 bits / 4 bits = 13, and the square of 13 is 169.
Hence Tesla saying FSD and robo-taxis are 1 year away, the fusion companies saying fusion is closer than it is etc....
Nvidia, AMD, Apple, and Intel have all been publishing misleading graphs for decades, and even under constant criticism they continue to.
A big part of my issue here is that they've really messed up the misleading benchmarks.
They've failed to compare to the most obvious alternative, which is Nvidia GPUs. They look like they've got something to hide, not like they're ahead.
They've needlessly made their own current products look bad in comparison to this one, understating the long-standing advantage TPUs have given Google.
Then they've gone and produced a misleading comparison to the wrong product (who cares about El Capitan? I can't rent that!). This is a waste of credibility. If you are going to go with misleading benchmarks then at least compare to something people care about.
No one is shopping for GPUs by fp8, fp16, fp32, or fp64. It's all about the cost/performance factor. 8 bits is as good as 32 bits, and great performance is even being pulled out of 4 bits...
Dropping the analogy: f64 multiplication is a lot harder than f8 multiplication, but for ML tasks it's just not needed. f8 multiplication hardware is the right tool for the job.
Because end users want to use fp8. Why should architectural differences matter when the speed is what matters at the end of the day?
That doesn't matter much if those few companies are the biggest companies. Even with Nvidia, the majority of the revenue is generated by a handful of hyperscalers.
I wonder whether Google sees this as a problem. In a way it just means more AI compute capacity for Google.
The article says: "When scaled to 9,216 chips per pod for a total of 42.5 Exaflops, Ironwood supports more than 24x the compute power of the world’s largest supercomputer – El Capitan – which offers just 1.7 Exaflops per pod."
It is literally compared to a competitor.
It would be awesome for things like homelabs (to run Frigate NVR, Immich ML tasks or the Home Assistant LLM).
Are there a few big things, many small things...? I'm curious what low-hanging fruit is left for fast SIMD matrix multiplication.
NB: Hobbyist, take all with a grain of salt
It seems to me that floating point math (matrix multiplication) will over time mostly disappear from ML chips, as Boolean operations are much faster in both training and inference. But currently these chips are still optimized for FP rather than Boolean operations.
https://semiengineering.com/speeding-down-memory-lane-with-c...
TensorFlow, PyTorch, and Jax all support XLA on the backend.
[1]: https://openxla.org/
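For anyone who hasn't touched it, the XLA path is mostly invisible from user code. A minimal JAX sketch (toy shapes, nothing TPU-specific):

```python
import jax
import jax.numpy as jnp

# jax.jit traces the Python function and hands the trace to XLA, which compiles
# it for whatever backend is active (CPU, GPU, or TPU) - user code doesn't change.
@jax.jit
def layer(x, w):
    return jax.nn.relu(x @ w)

x = jnp.ones((8, 512), dtype=jnp.bfloat16)
w = jnp.ones((512, 512), dtype=jnp.bfloat16)
print(layer(x, w).shape)  # (8, 512)
print(jax.devices())      # shows which backend XLA compiled for
```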
There is going to be a GPU/accelerator shortage for the foreseeable future to run the most advanced models; Gemini 2.5 Pro is such a good example. It is probably the first model on which many developers I'd considered skeptics of extended agent use have started to saturate free token thresholds.
Grok is honestly the same, but the lack of an API is suggestive of the massive demand wall they face.
edit: >It’s a move from responsive AI models that provide real-time information for people to interpret, to models that provide the proactive generation of insights and interpretation. This is what we call the “age of inference” where AI agents will proactively retrieve and generate data to collaboratively deliver insights and answers, not just data.
Maybe I will sound like a Luddite, but I'm not sure I want this.
I'd rather AI/ML only do what I ask it to.
https://streaminglearningcenter.com/encoding/asics-vs-softwa...
https://docs.jax.dev/en/latest/pallas/tpu/details.html#what-...
It's not really useful for other workloads (unless your workload looks like a bunch of matrix multiplications).
I always confuse Blackwell with Bakewell (tart) and my CPU is on Coffee Lake and great… now I want coffee and cake
The only support is via a few enthusiastic third party developers.
These continue to be mostly for bragging rights and strategic safety, I think. I bet they are not on premium process nodes; if I worked at GOOG I'd probably think about these as competitive insurance vis-a-vis NVIDIA - the total costs of the chip team, software, tape-outs, and increased data center energy use probably wipe out any savings from not buying NV, but you are 100% not beholden to Jensen.
It's like they take some interesting wood carving of communication, then sand it down to a featureless nub.
Google says Ironwood will be available in the Google Cloud late this year, so it's relevant to just about anyone that rents AI compute, which is just about everyone in tech. Even if you have zero interest in this product, it will likely lead to downward pressure on pricing, mostly courtesy of the large memory allocations.
It just seems like if you build on Tensor then sure, you can go home, but Google will keep your ball.
Most places using AI hardware don't actually want to expend massive amounts of capital to procure it and then shove it into racks somewhere and then manage it over its total lifetime. Hyperscalers like Google are also far, far ahead in things like DC energy efficiency, and at really large scale those energy costs are huge and have to be factored into the TCO. The long dominant cost of this stuff is all operational expenditures. Anyone running a physical AI cluster is going to have to consider this.
The walled garden stuff doesn't matter, because places demanding large-scale AI deployments (and actually willing to spend money on it) do not really have the same priorities as HN homelabbers who want to install inefficient 5090s so they can run Ollama.
Probably whales who can afford to rent one from Google Cloud.
The challenge is getting them to run efficiently, which typically involves learning JAX.
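In practice that tends to mean writing code that is explicit about how work maps onto the chips. A minimal sketch of the flavor of it (toy shapes, nothing tuned):

```python
import jax
import jax.numpy as jnp

# pmap replicates the function across the local accelerator cores JAX can see;
# on a TPU VM that's the TPU cores, elsewhere it falls back to whatever exists.
@jax.pmap
def per_core_matmul(x):   # x arrives as (128, 256) on each core
    return x @ x.T        # -> (128, 128) per core

n = jax.local_device_count()
x = jnp.ones((n, 128, 256), dtype=jnp.bfloat16)
y = per_core_matmul(x)    # result is sharded across cores, shape (n, 128, 128)
print(y.shape, jax.devices())
```

The real learning curve is in sharding actual models and keeping the matrix units fed, not in the API itself.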
The programmer who writes code to run on these likely costs at least 15x this amount an hour.