IMO, this is something that makes sense for PyTorch to release, as "neutral ground" in the industry.
Every time the high-level architectures of models change, there are new lower-level optimizations to be done. Even recent releases like GPT-OSS add new areas for improvement, like MXFP4, which require the lower-level parts to be created and optimized.
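As a concrete example of the kind of low-level work MXFP4 implies, here is a minimal dequantization sketch in plain PyTorch, assuming the OCP microscaling layout (blocks of 32 FP4/E2M1 values sharing one E8M0 power-of-two scale); the function name and shapes are mine, not from any library:

    import torch

    # FP4 (E2M1) decode table: codes 0..7 are positive, 8..15 negative.
    FP4_E2M1_LUT = torch.tensor(
        [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
         -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

    def dequant_mxfp4(codes, scales):
        # codes: (num_blocks, 32) integer tensor of 4-bit codes (already unpacked)
        # scales: (num_blocks,) uint8 tensor of E8M0 exponents (bias 127)
        vals = FP4_E2M1_LUT[codes.long()]                 # decode each element
        block_scale = torch.exp2(scales.float() - 127.0)  # shared power-of-two scale per block
        return vals * block_scale.unsqueeze(-1)

A real kernel would fuse this decode into the matmul instead of materializing the full-precision weights, which is exactly the kind of lower-level work being described.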
Is TOPS/Whr a good efficiency metric for TPUs and for LLM model hosting operations?
From https://news.ycombinator.com/item?id=45775181 re: current TPUs and "AI accelerators" in 2025:
> How does Cerebras WSE-3 with 44GB of 'L2' on-chip SRAM compare to Google's TPUs, Tesla's TPUs, NorthPole, Groq LPU, Tenstorrent's, and AMD's NPU designs?
> How often do hardware optimizations get created for lower level optimization of LLMs and Tensor physics?
LLMs? All the time. "Tensor physics" (whatever that is)? Never.
> How reconfigurable are TPUs?
very? as reconfigurable as any other programmable device?
> Are there any standardized feature flags for TPUs yet?
have no idea what a feature flag is in this context nor why they would be standardized (there's only one manufacturer/vendor/supplier of TPUs).
> Is TOPS/Whr a good efficiency metric for TPUs and for LLM model hosting operations?
I don't see why it wouldn't be? You're just asking whether (stuff done)/(energy consumed) is a good measure of efficiency, to which the answer is yes.
x86, ARM, and RISC-V have all standardized on feature flags, which can be reviewed on Linux with /proc/cpuinfo or with dmidecode.
grep -E '^processor|Features|^BogoMIPS|^CPU' /proc/cpuinfo
There are multiple TPU vendors.
I listed multiple AI accelerator TPU products in the comment you are replying to.

> How reconfigurable are TPUs?
TIL Google's TPUs are reconfigurable with OCS (Optical Circuit Switches), which can switch the interconnect between, for example, 3D torus or twisted torus configurations.
(FWIW, quantum libraries also mostly have Line qubits and Lattice qubits. There is a recent "Layer Coding" paper that aims to surpass Surface Coding.)
But for classical TPUs:
I had already started preparing a response to myself to improve those criteria; and then, paraphrasing from 2.5pro:
> Don't rank by TOPS/Whr alone; rank by TOPS/Whr @ [Specific Precision]. Don't rank by Memory Bandwidth alone; rank by Effective Bandwidth @ [Specific Precision].
Hardware ranking criteria for LLM hosting costs:
Criterion 1: EGB (Effective Generative Bandwidth) = Memory Bandwidth (GB/s) / Precision (bytes)
Criterion 2: GE (Generative Efficiency) = EGB / Total Board Power (Watts)
Criterion 3: TTFT Potential = Raw TOPS @ prompt precision
LLM hosting metrics: Tokens Per Second (TPS) for throughput, Time to First Token (TTFT) for latency, and Tokens Per Joule for efficiency.
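To make those criteria concrete, here's a toy calculation with made-up spec numbers (the helper names are mine):

    def effective_generative_bandwidth(mem_bw_gbs, precision_bytes):
        # Criterion 1: EGB = Memory Bandwidth (GB/s) / Precision (bytes)
        return mem_bw_gbs / precision_bytes

    def generative_efficiency(egb, board_power_w):
        # Criterion 2: GE = EGB / Total Board Power (W)
        return egb / board_power_w

    # Hypothetical accelerator: 3350 GB/s of HBM, 700 W board power, FP8 (1-byte) weights.
    egb = effective_generative_bandwidth(3350.0, 1.0)  # how many "giga-weights" can be streamed per second
    ge = generative_efficiency(egb, 700.0)             # bandwidth per watt, a proxy for tokens per joule
    tokens_per_joule = 1800.0 / 700.0                  # e.g. a measured 1800 tok/s at 700 W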
There are not - TPU is literally a Google trademark:
> Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated circuit (ASIC) developed by Google.
https://en.wikipedia.org/wiki/Tensor_Processing_Unit
The rest of what you're talking about is irrelevant
NPU: Neural Processing Unit: https://en.wikipedia.org/wiki/Neural_processing_unit
Coprocessor: https://en.wikipedia.org/wiki/Coprocessor
It's just like game optimization: cache-friendliness and memory-hierarchy awareness are huge in attention mechanisms. But programming the backward pass in these lower-level stacks is definitely not fun; tensor calculus breaks my brain.
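For example (a minimal sketch; shapes and dtype are arbitrary): naive attention materializes the full sequence-by-sequence score matrix in HBM, while a fused kernel keeps tiles in on-chip memory, which is exactly the cache-friendliness point:

    import torch
    import torch.nn.functional as F

    q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)

    # Naive attention: writes a 4096x4096 score matrix per head to HBM,
    # so it is dominated by memory traffic rather than FLOPs.
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    out_naive = torch.softmax(scores, dim=-1) @ v

    # Fused (FlashAttention-style) path: tiles Q/K/V through on-chip SRAM
    # and never materializes the score matrix.
    out_fused = F.scaled_dot_product_attention(q, k, v)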
new attention mechanisms also often need new kernels to run at any reasonable rate
there's definitely a breed of frontend-only ML dev that dominates the space, but a lot of novel exploration needs new kernels
It's also kind of ironic that right now, in 2025, we have all this diversity in tooling, but at the same time the ML architecture space has collapsed entirely and everyone is just using transformers.
What? CUDA won't be irrelevant for years even if all the competitors figure out the holy grail; the ecosystem doesn't suddenly migrate overnight. People learning CUDA today will continue to find jobs and opportunities across the sector for the near future without any worries.
> but at the same time, the ML architecture space has collapsed entirely and everyone is just using transformers.
That's also not true: the ML space is still growing, and there is plenty happening outside of Transformers, but it requires you to actually look and pay attention, not just browse the HN and r/localllama frontpage.
Overall, these do not seem to be the sentiments of someone inside the ML space, but rather an onlooker's perspective.
Lol this is so wrong it's cringe.
> There's now so many different and opinionated takes on how you should write high performant accelerator cluster code. I love it.
There are literally only 2: SIMT (ie the same as it always was) and tiles (ie Triton). That's it. Helion is just Triton with more auto-tuning (Triton already has auto-tuning).
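For reference, the "tiles" model in something like Triton looks roughly like this (a sketch; in the SIMT model each thread would instead compute its own scalar index):

    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)   # each program owns a whole tile of indices
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        y = tl.load(y_ptr + offs, mask=mask)
        tl.store(out_ptr + offs, x + y, mask=mask)

    # launch: add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)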
But then again, I've heard that it's this low level because it's meant for engine developers.
Helion abstracts the syntax and design for calculating λ-functions, which it converts into a kernel config.
>> out = torch.empty([m, n], dtype=x.dtype, device=x.device)
The accumulator has been initialized to zero, since, well, they have to add stuff into it.
>> acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
> idiomatic
No, as far as I have seen, they generally try not to initialize if it's not necessary.
> overhead
There is the memory bandwidth point, as you might expect. But additionally, when using high-level interfaces like PyTorch, writing torch.zeros(512, 512) launches a whole kernel (tens of microseconds) just for that line. So that's CPU -> GPU -> back to CPU, and then it does the next line, where it goes to the GPU again and uses that memory. So in these cases you make sure to avoid it if it's in a hot path. Ideally you want the second kernel to do the initialization itself; when you write CUDA C++ yourself, this is how you typically do it. Helion, being a compiler, might be doing this optimization, but runtime-based (eager) torch clearly can't.
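A minimal illustration of that point in plain PyTorch (sizes are arbitrary):

    import torch

    m, n = 512, 512

    # torch.zeros allocates AND launches a fill kernel before the next line runs,
    # i.e. an extra CPU -> GPU round trip in eager mode.
    out = torch.zeros(m, n, device="cuda")

    # torch.empty only allocates; the consuming kernel is expected to write every
    # element itself (e.g. via an in-kernel accumulator like hl.zeros above),
    # which is the same pattern you'd use in hand-written CUDA C++.
    out = torch.empty(m, n, device="cuda")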
For best performance I would presume one needs low-level access to hardware knobs. And these kernel primitives are written once and reused. So what is the point of a DSL that dumbs things down as a wrapper around Triton?
If I had to run on AMD I'd rather deal with their hipify tooling.
Getting rid of the for loop over an array index doesn't make it easier to understand the hard parts. Losing the developer perf and debug tooling is absolutely not worth the tradeoff.
For me I'd rather deal with Jax or Numba, and if that still wasn't enough, I would jump straight to CUDA.
It's possible I'm an old fogey with bias, though. It's true that I've spent a lot more time with CUDA than with the new DSLs on the block.
One of the main values of Triton is that it significantly expanded the scope of folks who can write kernels - I think Helion could expand the scope even more.