So, yes, you can run DeepSeek R1 on it, but there are cheaper options (if we only talk about inference).
So that would put a p300 unit at 64 GB GDDR6 and 1 TB/s bandwidth. Very competitive, considering that Tenstorrent is now the only vendor to offer scale-out at a reasonable price point. Whoever figures out how to make it work with Corundum[1] for unlimited K/V cache offloading is going to make a lot of money: as agents spend more time executing tool code, decoupled from chats, individual jobs will take longer and longer, so scheduling will become more important. How do you manage TBs of K/V cache concurrently?
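To make the question concrete, here is a minimal sketch of what tiered K/V-cache management could look like: a small "fast" tier standing in for on-card GDDR6, backed by a large "slow" tier standing in for NVMe-oF-attached storage. All names and the LRU policy are my own illustration, not anything Tenstorrent or Corundum actually ships:

```python
import collections

class TieredKVCache:
    """Toy two-tier K/V cache. `fast` stands in for on-card GDDR6,
    `slow` for NVMe-oF-backed storage. Blocks are evicted from the
    fast tier in LRU order and promoted back on access."""

    def __init__(self, fast_capacity_blocks):
        self.fast_capacity = fast_capacity_blocks
        self.fast = collections.OrderedDict()  # block_id -> payload
        self.slow = {}                         # spilled blocks

    def put(self, block_id, data):
        self.fast[block_id] = data
        self.fast.move_to_end(block_id)        # mark most recently used
        while len(self.fast) > self.fast_capacity:
            victim, payload = self.fast.popitem(last=False)  # evict LRU
            self.slow[victim] = payload        # "offload" over the fabric

    def get(self, block_id):
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return self.fast[block_id]
        data = self.slow.pop(block_id)         # fetch from the slow tier
        self.put(block_id, data)               # promote back into fast
        return data
```

The real scheduling problem is deciding which sequences' blocks to keep hot while agents sit idle mid-job; a plain LRU like this is only the baseline.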
People complaining about bandwidth aren't seeing the bigger picture. Probably because they're unaware NVMe-oF exists and never kept up with modern network topologies, since the hyperscaler Kool-Aid doesn't include them.
Blackhole cards are actually usable. They cost slightly more than a used 3090 Ti with 24 GB VRAM, but they come with 32 GB GDDR6 and 4x 800G networking (apparently only Blackhole to Blackhole).
Nvidia's datacenter GPUs are at least 4x faster, but they also cost 20 times as much. There's also the fact that they have as much SRAM as Groq LPUs.
There's also another bonus: you get 16 Ascalon cores per Blackhole card. Yes, you read that right. You're getting 16 of the fastest RISC-V cores ever developed for a measly $999 to $1,299.
My only complaint is that they have these insane 300W TDPs.
No, you don't. You get licensed SiFive X280 cores, which are slow in-order cores with 512-bit vector registers (dual-issue 256-bit ALUs).
See: https://docs.tenstorrent.com/aibs/blackhole/specifications.h...
and the SiFive X280 page: https://www.sifive.com/cores/intelligence-x280
Tenstorrent really needs to put more VRAM on their cards.
If Chinese companies can mod Nvidia GPUs up to 48 or 96 GB of VRAM at a competitive price, surely Tenstorrent can too.
Variants of n300d at $2500 for 48GB and $3900 for 96GB would be instant hits.
~~24GB for $1500 simply isn't gonna do it.~~ (This part of the comment referred to the old n300 and can be updated to: 32GB for $1400 still isn't enough for success. There's some progress, but that's still too little memory considering it's exotic hardware that will bring tons of compatibility issues.)
That said, it missed the boat on MoE. The future is two-tiered memory systems; NVIDIA has already announced they're doing that. Ideally these cards would have 4-8 DIMM slots for a couple of channels of DDR5.
That would also make them far more useful for workstations/tinkering.
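The two-tier idea for MoE boils down to a placement problem: pin the hottest experts in on-card GDDR6 and spill the long tail to DDR5. A greedy sketch (expert names, sizes, and hit counts are made up for illustration; real routers would replace the static counts with runtime statistics):

```python
def place_experts(expert_hit_counts, expert_size_gb, fast_gb):
    """Greedy two-tier placement for MoE experts.

    expert_hit_counts: dict mapping expert id -> how often the router
    selected it. Hottest experts fill the fast tier (e.g. GDDR6) first;
    the rest spill to the slow tier (e.g. DDR5 DIMMs)."""
    order = sorted(expert_hit_counts, key=expert_hit_counts.get, reverse=True)
    fast, slow, used_gb = [], [], 0.0
    for expert in order:
        if used_gb + expert_size_gb <= fast_gb:
            fast.append(expert)       # pin in fast memory
            used_gb += expert_size_gb
        else:
            slow.append(expert)       # long tail lives in DDR5
    return fast, slow
```

Since only a few experts are active per token, the DDR5 tier's lower bandwidth mostly hits the cold tail, which is exactly why DIMM slots on an accelerator card would pay off for MoE.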
Real workloads remain to be seen, but if they can actually get a working build of vLLM and their cards remain actually buyable, well, they're doing better than some of the competition...
Almost, except with respect to space in the box and power usage, which are critical IMHO.
> but if they can actually get a working build of vLLM and their cards remain actually buyable, well, they're doing better than some of the competition...
That's a big if, though. Poor software support is to be expected and you'll need to factor that in, IMHO, which is why they need to beef up the memory. Of course, if software support turns out to be stellar, it may be a good enough deal.
You'll note that Apple didn't just immediately resume shipping systems with 1.5TB of RAM when they revised their own system architecture. It's taken them half a decade to recoup a third of that capacity at the VRAM-level speeds they require to unify the GPU and CPU's memory.
To run large MoE models it is.
> Increase DRAM on your card and your bandwidth goes down
Why would it?
> You'll note that Apple didn't just immediately resume shipping systems with 1.5TB of RAM when they revised their own system architecture. It's taken them half a decade to recoup a third of that capacity at the VRAM-level speeds they require to unify the GPU and CPU's memory
I fail to see how a unified architecture on a general-purpose CPU is a good illustration when we're discussing PCIe accelerator cards. The problems they face have little in common.
Found answer: https://tenstorrent.com/faq