So, yes, you can run DeepSeek R1 on it, but there are cheaper options (if we only talk about inference).
So that would put a p300 unit at 64 GB GDDR6 and 1 TB/s bandwidth. Very competitive, considering that Tenstorrent is now the only vendor to offer scale-out at a reasonable price point. Whoever figures out how to make it work with Corundum[1] for unlimited K/V cache offloading is going to make a lot of money: as agents spend more time executing tool code, decoupled from chats, individual jobs will take longer and longer, so scheduling will become more important. How do you manage TBs of K/V cache concurrently?
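To make the question concrete, here is a minimal sketch of what tiered K/V-cache management could look like: a small "fast" tier standing in for on-card GDDR6, backed by a large "slow" tier standing in for NVMe-oF-attached storage. All names and the LRU policy are my own illustration, not anything Tenstorrent or Corundum actually ships:

```python
import collections

class TieredKVCache:
    """Toy two-tier K/V cache. `fast` stands in for on-card GDDR6,
    `slow` for NVMe-oF-backed storage. Blocks are evicted from the
    fast tier in LRU order and promoted back on access."""

    def __init__(self, fast_capacity_blocks):
        self.fast_capacity = fast_capacity_blocks
        self.fast = collections.OrderedDict()  # block_id -> payload
        self.slow = {}                         # spilled blocks

    def put(self, block_id, data):
        self.fast[block_id] = data
        self.fast.move_to_end(block_id)        # mark most recently used
        while len(self.fast) > self.fast_capacity:
            victim, payload = self.fast.popitem(last=False)  # evict LRU
            self.slow[victim] = payload        # "offload" over the fabric

    def get(self, block_id):
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return self.fast[block_id]
        data = self.slow.pop(block_id)         # fetch from the slow tier
        self.put(block_id, data)               # promote back into fast
        return data
```

The real scheduling problem is deciding which sequences' blocks to keep hot while agents sit idle mid-job; a plain LRU like this is only the baseline.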
People complaining about bandwidth aren't seeing the bigger picture. Probably because they're unaware NVMe-oF exists and never kept up with modern network topologies, since the hyperscaler Kool-Aid doesn't include them.
Blackhole cards are actually usable. They cost slightly more than a used 3090 Ti with 24 GB VRAM, but they come with 32 GB GDDR6 and 4x 800G networking (apparently only Blackhole to Blackhole).
Nvidia's datacenter GPUs are at least 4x faster, but they also cost 20 times as much. There's also the fact that they have as much SRAM as Groq LPUs.
There's also another bonus: you get 16 Ascalon cores per Blackhole card. Yes, you read that right. You're getting 16 of the fastest RISC-V cores ever developed for a measly $999 to $1,299.
My only complaint is that they have these insane 300W TDPs.
No, you don't. You get licensed SiFive X280 cores, which are slow in-order cores with 512-bit vector registers (dual-issue 256-bit ALUs).
See: https://docs.tenstorrent.com/aibs/blackhole/specifications.h...
and the SiFive X280 page: https://www.sifive.com/cores/intelligence-x280
Tenstorrent really needs to put more VRAM on their cards.
If Chinese companies can mod Nvidia GPUs up to 48 or 96 GB of VRAM at a competitive price, surely Tenstorrent can too.
Variants of n300d at $2500 for 48GB and $3900 for 96GB would be instant hits.
~~24GB for $1500 simply isn't gonna do it.~~ (This part of the comment referred to the old n300 and can be updated to: 32GB for $1400 still isn't enough for success. There's some progress, but that's still too little memory considering it's exotic hardware that will bring tons of compatibility issues.)
That said, it missed the boat on MoE. The future is two-tiered memory systems; NVIDIA has already announced they're doing that. Ideally these cards would have 4-8 DIMM slots for a couple of channels of DDR5.
That would also make them far more useful for workstations/tinkering.
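The two-tier idea for MoE boils down to a placement problem: pin the hottest experts in on-card GDDR6 and spill the long tail to DDR5. A greedy sketch (expert names, sizes, and hit counts are made up for illustration; real routers would replace the static counts with runtime statistics):

```python
def place_experts(expert_hit_counts, expert_size_gb, fast_gb):
    """Greedy two-tier placement for MoE experts.

    expert_hit_counts: dict mapping expert id -> how often the router
    selected it. Hottest experts fill the fast tier (e.g. GDDR6) first;
    the rest spill to the slow tier (e.g. DDR5 DIMMs)."""
    order = sorted(expert_hit_counts, key=expert_hit_counts.get, reverse=True)
    fast, slow, used_gb = [], [], 0.0
    for expert in order:
        if used_gb + expert_size_gb <= fast_gb:
            fast.append(expert)       # pin in fast memory
            used_gb += expert_size_gb
        else:
            slow.append(expert)       # long tail lives in DDR5
    return fast, slow
```

Since only a few experts are active per token, the DDR5 tier's lower bandwidth mostly hits the cold tail, which is exactly why DIMM slots on an accelerator card would pay off for MoE.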
Real workloads remain to be seen, but if they can actually get a working build of vLLM and their cards remain actually buyable, well, they're doing better than some of the competition...
Almost, except with respect to space in the box and power usage, which are critical IMHO.
> but if they can actually get a working build of vLLM and their cards remain actually buyable, well, they're doing better than some of the competition...
That's a big if, though. Poor software support is to be expected and you'll need to factor that in, IMHO, which is why they need to beef up the memory. Of course, if software support turns out to be stellar, it may be a good enough deal.
You'll note that Apple didn't just immediately resume shipping systems with 1.5TB of RAM when they revised their own system architecture. It's taken them half a decade to recoup a third of that capacity at the VRAM-level speeds they require to unify the GPU and CPU's memory.
To run large MoE models it is.
> Increase DRAM on your card and your bandwidth goes down
Why would it?
> You'll note that Apple didn't just immediately resume shipping systems with 1.5TB of RAM when they revised their own system architecture. It's taken them half a decade to recoup a third of that capacity at the VRAM-level speeds they require to unify the GPU and CPU's memory
I fail to see how a unified architecture on a general-purpose CPU is a good illustration when we're discussing PCIe accelerator cards. The problems they face have little in common.
Found answer: https://tenstorrent.com/faq