We know that the early tokens in an autoregressive sequence disproportionately bias the outcome. I would go as far as to say that some of the magic of reasoning models is that they generate so much text they can kinda get around this.
However, diffusion seems like a much better way to solve this problem.
The model is trained in a way that encourages re-evaluating the soundness of tokens produced during the "thinking phase".
The model's state vector is kept in a state of open exploration: influenced by the already emitted tokens, but less strongly so.
The non-reasoning models were simply trained with the goal of producing useful output on the first try, and they did their best to maximize that fitness function.
Think of it like getting your teacher to write a text for you by handing in the assignment 100 times. You begin by generating completely inaccurate text, almost random, that leans perhaps a little bit towards the answer. Then you systematically begin to correct small parts of the text. The teacher sees the text and uses the red pen to correct a bunch of things. Then the corrected text is copied onto a fresh page and resubmitted to the teacher. And again. And again. And again. And again. 50 times. 100 times. That's how diffusion models work.
Technically, it adds your corrections to the text, but that's mathematical addition, not appending at the end. Also, technically, every denoising step is a teacher that's slightly different from the previous one. And and and ... but this is the basic principle. The big advantage is that this lets neural networks slowly lean towards the answer. First they decide to have 3 sections, one about X, one about Y, and one about Z; then they decide what sentences to put in; then they start thinking about individual words; then they start worrying about things like grammar; and finally about spelling and pronouns and ...
So to answer your question: diffusion networks can at any time decide to send out a correction that effectively erases the text (in several ways). So they can always start over by just correcting everything all at once back to randomness.
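If it helps to see the loop, here's a toy sketch in Python. Everything in it (the vocabulary, the random stand-in "model", the confidence-based keep rule) is made up purely to show the shape of masked-diffusion decoding, not any real system:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
MASK = "[MASK]"

def denoise_step(tokens):
    """Toy stand-in for the model: propose a word and a made-up
    confidence score for every position. A real dLLM predicts all
    positions in parallel from the full, partially masked context."""
    return [(random.choice(VOCAB), random.random()) for _ in tokens]

def generate(length=8, steps=10, keep_frac=0.1):
    tokens = [MASK] * length  # start from pure "noise"
    for step in range(steps):
        proposals = denoise_step(tokens)
        # Keep only the most confident proposals this round; everything
        # else goes back to [MASK]. Because nothing is committed
        # permanently, a "confident" token from an earlier step can
        # still be erased and rewritten on a later pass.
        k = max(1, int(length * keep_frac * (step + 1)))
        ranked = sorted(range(length), key=lambda i: -proposals[i][1])
        tokens = [MASK] * length
        for i in ranked[:k]:
            tokens[i] = proposals[i][0]
        print(f"step {step}: {' '.join(tokens)}")
    return tokens

generate()
```

The re-masking step is where the "start over" freedom lives: since no token is ever locked in, the model can always correct everything at once back to randomness.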
> dLLMs can generate certain important portions first, validate it, and then continue the rest of the generation.
If you pause the animation in the linked tweet (not the one on the page), you can see that the intermediate versions are full of, well, baloney.
(And anyone who has messed around with diffusion-based image generation knows the models are perfectly happy to hallucinate.)
However, autoregressive models that generate one token at a time are usually more accurate than parallel models that generate multiple tokens at a time.
In diffusion LLMs, these two effects interact. You can trade them off by choosing how many tokens are generated at a time, and how many future tokens are used to predict the next set of tokens.
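A sketch of where that knob lives, with a hypothetical `toy_model` interface standing in for a real parallel decoder:

```python
import random

def toy_model(context, n_masked):
    """Hypothetical stand-in for one model call: returns `n_masked`
    token ids at once, conditioned (in a real model) on `context`."""
    return [random.randrange(1000) for _ in range(n_masked)]

def decode(prompt_ids, total_len, block_size):
    """Semi-autoregressive decoding: commit `block_size` tokens per call.
    block_size=1 is ordinary left-to-right decoding (one call per token,
    maximum committed context for each prediction); larger blocks mean
    fewer calls, but tokens within a block can't see each other."""
    out = list(prompt_ids)
    target = len(prompt_ids) + total_len
    while len(out) < target:
        n = min(block_size, target - len(out))
        out.extend(toy_model(out, n))
    return out

decode([1, 2, 3], total_len=16, block_size=4)  # 4 model calls instead of 16
```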
With that said, I'm still excited about diffusion -- if it offers different cost points, and different interaction modes with generated text, it will be useful.
I made a logical leap from there.
It brings up interesting questions, like: what's the equivalence between smaller diffusion models, which consume more compute because they take a greater number of diffusion steps, and larger traditional LLMs, which essentially take a single step? How effective is decoupling the context window size from the diffusion window size? Is there an optimal ratio?
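A back-of-envelope version of the first question, with all numbers invented and the crude 2 * params FLOPs-per-token rule of thumb (ignoring KV caching, attention cost, and how well diffusion steps parallelize across the window):

```python
ar_params   = 70e9  # hypothetical large autoregressive model
diff_params = 7e9   # hypothetical smaller diffusion model
diff_steps  = 50    # denoising passes over the whole window

flops_ar   = 2 * ar_params                 # one forward pass per token
flops_diff = 2 * diff_params * diff_steps  # every step revisits every token

print(flops_ar / flops_diff)  # ~0.2: here the 7B dLLM at 50 steps costs
                              # ~5x more per token than the 70B AR model
```

By that crude accounting, step count matters as much as parameter count, which is exactly why the optimal-ratio question is interesting.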
As for the premise that models can't self-correct: that's not an argument I've ever seen; transformers have global attention across the context window. It's that their prediction abilities get increasingly poor as generation goes on. Is anyone having a different experience?
Everyone doing some form of "prompt engineering", whether with optimized ML tuning, with a human in the loop, or with some kind of agentic fine-tuning step, runs into perplexity errors that get worse with longer contexts, in my opinion.
There's some "sweet spot" for how long a prompt to use in many use cases, for example. It's clear to me that less is more a lot of the time.
Whether diffusion will fare significantly better on error is another question. Intuition would lead me to think that more flexibility with token rewriting should enable much greater error-correction capabilities. Ultimately, as different approaches come online, we'll get PPL comparables and the data will speak for itself.
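For what it's worth, the context-length effect is easy to eyeball yourself. A quick sketch with HuggingFace transformers ("gpt2" and the input file are placeholders) that buckets mean next-token loss by position; note it probes loss over a fixed text, which is a proxy for, not a measurement of, generation-time error:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = open("long_document.txt").read()  # placeholder: any long text
ids = tok(text, return_tensors="pt", truncation=True).input_ids

with torch.no_grad():
    logits = model(ids).logits

# Shift so each target token is scored by its preceding context.
losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")

# Mean loss per 128-token bucket: if later buckets are consistently
# worse, that's the degradation-with-position effect.
for start in range(0, losses.numel(), 128):
    chunk = losses[start:start + 128]
    print(f"tokens {start}-{start + chunk.numel()}: {chunk.mean().item():.3f}")
```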
They don't? That's not the case at all, unless I am misunderstanding.
So I followed the link, and gave the model this bit of conversation starter:
> You still go mostly left to right.
The denoising animation it generated went like this:
> [Yes] [.] [MASK] [MASK] [MASK] ... [MASK]
and proceeded by deleting the mask elements on the right one by one, leaving just the "Yes.".
:)
At some point in the future, you will be able to autogen a 10M line codebase in a few seconds on a giant GPU cluster.
Of course, these too will have scaling laws.
I don't remember the article; I read it a decade ago. It's like he was doing diffusion in his mind, subconsciously perhaps.
Something akin to ComfyUi but for LLMs would open up a world of possibilities.
Scroll down a bit on the website to see a screenshot.
Anyway, I think we'd expect it to usually be more-or-less left-to-right -- we usually decide what to write or speak left-to-right, too, and we don't seem to suffer much for it.
(Unrelated: it's funny that the example generated code has a variable "my array" with a space in it.)
Just looking at all of the amazing tools and workflows that people have made with ComfyUI and stuff makes me wonder what we could do with diffusion LMs. It seems diffusion models are much more easily hackable than LLMs.
eventually they run out of memory or patience
Diffusion LMs are interesting and I'm looking forward to seeing how they develop, but from playing around with that model, it's GPT-2 level. I suspect it will need to be significantly scaled up before we can meaningfully compare it to the autoregressive paradigm.
With gay, on the other hand, gay people call each other gay and are usually okay being labeled as gay. So, it's still in use, and I think it's fine to push back against using it to mean "lame" or whatever.
Finally, you should keep in mind that the author may not be American or familiar with American social trends. "Retarded" might be just fine in South Africa or Australia (I don't know). Similar to how very few Americans would bat an eye at someone using the phrase "spaz out", whereas it is viewed as very offensive in England.
Why do things in life that will hurt someone who'll likely just retreat rather than confront you? Be the good guy.
Words don't need to retain intrinsic hurtfulness; their hurtfulness comes from their usage, and the hurtful intent with which they are spoken. We don't need to yield those words to make them the property of 1990s schoolyard bullies in perpetual ownership.
To that extent I'd still say this article's usage is not great.
Yes; and a rose by any other name would smell as sweet.
Words don't need to retain intrinsic hurtfulness, but it's not quite right that the hurtfulness comes from the usage either. The hurtfulness comes from the actual referent, combined with intent.
If I tell someone they are idiotic, imbecilic, moronic, mentally retarded, mentally handicapped, mentally challenged, I am merely iterating through a historical list of words and phrases used to describe the same real thing in the world. The hurt fundamentally comes from describing someone of sound mind as if they are not. We all know that, given the choice, we don't want to have a cognitive disability, nor to be thought of as if we did.
The euphemism treadmill tries to pretend that the referent isn't an undignified position to be in. But because it fundamentally is, no matter what words are used, they can still be used to insult.
The fact is, it's _that_ word that's evolved into something hurtful. So rather than be the guy who sticks up for the word and tries to convince everyone it shouldn't be hurtful, I just decided to stop using it. The reason I stopped was seeing first-hand how it affected someone with Down Syndrome who heard me saying it. Sometimes real life beats theoretical debate. It's something I still feel shame about nearly 20 years later.
It wasn't a particularly onerous decision to stop using it, or one that opened the floodgates to other words being 'banned'. And if someone uses it and hasn't realized that, then move on - just avoid using it next time. Not a big deal. It's the obnoxious, purposefully hurtful use of it that's not great (which doesn't seem to be the case here, tbh). It's the intent that matters more.
Sufficiently humourous sneering is indistinguishable from progress
Sufficiently high social status is indistinguishable from wisdom
Sufficiently anti-regressive compression is indistinguishable from sentience (--maybe the SCHMIDHUBER)
This is so utterly fascinating to watch.
Three years ago this would have cost you your job. Now everybody's back at it again.
What is happening?
They're not talking about a human. To me that makes it feel very different.
However, there's also a large component coming from the current political situation. People feel more confident to push back against things like the policing of word usage. They're less likely to get "cancelled" now. They feel more confident that the zeitgeist is on their side now. They're probably right.
I should not have laughed at this.
I'm with you on this, also speaking as a strong leftist.
I do think that "banning", or at least strongly condemning, the use of words is reasonable when the specific group being slurred is clear that they consider it a slur and want it to stop. But not when it's social justice warriors getting offended on behalf of other people.
However, I think it's absolutely ridiculous that even when discussing the banning of these words, we're not allowed to use them directly. We are supposed to say "n-word", "r-word" even when discussing in an academic sense. Utter nonsense, it's as if saying these words out loud would conjure a demon.
Crazy assholes will argue that it is an insignificant inconvenience, and hence that anyone who uses the old language must be using it maliciously and on purpose, because they are ableist, racist, or whatever.
This then gives the assholes justification to behave like bigots towards the allegedly ableist person. The goal is to dress up your own abusive bullying as virtuous, even though deep down you don't actually care about disabled people.
However, most of them are well meaning. They're misguided rather than assholes. They really do want to take action for social improvement. It's just that real change is too hard and requires messy things like protesting on the street or getting involved in politics and law. So, they fall back on things like policing words, or calling out perceived bad actors, which they can do from the comfort of their homes via the internet.
To be fair, some genuinely bad people have been "cancelled". The "me too" movement didn't happen without reason. It's just that it went too far, and started ignoring pesky things like evidence, or innocent until proven otherwise.
CBT, especially the "B" part, was essentially created to help people overcome phobias of things like words. There are some great books on CBT, and there's also research showing that working alone from a book can often be as effective as working with a therapist. The classic Feeling Good by David Burns is still an amazing book, even if the case studies are a bit dated.