We know that the early tokens in an autoregressive sequence disproportionately bias the outcome. I would go as far as to say that some of the magic of reasoning models is that they generate so much text they can kinda get around this.
However, diffusion seems like a much better way to solve this problem.
The model is trained in a way that encourages re-evaluating the soundness of tokens produced during the "thinking phase".
The model's state vector is kept in a state of open exploration: influenced by the already emitted tokens, but less strongly so.
The non-reasoning models were simply trained with the goal of producing useful output on the first try, and they did their best to maximize that fitness function.
Think of it like getting your teacher to write a text for you by handing in the assignment 100 times. You begin by generating completely inaccurate text, almost random, that leans perhaps a little bit towards the answer. Then you systematically begin to correct small parts of the text. The teacher sees the text and uses the red pen to correct a bunch of things. Then the corrected text is copied onto a fresh page and resubmitted to the teacher. And again. And again. And again. And again. 50 times. 100 times. That's how diffusion models work.
Technically, it adds your corrections to the text, but that's mathematical addition, not appending at the end. Also, technically, every denoising step is a teacher that's slightly different from the previous one. And and and ... but this is the basic principle. The big advantage is that this lets neural networks slowly lean towards the answer. First they decide to have 3 sections, one about X, one about Y, and one about Z; then they decide what sentences to put in; then they start thinking about individual words; then they start worrying about things like grammar; and finally about spelling and pronouns and ...
So to answer your question: diffusion networks can at any time decide to send out a correction that effectively erases the text (in several ways). So they can always start over by just correcting everything all at once back to randomness.
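If it helps to see the loop, here's a toy sketch in Python. Everything in it (the vocabulary, the random stand-in "model", the confidence-based keep rule) is made up purely to show the shape of masked-diffusion decoding, not any real system:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
MASK = "[MASK]"

def denoise_step(tokens):
    """Toy stand-in for the model: propose a word and a made-up
    confidence score for every position. A real dLLM predicts all
    positions in parallel from the full, partially masked context."""
    return [(random.choice(VOCAB), random.random()) for _ in tokens]

def generate(length=8, steps=10, keep_frac=0.1):
    tokens = [MASK] * length  # start from pure "noise"
    for step in range(steps):
        proposals = denoise_step(tokens)
        # Keep only the most confident proposals this round; everything
        # else goes back to [MASK]. Because nothing is committed
        # permanently, a "confident" token from an earlier step can
        # still be erased and rewritten on a later pass.
        k = max(1, int(length * keep_frac * (step + 1)))
        ranked = sorted(range(length), key=lambda i: -proposals[i][1])
        tokens = [MASK] * length
        for i in ranked[:k]:
            tokens[i] = proposals[i][0]
        print(f"step {step}: {' '.join(tokens)}")
    return tokens

generate()
```

The re-masking step is where the "start over" freedom lives: since no token is ever locked in, the model can always correct everything at once back to randomness.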
> dLLMs can generate certain important portions first, validate it, and then continue the rest of the generation.
If you pause the animation in the linked tweet (not the one on the page), you can see that the intermediate versions are full of, well, baloney.
(And anyone who has messed around with diffusion-based image generation knows the models are perfectly happy to hallucinate.)
However, autoregressive models that generate one token at a time are usually more accurate than parallel models that generate multiple tokens at a time.
In diffusion LLMs, these two effects interact. You can trade them off by choosing how many tokens are generated at a time, and how many future tokens are used to predict the next set of tokens.
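A sketch of where that knob lives, with a hypothetical `toy_model` interface standing in for a real parallel decoder:

```python
import random

def toy_model(context, n_masked):
    """Hypothetical stand-in for one model call: returns `n_masked`
    token ids at once, conditioned (in a real model) on `context`."""
    return [random.randrange(1000) for _ in range(n_masked)]

def decode(prompt_ids, total_len, block_size):
    """Semi-autoregressive decoding: commit `block_size` tokens per call.
    block_size=1 is ordinary left-to-right decoding (one call per token,
    maximum committed context for each prediction); larger blocks mean
    fewer calls, but tokens within a block can't see each other."""
    out = list(prompt_ids)
    target = len(prompt_ids) + total_len
    while len(out) < target:
        n = min(block_size, target - len(out))
        out.extend(toy_model(out, n))
    return out

decode([1, 2, 3], total_len=16, block_size=4)  # 4 model calls instead of 16
```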
With that said, I'm still excited about diffusion -- if it offers different cost points, and different interaction modes with generated text, it will be useful.
I made a logical leap from there.
It brings up interesting questions, like: what's the equivalence between smaller diffusion models, which consume more compute because they take a greater number of diffusion steps, and larger traditional LLMs, which essentially take a single step? How effective is decoupling the context window size from the diffusion window size? Is there an optimal ratio?
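A back-of-envelope version of the first question, with all numbers invented and the crude 2 * params FLOPs-per-token rule of thumb (ignoring KV caching, attention cost, and how well diffusion steps parallelize across the window):

```python
ar_params   = 70e9  # hypothetical large autoregressive model
diff_params = 7e9   # hypothetical smaller diffusion model
diff_steps  = 50    # denoising passes over the whole window

flops_ar   = 2 * ar_params                 # one forward pass per token
flops_diff = 2 * diff_params * diff_steps  # every step revisits every token

print(flops_ar / flops_diff)  # ~0.2: here the 7B dLLM at 50 steps costs
                              # ~5x more per token than the 70B AR model
```

By that crude accounting, step count matters as much as parameter count, which is exactly why the optimal-ratio question is interesting.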
As for the premise that models can't self-correct: that's not an argument I've ever seen; transformers have global attention across the context window. It's that their prediction abilities get increasingly poor as generation goes on. Is anyone having a different experience?
Everyone doing some form of "prompt engineering", whether with optimized ML tuning, with a human in the loop, or with some kind of agentic fine-tuning step, runs into perplexity errors that get worse with longer contexts, in my opinion.
There's some "sweet spot" for how long a prompt to use in many use cases, for example. It's clear to me that less is more a lot of the time.
Whether diffusion will fare significantly better on error is another question. Intuition would lead me to think that more flexibility with token rewriting should enable much greater error-correction capabilities. Ultimately, as different approaches come online, we'll get PPL comparables and the data will speak for itself.
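For what it's worth, the context-length effect is easy to eyeball yourself. A quick sketch with HuggingFace transformers ("gpt2" and the input file are placeholders) that buckets mean next-token loss by position; note it probes loss over a fixed text, which is a proxy for, not a measurement of, generation-time error:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = open("long_document.txt").read()  # placeholder: any long text
ids = tok(text, return_tensors="pt", truncation=True).input_ids

with torch.no_grad():
    logits = model(ids).logits

# Shift so each target token is scored by its preceding context.
losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")

# Mean loss per 128-token bucket: if later buckets are consistently
# worse, that's the degradation-with-position effect.
for start in range(0, losses.numel(), 128):
    chunk = losses[start:start + 128]
    print(f"tokens {start}-{start + chunk.numel()}: {chunk.mean().item():.3f}")
```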
They don't? That's not the case at all, unless I am misunderstanding.
So I followed the link, and gave the model this bit of conversation starter:
> You still go mostly left to right.
The denoising animation it generated went like this:
> [Yes] [.] [MASK] [MASK] [MASK] ... [MASK]
and proceeded by deleting the mask elements on the right one by one, leaving just the "Yes.".
:)
At some point in the future, you will be able to autogen a 10M line codebase in a few seconds on a giant GPU cluster.
Of course, these too will have scaling laws.
I don't remember the article; I read it a decade ago. It's like he was doing diffusion in his mind, subconsciously perhaps.
Something akin to ComfyUi but for LLMs would open up a world of possibilities.
Scroll down a bit on the website to see a screenshot.
Anyway, I think we'd expect it to usually be more-or-less left-to-right -- we usually decide what to write or speak left-to-right, too, and we don't seem to suffer much for it.
(Unrelated: it's funny that the example generated code has a variable "my array" with a space in it.)
Just looking at all of the amazing tools and workflows that people have made with ComfyUI and stuff makes me wonder what we could do with diffusion LMs. It seems diffusion models are much more easily hackable than LLMs.
eventually they run out of memory or patience
Diffusion LMs are interesting and I'm looking forward to seeing how they develop, but from playing around with that model, it's GPT-2 level. I suspect it will need to be significantly scaled up before we can meaningfully compare it to the autoregressive paradigm.
With gay, on the other hand, gay people call each other gay and are usually okay being labeled as gay. So, it's still in use, and I think it's fine to push back against using it to mean "lame" or whatever.
Finally, you should keep in mind that the author may not be American or familiar with American social trends. "Retarded" might be just fine in South Africa or Australia (I don't know). Similar to how very few Americans would bat an eye at someone using the phrase "spaz out", whereas it is viewed as very offensive in England.
Why do things in life that will hurt someone who'll likely just retreat rather than confront you? Be the good guy.
Words don't need to retain intrinsic hurtfulness; their hurtfulness comes from their usage, and the hurtful intent with which they are spoken. We don't need to yield those words to make them the property of 1990s schoolyard bullies in perpetual ownership.
To that extent I'd still say this article's usage is not great.
Yes; and a rose by any other name would smell as sweet.
Words don't need to retain intrinsic hurtfulness, but it's not quite right that the hurtfulness comes from the usage either. The hurtfulness comes from the actual referent, combined with intent.
If I tell someone they are idiotic, imbecilic, moronic, mentally retarded, mentally handicapped, mentally challenged, I am merely iterating through a historical list of words and phrases used to describe the same real thing in the world. The hurt fundamentally comes from describing someone of sound mind as if they are not. We all know that, given the choice, we don't want to have a cognitive disability, nor to be thought of as if we did.
The euphemism treadmill tries to pretend that the referent isn't an undignified position to be in. But because it fundamentally is, no matter what words are used, they can still be used to insult.
The fact is, it's _that_ word that's evolved into something hurtful. So rather than be the guy who sticks up for the word and tries to convince everyone it shouldn't be hurtful, I just decided to stop using it. The reason I stopped was seeing first-hand how it affected someone with Down Syndrome who heard me saying it. Sometimes real life beats theoretical debate. It's something I still feel shame about nearly 20 years later.
It wasn't a particularly onerous decision to stop using it, or one that opened the floodgates to other words being 'banned'. And if someone uses it and hasn't realized that, then move on - just avoid using it next time. Not a big deal. It's the obnoxious, purposefully hurtful use of it that's not great (which doesn't seem to be the case here, tbh). It's the intent that matters more.
Sufficiently humourous sneering is indistinguishable from progress
Sufficiently high social status is indistinguishable from wisdom
Sufficiently anti-regressive compression is indistinguishable from sentience (--maybe the SCHMIDHUBER)
This is so utterly fascinating to watch.
Three years ago this would have cost you your job. Now everybody's back at it again.
What is happening?
They're not talking about a human. To me that makes it feel very different.
However, there's also a large component coming from the current political situation. People feel more confident to push back against things like the policing of word usage. They're less likely to get "cancelled" now. They feel more confident that the zeitgeist is on their side now. They're probably right.
I should not have laughed at this.
I'm with you on this, also speaking as a strong leftist.
I do think that "banning", or at least strongly condemning, the use of words is reasonable when the specific group being slurred is clear that they consider it a slur and want it to stop. But not when it's social justice warriors getting offended on behalf of other people.
However, I think it's absolutely ridiculous that even when discussing the banning of these words, we're not allowed to use them directly. We are supposed to say "n-word", "r-word" even when discussing in an academic sense. Utter nonsense, it's as if saying these words out loud would conjure a demon.
Crazy assholes will argue that it is an insignificant inconvenience, and hence that anyone who uses the old language must be using it maliciously and on purpose, because they are ableist, racist, or whatever.
This then gives the assholes justification to behave like bigots towards the allegedly ableist person. The goal is to dress up your own abusive bullying as virtuous, even though deep down you don't actually care about disabled people.
However, most of them are well meaning. They're misguided rather than assholes. They really do want to take action for social improvement. It's just that real change is too hard and requires messy things like protesting on the street or getting involved in politics and law. So, they fall back on things like policing words, or calling out perceived bad actors, which they can do from the comfort of their homes via the internet.
To be fair, some genuinely bad people have been "cancelled". The "me too" movement didn't happen without reason. It's just that it went too far, and started ignoring pesky things like evidence, or innocent until proven otherwise.
CBT, especially the "B" part, was essentially created to help people overcome phobias of things like words. There are some great books on CBT, and there's also research showing that working alone from a book can often be as effective as working with a therapist. The classic Feeling Good by David Burns is still an amazing book, even if the case studies are a bit dated.