Ok, zero data, except the data used in the teacher model.
>> To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch.
Giving the benefit of the doubt, they're just phrasing it loosely, but the way they use "from scratch" sure reads like a claim that they found a way to initialise LLMs with zero data. Only the absurdity of that claim protects the reader from such a misunderstanding, and that's never a good thing in a research paper.
> However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence.
> To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch.
> Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver.
Training an LLM is a multi-stage process[1], and they're tackling the stage at the end: that's where you do fine-tuning or reinforcement learning. They're not training an LLM from scratch; they're explicitly stating they start from a base LLM, ie a pretrained, non-tuned model.
As I understand it, and as they mention, training data for the latter stages has typically required high-quality human-curated samples in large numbers, even if they're augmented using LLMs, say by generating multiple variations of each human-curated training sample.
Their proposal is to have a generative adversarial network generate that data without any initial human input, ie from scratch.
[1]: https://snorkel.ai/blog/large-language-model-training-three-...
Fair point. It would indeed have been much more clear had they written something like this instead:
a fully autonomous framework that generates its own fine-tuning/RL training data from scratch.
Ahh, GPT-4o is the arbiter.
So, basically, this is a way to perform LLM model compression (GPT-4o to qwen3) while maximizing the in-distribution domain size. As such, it seems reasonable and useful.
However the reliance on an arbiter LLM makes the claim that it will overcome the problem of a lack of training data unreasonable. Once the target LLM is scaled up to reach the in-distribution domain size of the arbiter, it seems to me it will turn back into a hallucination amplifier.
The solver/challenger is the GAN discriminator/generator.
The challenger is trained to create difficult questions. The solver is trained to strengthen pathways that correctly solve the questions like so:
> To guide the Challenger toward producing challenging yet solvable questions, we first define an uncertainty score. For a generated question x, we query the current Solver... The most frequent response is treated as the pseudo-label y˜(x), and we compute the Solver’s empirical accuracy....The uncertainty reward is then defined.... This function incentivizes questions where the Solver is maximally uncertain (accuracy approaches 50%)
Identifying the best pseudo-label seems like it would be the limitation of the approach.
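A minimal sketch of that pseudo-labeling step, assuming the reward peaks when the Solver's empirical accuracy is 50% (the exact reward shape here is my assumption, not necessarily the paper's formula):

    from collections import Counter

    def uncertainty_reward(solver_answers):
        # Majority vote over sampled Solver answers becomes the pseudo-label.
        counts = Counter(solver_answers)
        pseudo_label, votes = counts.most_common(1)[0]
        # Empirical accuracy = fraction of samples agreeing with the pseudo-label.
        accuracy = votes / len(solver_answers)
        # Highest reward when the Solver is maximally uncertain (accuracy ~ 50%).
        # The 1 - 2*|acc - 0.5| shape is an assumption about the exact formula.
        reward = 1.0 - 2.0 * abs(accuracy - 0.5)
        return pseudo_label, accuracy, reward

    # 10 sampled answers: the Solver agrees with itself 60% of the time.
    print(uncertainty_reward(["42"] * 6 + ["41"] * 4))
    # pseudo-label '42', accuracy 0.6, reward ~ 0.8

Note that nothing in this loop checks whether "42" is actually correct, which is exactly the pseudo-label limitation mentioned above.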
Yes, I think this says in a different way what I'm trying to express.
In a GAN, the Discriminator pegs the training to some chosen reality (assuming the "real" data set is truly real). In Challenger/Solver alone, there is no peg. The Solver could hallucinate consistently and "win" the race; it's consistency, not correctness, that the setup rewards.
With GPT-4o as an arbiter of the Challenger/Solver training it provides the reality peg (or rather, the peg that biases toward GPT-4o's training set).
One network typically generates tasks for the other, and is rewarded if it manages to make the other network fail the task. The other network is rewarded if it successfully completes the task.
Thus the adversarial network tries to find weaknesses to exploit, and the combined training makes the solving network much stronger. Or at least that's the idea.
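As a toy sketch of that reward structure (the binary formulation here is illustrative, not the paper's exact scheme):

    def adversarial_rewards(solver_answer, reference_answer):
        # The challenger/generator is rewarded when the solver gets the task wrong;
        # the solver is rewarded when it gets it right. How "wrong" is judged
        # (reference answers, majority vote, an arbiter model) is the crux.
        solved = solver_answer == reference_answer
        challenger_reward = 0.0 if solved else 1.0
        solver_reward = 1.0 if solved else 0.0
        return challenger_reward, solver_reward

    print(adversarial_rewards("4", "4"))  # (0.0, 1.0): solver wins this round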
[1]: https://en.wikipedia.org/wiki/Generative_adversarial_network
Not scientifically perpetual. But definitely, relative to the finite future lifetime of the human race, operable for perpetuity.
And by extracting dark energy, we can not only turn the Big Rip around, but also power a ship to high speed by pulling dark energy out of the space ahead of it, contracting space in front of the ship as we go. Like fusion, we can use extracted dark energy to extract more dark energy. Essentially smoothly teleporting forward. No more fundamental speed limits relative to distant observers. Looking forward to exploring beyond the observable universe.
It isn't "free" though. There are unique risks.
It probably will eventually stop, though. Something about the Sun becoming a red giant...
(I feel like this post is underappreciated by at least 20%. :D )
This will work in a sense. It will do… something… and learn… something. It just won't bear any relation to the physical universe. See also: procedural landscape generators, etc.
Many decades ago, statisticians made a similar erroneous assumption: that maximum likelihood estimators, which also minimize (cross-)entropy, are "optimal" in the sense of saturating error bounds. The fact that you can do better with smarter regularisation is key to why DL works in the first place.
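A toy illustration of that point (my sketch, not anyone's specific result): with few noisy samples and many parameters, a ridge-penalised estimate of regression weights typically beats the plain maximum-likelihood (least-squares) estimate.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma = 40, 30, 2.0                  # few samples, many parameters, noisy
    w_true = rng.normal(0, 0.3, d)
    X = rng.normal(size=(n, d))
    y = X @ w_true + rng.normal(0, sigma, n)

    # Maximum likelihood (ordinary least squares).
    w_mle = np.linalg.lstsq(X, y, rcond=None)[0]

    # Ridge regression: same likelihood, plus an L2 penalty (lambda chosen by hand).
    lam = 10.0
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    print("MLE error:  ", np.linalg.norm(w_mle - w_true))
    print("Ridge error:", np.linalg.norm(w_ridge - w_true))  # typically smaller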
I'm no shill for AI, but you're going to need a better argument for why runaway AI up to obscene levels of performance is not theoretically possible. There are quite a few people, including some of my colleagues, that are looking in earnest but so far no one has found one.
There may be additional feedback loops, but fundamentally, that is what it is doing. Sure, it will show you what steps it takes to arrive at a conclusion, but it is just predicting the steps, the conclusion and the potential validity of the aforementioned based on its training data, not actually evaluating the logic or the truthiness of the output.
If you don’t believe me, ask your ”reasoning” LLM this question: What’s the name of the paternal great-great-grandfather of the son of Jacob’s son’s son’s son?
Or is there a more subtle issue which prevents this or makes it hard?
Is there something fundamentally impossible about having a model detect that counting the Rs in 'strawberry' is a string-search operation and, in some sandbox, execute something like:
% echo "strawberry" | tr -dc "r" | wc -c
3
It seems agents do this already, but regular GPT-style environments seem to lack it? Anyway, let me refresh my page, as I am sure that while typing this some new model architecture is dropping. ;)
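For illustration, a minimal sketch of that kind of tool call; run_in_sandbox and the routing are hypothetical, not any particular agent framework's API:

    import subprocess

    def run_in_sandbox(cmd: str) -> str:
        # Stand-in for a real sandbox: here it just runs the shell command locally.
        return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout.strip()

    def count_letter(word: str, letter: str) -> int:
        # Same pipeline as above, with printf to avoid echo's trailing newline.
        return int(run_in_sandbox(f'printf %s "{word}" | tr -dc "{letter}" | wc -c'))

    print(count_letter("strawberry", "r"))  # 3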
Start with Jacob.
Jacob’s son → call him A.
A’s son → call him B.
B’s son → call him C.
C’s son → call him D (this is “the son of Jacob’s son’s son’s son”).
Now the question asks for the paternal great-great-grandfather of D:
D’s father → C
D’s grandfather → B
D’s great-grandfather → A
D’s great-great-grandfather → Jacob
Answer: Jacob
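The same walk as a code sketch (the intermediate names are placeholders):

    parent = {"A": "Jacob", "B": "A", "C": "B", "D": "C"}  # child -> father

    def paternal_ancestor(person, generations):
        # Walk up the father chain the requested number of generations.
        for _ in range(generations):
            person = parent[person]
        return person

    # D is the son of Jacob's son's son's son; his great-great-grandfather is 4 steps up.
    print(paternal_ancestor("D", 4))  # Jacob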
Also, next time you should at least bother to copy-paste your question into any recent LLM, since they can all solve it without issue. But hallucinations like this are common with non-reasoning HN users.
Don’t think so. Humans solve that puzzle in a very different way than LLMs ”reason” about it.
(DeepThink did wonder if it was supposed to be him afterwards or if it was a trick.)
Adding a second question like ”Is Abraham included in the family tree?” still makes it regress into mentioning Isaac, Judah, Joseph, 12 sons and whatnot.