>Moreover, we follow previous work in accelerating block breaking because learning to hold a button for hundreds of consecutive steps would be infeasible for stochastic policies, allowing us to focus on the essential challenges inherent in Minecraft.
Mostly I'm tired of RL work being oversold by its authors and proponents by anthropomorphizing its behaviors. All while this "agent" cannot reliably learn to hold down a button, literally the most basic interaction of the game.
While it's possible to bake in this particular inductive bias (repetitive actions might be useful), they decided not to (it's just not that interesting).
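For what it's worth, that kind of bias is easy to express as an environment wrapper. A minimal sketch, assuming a Gymnasium-style API (illustrative only; not what the paper actually does, which accelerates block breaking instead):

```python
# Hypothetical sketch of the "repetitive actions might be useful" bias:
# an action-repeat wrapper, so one policy decision is held for several
# environment steps instead of being re-sampled every frame.
import gymnasium as gym

class ActionRepeat(gym.Wrapper):
    def __init__(self, env, repeat=8):
        super().__init__(env)
        self.repeat = repeat

    def step(self, action):
        total_reward, terminated, truncated = 0.0, False, False
        for _ in range(self.repeat):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        return obs, total_reward, terminated, truncated, info
```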
[1] And you certainly can't reproduce the observation selection effect in a laboratory. That is the thing that makes it possible to overcome the "no free lunch" theorem: our existence and intelligence are conditional on evolution being possible and finding the right biases.
We have to bake in inductive biases to get results. We have to incentivize behaviors useful (or interesting) to us to get useful results instead of generic exploration.
Its actions are not muscular, they are literal gameplay actions. It is orders of magnitude easier to learn that the same action should be performed until completion, than that the finger should be pressed against a surface while the hand is stabilized with respect to the cursor on a screen.
One of the most interesting (and pathological) things about humans is that we learn what is rewarding. Not how to get a reward, but actually we train ourselves to be rewarded by doing difficult/novel/funny/etc things. Notably this is communicated largely by being social, i.e., we feel reward for doing something difficult because other people are impressed by that.
In Cast Away, Hanks' only companion is a mute, deflated ball, but nonetheless he must keep that relationship alive---to keep himself alive. The climax of the movie is when Hanks returns home and people are so impressed that his efforts are validated.
Contrast that with RL: there is no intrinsic motivation. The agents do not play, or meaningfully explore, really. The extent of their exploration is a nervous tic that makes them press the wrong button with probability ε. The reason the agent cannot hold down buttons is that it explores by having Parkinson's disease, by accident, not because it thought it might find out something useful/novel/funny/etc. In fact, it can't even have a definition of those words, because they are defined in the space between beings.
[1] definition: https://en.wikipedia.org/wiki/Small-world_network
[2] evidence for our brains: https://www.semanticscholar.org/paper/Small-world-directed-n...
As others have already pointed out, it's not an intrinsic limitation of RL agents.
> In fact, it can't even have a definition of those words, because they are defined in the space between beings.
In fact, an agent doesn't need to know definitions to act. A bacterium doesn't know what it means to reproduce, but it reproduces anyway.
https://pathak22.github.io/noreward-rl/
and the follow-up which addresses the noise unpredictability problem.
There are more after that which I believe fail the black pill and miss the point of ML, ASIC-ifying the architecture with human priors. But the broader point is to show that RL is not just discovering solutions by chance through random actions. Nature starts with priors, and curiosity is one of the universal policy-bootstrapping techniques (others might be imitation, next-state prediction, total nearby replication count).
There is also a paper that deployed ICM on a physical robot, and it just played with a ball because it was the only source of novel stimuli, inadvertently learning how to operate its arms. There was no other reward in the environment except for curiosity. It is amazing, and slightly creepy. I think ICM will be rediscovered later in ML tech.
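For anyone unfamiliar, the core of ICM is small enough to sketch: the intrinsic reward is the forward model's error at predicting the next state's features. Roughly, assuming PyTorch (the encoder, layer sizes, and scaling are placeholders, not the authors' code):

```python
# Sketch of an ICM-style curiosity reward: states the forward model
# cannot yet predict are treated as "interesting" and rewarded.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    def __init__(self, feat_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, phi_s, action):
        # Predict the features of the next state from current features + action.
        return self.net(torch.cat([phi_s, action], dim=-1))

def curiosity_reward(forward_model, phi_s, action, phi_next, scale=0.01):
    # Prediction error in feature space is the intrinsic reward.
    with torch.no_grad():
        pred = forward_model(phi_s, action)
        return scale * (pred - phi_next).pow(2).mean(dim=-1)
```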
What's interesting to me about this is that the problem seems really aligned with the research they are doing. From what I can tell, they built a system where the agent has a simplified "mental" model of the game world, which it uses to predict actions that will lead to better rewards.
I don't think what's missing here is teaching the model that it should just try to do things a lot until they succeed. Instead, what I think is missing is the context that it's playing a game, and what that means.
For example, any human player who sits down to play minecraft is likely to hold down the button to mine something. Younger children might also hold the jump button down and jump around aimlessly, but older children and adults probably wouldn't. Why? I suspect it's because people with experience in video games have set expectations for how game designers communicate the gameplay experience. We understand that clicking on things to interact with them is a common mode of interaction, and we expect that games have upgrade mechanics that will let us work faster or interact with high level items. It's not that we repeat any action arbitrarily to see that it pays off, but rather that we're speaking a language of games and modeling the mind of the game designers and anticipating what they expect from us.
I would think that trying to expand the model of the world to include this notion of the language of games might be a better approach to overcoming the limitation instead of just hard-coding the model to try things over and over again to see if there's a payoff.
“AlphaZero was trained solely via self-play using 5,000 first-generation TPUs to generate the games and 64 second-generation TPUs to train the neural networks, all in parallel, with no access to opening books or endgame tables. After four hours of training, DeepMind estimated AlphaZero was playing chess at a higher Elo rating than Stockfish 8; after nine hours of training, the algorithm defeated Stockfish 8 in a time-controlled 100-game tournament (28 wins, 0 losses, and 72 draws).” [emphasis added] https://en.wikipedia.org/wiki/AlphaZero
Nevertheless, the theorem basically states that there are games where AlphaZero will be beaten by another algorithm. Even if those games are nonsensical from our point of view.
Of course, there are "games" like "invert sha-512" that can be implemented in our world but are probably impractical to learn. But NFL has nothing to say about them; a game that simple has zero measure in a uniform distribution over problems.
Haha, I wouldn’t feel bad. It’s one of the most misunderstood theorems, and I don’t think I’ve ever seen it invoked correctly on a message board.
If you code a reward function for each step necessary to get a diamond, you are teaching the AI how to do it. There is no other way to look at it. It's extremely unethical to claim, as Nature does, that it did this without "being taught", and it is in my eyes academic malpractice to claim, as their paper does, that it did this "without human data or curricula", though this is mitigated by the reality that they admit it in the paper, if that is indeed the case; I am still digesting the paper, as it is quite technical.
This isn't an LLM, I'm aware of this, but I am at the point where if I could bet on the following statement being true, I'd go in at five figures: Every major AI benchmark, advancement, or similar accomplishment in the past two years can almost entirely be explained by polluted training data. These systems are not nearly as autonomously intelligent as anyone making money on them says they are.
Really? Minecraft's gameplay dynamics are not particularly complex... The AI here isn't learning highly complex rules about the nuances of human interaction or learning to detect the relatively subtle differences between various four-legged creatures based on small differences in body morphology. In those cases I could see how millions of years of evolution is important to at least give us and other animals a head start when entering the world. If the AI had to do something like this to progress in Minecraft then I'd get why learning those complexities would be skipped over.
But in this case a human would quickly understand that holding a button creates a state which tapping a button does not, and therefore would assume this state could be useful for exploring further states. Identifying this doesn't seem particularly complex to me. If the argument is that it will take slightly longer for an AI to learn patterns in dependent states then okay, sure, but arguing that learning that holding a button creates a new state is such a complex problem that we couldn't possibly expect an AI to learn it from scratch within a short timeframe is a very weak argument. It's just not that complex. To me this suggests that current algorithms are lacking.
Until then, it is a question of whether we can capture the appearance of agency in the training set well enough to learn it through training, without depending on interactions to learn more.
Tasks such as:
- recognizing objects in our surroundings,
- speaking,
- reasoning about other people's thoughts and feelings,
- playing go?
All of those were at some point "easy for us but very hard for computer programs".
I'd argue that if you consider the size of the input and output space here, it's not as complex as you're implying.
To refer back to my example, telling the difference between four-legged creatures is complicated because there's a huge number of possible outputs and the visual input space is both large and complex. Learning how to detect patterns in raw image data is complicated, which is why we and other animals are preloaded with the neurological structures to do this. It's also why we often use pretrained models when training models to label new outputs – simply learning how to detect simple patterns in visual data is difficult enough, so if this step can be skipped it often makes sense to skip it.
In contrast, the inputs to Minecraft are relatively simple – you have a handful of buttons which can be pressed, and those buttons can be pressed for different durations. Similarly, the output space here, while large, is relatively simple, and presumably detecting that an action like holding a button results in a state change shouldn't be that complex to learn... I mean, it's already learning that pressing a button results in a state change, so I think you'd need to explain to me why adding a tiny bit of additional complexity here is so unreasonable. Maybe I'm missing something.
As far as I understand, DreamerV3 doesn't employ intrinsic rewards (as in novelty-based exploration). It adopts stochastic exploration, which makes it practically impossible to get to rewards that require consistently repeating an action with no intermediate rewards.
And finding intrinsic rewards that work well across diverse domains is a complex problem in itself.
I think you underestimate the complexity of going from 12288+400 changing numbers to a concept of gameplay dynamics in the first place. Or, in other words, your complexity prior is biased by experience.
But for an algorithm learning from scratch in Minecraft, it's more like having to guess the cheat code for a helicopter in GTA, it's not something you'd stumble upon unless you have prior knowledge/experience.
Obviously, pretraining world models for common-sense knowledge is another important research frontier, but that's for another paper.
I suppose whether you find this result intriguing or not depends on if you’re looking to build result-building planning agents over an indeterminate (and sizable!) time horizon, in which case this is a SOTA improvement and moderately cool, or if you’re looking for a god in the machine, which this is not.
When RL works, it really works.
The only alternative I have seen is deep networks with MCTS, and they ramp up quickly to decent quality. But they hit caps relatively quickly.
> In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.
> “What are you doing?”, asked Minsky.
> “I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied.
> “Why is the net wired randomly?”, asked Minsky.
> “I do not want it to have any preconceptions of how to play”, Sussman said.
> Minsky then shut his eyes.
> “Why do you close your eyes?”, Sussman asked his teacher.
> “So that the room will be empty.”
> At that moment, Sussman was enlightened.
I have a running bet with a friend that humans encode deterministic operations in neural networks too, while he thinks there has to be another process at play. But there might be something extra helping our neural networks learn the strong weights required for it. Or the answer is again: "more data".
This is fine, and does not impact the importance of figuring out the steps.
As anybody who has done tuning on systems that run at different speeds knows, adjusting for the speed difference is just engineering, and it allows you to get on with more important/inventive work.
In Minecraft, the team used a protocol that gave Dreamer a ‘plus one’ reward every time it completed one of 12 progressive steps involved in diamond collection — including creating planks and a furnace, mining iron and forging an iron pickaxe.
This is also why I think the title of the article is slightly misleading.
Can you look at the world model, like you can look at Waymo's world model? Or is it hidden inside weights?
Machine learning with world models is very interesting, and the people doing it don't seem to say much about what the models look like. The Google manipulation work talks endlessly about the natural language user interface, but when they get to motion planning, they don't say much.
Here's the "full sized" image set.[1] The world model is low-rez images. That makes sense. Ask for too much detail and detail will be invented, which is not helpful.
[1] https://media.springernature.com/full/springer-static/image/...
How exactly does your machine learning model work?
This world model stuff is only possible in environments that are sandboxed, i.e. where you can represent the state of the world and have a way of producing the next state given a current state and action. Things like Atari games, robot simulations, etc.
I imagine it's the latter, and in general, we're already dealing with plenty of models with world models hidden inside their weights. That's why I'm happy to see the direction Anthropic has been taking with their interpretability research over the years.
Their papers, as well as most discussions around them, focus on issues of alignment/control, safety, and generally killing the "stochastic parrot" meme and keeping it dead - but I think it'll be even more interesting to see attempts at mapping how those large models structure their world models. I believe there's scientific and philosophical discoveries to be made in answering why these structures look the way they do.
This was clearly the goal of the "Biology of LLMs" (and ancillary) paper but I am not convinced.
They used a 'replacement model' that by their own admission could match the output of the LLM ~50% of the time, and the attribution of cognition related labels to the model hinges entirely on the interpretation of the 'activations' seen in the replacement model.
So they created a much simpler model, that sorta kinda can do what the LLM can do in some instances, contrived some examples, observed the replacement model and labeled what it was doing very liberally.
Machine learning and the mathematics involved is quite interesting but I don't see the need to attribute neuroscience/psychology related terms to them. They are fascinating in their own terms and modelling language can clearly be quite powerful.
But thinking that they can follow instructions and reason is the source of much misdirection. The limits of this approach should make clear that feeding text to a text-continuation program should not lead to parsing the generated text for commands and running those commands, because the tokens the model outputs are just statistically linked to the tokens fed into it. And as the model takes more tokens from the wild, it can easily lead to situations that are very clearly an enormous risk. Pushing the idea that they are reasoning about the input is driving all sorts of applications that, if we saw them as statistical text-continuation programs, would clearly be a glaring risk.
Machine learning and LLMs are interesting technology that should be investigated and developed. Reasoning by induction that they are capable of more than modelling language is bad science and drives bad engineering.
If you watch a guide on how to find diamonds it's really just a matter of getting an iron pickaxe, digging to the right depth and strip mining until you find some.
It gets a sparse reward of +1 for each of the 12 items that lead to the diamond, so there is a lot it needs to discover by itself. Fig. 5 in the paper shows the progression: https://www.nature.com/articles/s41586-025-08744-2
It only gets a +1 for the first iron pickaxe it makes in each world (same for all other items), so it can't hack rewards by repeating a milestone.
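In pseudocode, the reward scheme described above amounts to roughly this (a sketch only; the item names come from the milestone list quoted elsewhere in the thread, and the inventory interface is made up):

```python
# Sketch of the sparse milestone reward: +1 the first time each of the
# 12 items is obtained in a given world, and nothing afterwards.
MILESTONES = ["log", "plank", "stick", "crafting_table", "wooden_pickaxe",
              "cobblestone", "stone_pickaxe", "iron_ore", "furnace",
              "iron_ingot", "iron_pickaxe", "diamond"]

class MilestoneReward:
    def __init__(self):
        self.collected = set()  # reset when a new world starts

    def __call__(self, inventory):
        # `inventory` maps item name -> count in the agent's current inventory.
        reward = 0.0
        for item in MILESTONES:
            if item not in self.collected and inventory.get(item, 0) > 0:
                self.collected.add(item)
                reward += 1.0
        return reward
```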
Yeah it's surprising that it works from such sparse rewards. I think imagining a lot of scenarios in parallel using the world model does some of the heavy lifting here.
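Roughly, that "imagination" step means rolling the policy forward inside the learned latent dynamics instead of the real game. A toy sketch with hypothetical stand-ins for the learned components (not the actual DreamerV3 code):

```python
# Toy sketch of Dreamer-style imagination: many latent trajectories are
# rolled out in parallel inside the world model and scored by a learned
# reward predictor, so the policy can be improved without touching the game.
import torch

def imagine_rollouts(dynamics, policy, reward_head, start_latents, horizon=15):
    z = start_latents            # batch of latent states, one per trajectory
    latents, rewards = [z], []
    for _ in range(horizon):
        action = policy(z).sample()   # policy returns an action distribution
        z = dynamics(z, action)       # predicted next latent state
        rewards.append(reward_head(z))  # predicted reward for that state
        latents.append(z)
    return torch.stack(latents), torch.stack(rewards)
```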
This is such gold. Thanks for sharing. Immediately added to my notes.
Thanks for your hard work.
> log, plank, stick, crafting table, wooden pickaxe, cobblestone, stone pickaxe, iron ore, furnace, iron ingot, iron pickaxe and diamond
"This was given to the AI by having it watch a video that explains it."
Just a few months ago, this was not as trivial as it may now seem...
It didn't watch 'a video'; it watched many, many hours of video of people playing Minecraft (with another specialised model feeding in predictions of keyboard and mouse inputs from the video). It's still a neat trick, but it's far from the implied one-shot learning.
The YouTube documentary is actually very detailed about how they implemented everything.
That is not what the article says. It says that was separate, previous research.
Have you gotten used to some ai watching a video and 'getting it' so fast that this is boring? Unimpressive?
I do wonder though: if you swapped Minecraft for a cloud-based synthetic world with similar physics but messier signals, like object permanence or social reasoning, would Dreamer still hold up? Or is it just really good at the kind of clean reward hierarchies that games offer?
Mining diamonds isn't even necessary if you build, e.g., ianxofour's iron farm on day one and trade that iron[0] with a toolsmith, armourer, and weaponsmith. You can get full diamond armour, tools, and weapons pretty quickly (probably a handful of game weeks?)
[0] Main faff here is getting them off their base trade level.
Really though, if AI wants to impress me it needs to collect an assortment of materials and build a decent looking base. Play the way humans usually play.
>> In Minecraft, the team used a protocol that gave Dreamer a ‘plus one’ reward every time it completed one of 12 progressive steps involved in diamond collection — including creating planks and a furnace, mining iron and forging an iron pickaxe.
And that is why it is never going to work in the real world: games have clear objectives with obvious rewards. The real world, not so much.
For a more general system, you can annotate videos with text descriptions of all the tasks that have been accomplished and when, then train a reward model on those to later RL against (rough sketch after the examples below).
Take children to school -> they safely arrive on time.
Autonomous driving -> arrive at destination without crashing.
Call centre -> customers are happy.
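A crude sketch of that annotate-then-train-a-reward-model idea (every name here is hypothetical; in practice the video/text encoders and labels would come from the annotation pipeline):

```python
# Hypothetical sketch: learn a reward model from annotated video segments,
# then use its score as the RL reward signal.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, video_dim, text_dim):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(video_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, video_feat, task_feat):
        # Probability that the described task was accomplished in this segment.
        return torch.sigmoid(self.head(torch.cat([video_feat, task_feat], dim=-1)))

# Training: binary cross-entropy against human "task accomplished?" labels.
# RL time: reward = reward_model(encode(frames), encode("children arrive safely")).
```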
Or maybe there is some art to finding happiness in simple things like having folded clothes or surviving the commute?
I guess you can always find some well-specified, measurable goal/reward, but then that choice limits the performance of your model. It's fine when you're building a very specialized system; it gets more difficult the more general you're trying to be.
For a general system meant to operate in human environment, the goal ends up approaching "things that humans like". Case in point, that's what the overall LLM goal function is - continuations that make sense to humans, in fully-general meaning of that.
>> Take children to school -> they safely arrive on time.
>> Autonomous driving -> arrive at destination without crashing.
>> Call centre -> customers are happy.
Define a) "folded", b) "safely", c) "destination", d) "happy".
Also define the reward functions for each of the four objectives above.
Destination -> Like, close to the destination? I don't see how that's hard.
Happy -> you can use customer feedback for this
Folded -> this is indeed the trickiest one, but I think well within the capabilities of modern vision models.
Really? What about fires? Falling off cliffs? Causing others to crash?
Your "examples" are all hand-wavy and vague and no good to train an RL agent. You've also not provided a reward function.
Of course the delay is much bigger when working a job than in most RL games, but fundamentally it's the same problem.
I encourage you to read deepmind's work with robots.
>> Quantitatively, the QT-Opt approach succeeded in 96% of the grasp attempts across 700 trial grasps on previously unseen objects. Compared to our previous supervised-learning based grasping approach, which had a 78% success rate, our method reduced the error rate by more than a factor of five.
https://research.google/blog/scalable-deep-reinforcement-lea...
That was in 2018.
So what do you think, is vision-based robotic manipulation and grasping a solved problem, seven years later? Is QT-Opt now an established industry standard in training robots with RL?
Or was that just another project that was announced with great fanfare and hailed as a breakthrough that would surely lead to great increase of capabilities... only to pop, fizzle and disappear in obscurity without any real-world result, a few years later? Like most of DeepMind's RL projects do?
https://www.youtube.com/watch?v=x-exzZ-CIUw
It looks pretty awesome. Let's see what happens.
https://youtu.be/03p2CADwGF8?si=BXeWXqu1_3WMS4yy
A robot assembling a puzzle with machine vision!
And it's only from the 1970's.
But, like you say- let's wait and see. I always do the former but I'm still waiting for the latter.
Tell that to the people here who are trying to turn their startup ideas into money.
DeepSeek used RL to train R1, so that is clearly not true. But ignoring that, what is your alternative? Supervised learning? Good luck finding labels if you don’t even know what the objective is.
And why do I have to offer an alternative? If it's not working, it's not working, regardless of whether there's an alternative (that we know of) or not.
But I remember the alpha version, and NOBODY knew how to make a pickaxe. Humans were also very bad at figuring out these steps.
People were de-compiling the java and posting help guides on the internet.
How to break a tree, get sticks, make a wood pick. In Alpha, that was a big deal for humans also.
It seems rather tenuous to keep pounding on 'training via pixels' when really a game's 2D/3D output is an optical trick at best.
I understand Sergey Brin/et al had a grandiose goal for DeepMind via their Atari games challenge - but why not try alternate methods - say build/tweak games to be RL-friendly? (like MuJoCo but for games)
I don't see the pixel-based approach being as applicable to the practical real world as say when software divulges its direct, internal state to the agent instead of having to fake-render to a significantly larger buffer.
I understand Dreamer-like work is a great research area and one that will garner lots of citations for sure.
Because the ultimate goal (real-world visual intelligence) would make that impossible. There's no way to compute the "essential representation" of reality, the photons are all there is.
The visual cortex and plenty of other organs compress the data into useful, semantic information before feeding it into a 'neural' network.
Simply from an energy and transmission perspective, an animal would use up all its stores to process a single frame if we were to construct such an organism based on just 'feed pixels to a giant neural network'. Things like colors, memory, objects, recognition, faces etc. are all part of the equation, not some giant neural network that runs from raw photons hitting cones/rods.
So this isn't biomimicry or cellular automata - it's simply a fascination, similar to self-driving cars being able to drive with an image -> {neural network} -> left/right/accelerate simplification.
Also I believe the DeepMind StarCraft model used the compressed representation, but that was a while ago. So that was already kind of solved.
> simply a fascination similar to self-driving cars being able to drive with a image
Whether to use LiDAR is more of an engineering question of the cost/benefit of adding modalities. LiDAR has come down in price quite a bit, so skipping it looks less wise in retrospect.
- Millions of years of evolution (and hence things like walking/swimming/hunting are usually not acquired characteristics even within mammals)
- Memory - and I don't mean the neural network raw weights. I mean concepts/places/things/faces and so on that is already processed and labeled and ready to go.
- Also we don't know what we don't know - how do cephalopods/us differ in 'intelligence'?
I am not trying to poo-poo the Dreamer kind of work: I am just waiting for someone to release a game that actually uses RL as part of the core logic (Sony's GT Sophy comes close).
Such a thing would be so cool and would not (necessarily) use pixels as they are too far downstream from the direct internal state!
Adding more than 1 GPU didn't improve speed, but that's pretty standard as we don't have fancy interconnect. Bit annoying they didn't use TensorBoard for logging, but overall it seems like a pretty cool lib - will leave it running a few days and see if it can learn (no other algo has, so I don't have much hope).
https://www.youtube.com/@EmergentGarden
I very much like the comparative approach this guy takes looking at how different LLMs fare... including how they interact together. Worth a look.
If all you’re going to do is parrot things like human consciousness or human ingenuity, then I will never be impressed so long as it’s just parroting.
... but the first video only shows the player character digging downwards without using any tools and eventually dying in lava. What?
The tools are admittedly really hard to see in the videos because of the timelapse and MP4 struggles a bit on the low resolution, but they are there :)
Isn't something like finding diamonds in Minecraft something that old-school AI could already do decently?
to a bright future
where we focus 100% on work
and AI will play our games
/s
The message this sends is pretty clear: machines are better at thinking, humans are better at manual work. That is the natural division of labor that plays into strengths and weaknesses of both computers and human beings.
And so, I'm sorry to say this, but the near future is that in which computers play our games and do the thinking and creative work and management (and ultimately governance), because they're going to be better at this than us, leaving us to do all the physical labor, because that's one thing we will remain better at for a while.
That, or we move past the existing economic structures, so that we no longer need to worry about being competitive with AI labor.
/s, but only a little.
Lol that's crazy optimistic, what work ?
It's only hilarious because we're allowed to laugh. For now. Wait a few years; it's possible these things will demand respect.