Code is text. LLMs are text input/output machines.
Game input/output is not at all text.
LLMs can certainly reason about games with a simple/explicit enough domain (try a Risk tournament where models can talk to each other between turns!).
I have yet to see any sort of harness that lets a frontier LLM interact with a text adventure and make meaningful progress on its own.
ARC-AGI-3 shows this: https://arcprize.org/arc-agi/3
I've done some work as well on Rogue (sorry for self-promotion): https://iwhalen.github.io/rogue-bench/
Games are a bunch of tasks too.
So if they fail at game tasks, maybe it’s a bad idea to advertise those LLMs as task-doing assistants.
Unfortunately it doesn't seem to fit in some people's context because it was a few years ago.
Kind reminder: there is "AI" beyond LLMs.
I was really proud of it at the time: I wrote all of the NN code from scratch, so I had to do a decent amount of reading and research, and I wanted to add some more advanced algorithm optimisations that I hadn't done in previous projects.
I suspect a coding agent could spit the entire project out in 20 minutes now, but it was very cool at the time to build a game then watch my computer learn how to play it in real time.
[1]: https://www.lowimpactfruit.com/p/zork-bench-an-llm-reasoning...
[2]: https://entropicthoughts.com/evaluating-llms-playing-text-ad...
For a more recent test, see https://kenforthewin.github.io/blog/posts/nethack-agent/ .
This is a tall order for an LLM: it needs a lot of context but most of the context will be just noise.
Unless the goal was to test how well large language models translate solutions in prose into actionable keyboard inputs, which is pretty interesting in itself.
I remember "Baba is Eval" (https://fi-le.net/baba/), released 11 months ago, back when Claude Opus 4 was the strongest model. Back then, I was surprised by how poorly it did, even on the first level.
I am happy to see another approach - and indeed, with much stronger results.
While I did implement a more comprehensive harness with pathfinding tools etc., the models themselves have improved significantly.
Personally, I think this is a really hard problem, and it may turn out to be one of the first big walls we hit on the road to AGI.
I admit I know nothing about this though.
Not many industries, except perhaps writing, have had that advantage; in many ways coding is one of the best-case scenarios for LLMs.
Something like snake or tic-tac-toe is straightforward.
Choose another “AI” technology and give it another go.
Imagine if you could bring those AI players to CS 1.6.
If you want to implement actual bots inside the game, then you want to use explicit logic instead of inferred logic. It's much more efficient and easier to debug.
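To make the "explicit logic" point concrete, here is a minimal sketch of a hand-written bot decision function. The `Observation` fields, the thresholds, and the action names are all made up for illustration; they do not come from any real game or engine.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical per-tick view of the world that the bot receives."""
    enemy_visible: bool
    health: int
    ammo: int


def choose_action(obs: Observation) -> str:
    """Pick an action from explicit priority rules, checked top to bottom."""
    if obs.health < 30:
        # Survival comes first, whatever else is going on.
        return "retreat"
    if obs.enemy_visible and obs.ammo > 0:
        return "attack"
    if obs.ammo == 0:
        return "reload"
    return "patrol"


print(choose_action(Observation(enemy_visible=True, health=100, ammo=10)))
```

Every decision here can be traced to a single `if`, which is the debugging advantage over inferred behaviour: when the bot does something odd, you can see exactly which rule fired.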
If you want to create bots for an existing game that doesn't have its own pre-programmed bots, then you should look at other types of AI. See https://www.geeksforgeeks.org/deep-learning/reinforcement-le...
From what I saw, even if you frame advance every single frame, they still don't seem to grasp the concept of "I need to hold down this button for a few frames until x happens"...
There's no concept of time, just a never-ending state machine that's constantly changing state.
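For what it's worth, "hold this button for a few frames until x happens" is trivial to express as an explicit loop in a harness. Below is a minimal sketch, assuming a hypothetical `step(button)` call that advances the game by one frame and returns the new state. `ToyGame` is a stand-in for illustration, not a real emulator API.

```python
from typing import Any, Callable


def hold_until(
    step: Callable[[str], Any],
    button: str,
    done: Callable[[Any], bool],
    max_frames: int = 600,
) -> int:
    """Press `button` on every frame until `done(state)` is true.

    Returns the number of frames the button was held. `max_frames` is a
    safety cap so the loop ends even if the condition is never met.
    """
    frames_held = 0
    while frames_held < max_frames:
        state = step(button)  # advance exactly one frame with the button down
        frames_held += 1
        if done(state):
            break
    return frames_held


class ToyGame:
    """Stand-in game: holding RIGHT moves the player 2 units per frame."""

    def __init__(self) -> None:
        self.x = 0

    def step(self, button: str) -> int:
        if button == "RIGHT":
            self.x += 2
        return self.x


game = ToyGame()
frames = hold_until(game.step, "RIGHT", done=lambda x: x >= 10)
print(frames, game.x)
```

The hard part described above is not this loop. It is getting the model to decide that it wants a loop like this, rather than re-deciding from scratch on every frame.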
> Togelius: It’s super weird.
...No, it's really not.
They're language models. Code is a language. "Playing a game well" is not. One can, hypothetically, encode game inputs in such a way that it seems kinda-sorta like a language, but it has none of the same kinds of structures that languages—both human and programming—do.
The only way one can think this is strange is if one thinks of LLMs' ability to code rudimentary games as being due to a deeper understanding of games, rather than due to game code being well-represented in their training data.
If LLMs get better but do not progress at playing games when not specifically trained on them, that seems to point to a generalisation failure, a limitation that would prevent LLMs from ever achieving AGI. I do not know if that is weird, but it seems that for now nobody really knows whether they can achieve AGI or not. Perhaps some emergent behavior will arise after more scaling.
To me it's only totally unsurprising if you are 100% certain that LLMs will never reach AGI (as LeCun thinks, for example).