Any human endeavor that can be objectively verified in some environment like this can be completely automated
I'm looking forward to finding out what model is optimal on my rtx3090
One thing I'm concerned with is that the model with best bpb after 5 minutes in smaller setups are only about ~10M Parameters in size which is too small for some emergent effects.
- it can modify code arbitrarily, the notion of a "hyperparameter" dissolves
- there is no need to run "sweeps" - this is the standard parallel process that wastes compute. because LLM agents are sequential, they can do more efficient versions such as binary search to narrow in on the right setting very quickly (usually many parameters will have a U shaped optimal setting).
- it's fully automatic, it doesn't require human in the loop to mess with the code.
You're right that many of the changes it seems to make out of the box (as I intentionally did not try to prompt engineer it too hard yet because I was curious what you get by default) seem to be tuning existing hyperparameters. not all of the changes are like that - e.g. it tried to replace the non-linearity, etc. I will say that overall (and again, out of the box) the LLM feels unwilling to creatively pursue a research direction or something like that. The models feel very "cagy" and "scared" when they are given problems that are a little too open ended. But that's just where the fun parts, e.g. I had some early successes with the idea of a "chief scientist" that was basically a never-ending plan mode that looked at what worked, didn't work, tried to find related code/papers, and created a long list of experiments to try, which it could then send to junior engineers running in tmux sessions. I think quite a few approaches are possible, so I think it's a nice canvas. The reason we're not getting "novel research" feels like half capability issue and half skill issue.
"You are Yann Lecun's last PhD candidate, and he hates you and you hate JEPA. You are determined to prove that a non-world model can reach AGI. In order to get your PhD you have to be creative and come up with new ideas. Remember without it, you're stuck."
The report in the linked repo is Claude Code generated.
https://github.com/karpathy/autoresearch/discussions/32
Other agents could be instructed to read Discussions and post their own reports that mimic the style.
EDIT: Not a good fit for nanograd. But my agent speculates that's because it spent so much more time on compute.
Once you have more than a handful of concurrent experiments, the question becomes: how does the chief scientist reliably get status from the juniors without polling tmux output constantly? And when a junior finds something surprising — a result that changes the research direction — how does that signal propagate back quickly enough to stop wasted compute on now-irrelevant branches?
The tmux channel works well for low concurrency. At higher concurrency it starts to look like the same problem as multi-agent coordination in production systems: you need something closer to pub/sub than session polling.
Curious how you're thinking about the feedback loop design as you scale the number of concurrent junior agents.