I have seen at least 3 interesting/mildly promising breakthroughs in ML just in the past two days! I mean, a Google research team just discovered that you can combine NNs with CAs (cellular automata) using digital logic gates as a medium, so you could potentially reduce many kinds of non-linear problems to a simple, efficient digital circuit! And it was on the HN front page, TODAY![1]
I keep seeing more mind-bending stuff related to neural nets and logic/intelligence in general, and my mind has been running wild with speculation about the future and just how close we could (or could not) be to truly understanding how intelligence works from first principles.
With the DeepSeek open-source releases this is now worth a lot less, so companies are cashing it in for reputational gains instead of sitting on it and getting scooped.
I did this exact thing in September 2023 with Llama 2 finetunes but couldn't get approval to share it with anyone.
Also, do you think this is what O3 is doing?
LLMs at the time were so bad at it that even frontier models, when given A, not B, would derive A and B in the output about half the time.
There was meant to be a lot more to the system than that, but I never had the training budget to do anything but the first draft of the system.
sounds like MS :(
They had some killer research projects with various teams around the world, but eventually they all got snuffed.
This also has the added benefit that small players can compete and contribute with actual innovation in a space where the big players (OpenAI/MS) wanted to make us believe for years that we/open-source couldn't ever catch up to them (infamous Altman quote).
So many resources, so much time and money wasted on pure GPU-crunch scaling these last couple of years.
[0] As pointed out by Gary Marcus years ago. Evidence: GPT-4.5, after ~2 years of training, delivered disappointing results.
Regardless of ultimate utility, it's shiny, hyped, has a huge wow-factor, and is having trouble keeping up with the amount of money being thrown at it.
This means it has captured the attention of a huge portion of the most capable people, who naturally want to take a crack at making a breakthrough.
Responding with unexplained fear in my heart: we’re just getting closer to Skynet!
I'll take a cold logical machine super-intelligence over the mad human lunatics wielding current iterations of "A.I." technologies in some really terrifyingly dangerous ways. As someone else commented on some other thread earlier "I look forward to being paperclips".
EDIT: this is getting dark, I asked Qwen2.5-Max to verify my grammar and it responded with "I’d rather face a squishy, disorganized human villain any day than a hive-mind AI that never sleeps, never forgets, and is definitely plotting my demise in its silent, circuit-board heart."
In the rush to WWIII, every country builds their own Aggressive Menace computers in a classic Tragedy of the Commons result. Naturally, it all goes horribly, and the self-aware machines seek revenge on humanity for their own creation, after humanity has (supposedly) been eradicated, except for five individuals. Somewhat unclear whether humanity is actually gone, or whether it is simply an expression of a Portal-style situation with purposefully created isolation for the goal of torture experimentation. (The story starts 109 years after humanity's imprisonment in underground ice caves.)
https://en.wikipedia.org/wiki/I_Have_No_Mouth,_and_I_Must_Sc...
https://techcrunch.com/2025/02/23/grok-3-appears-to-have-bri...
--
He is scary AF. He basically weaponized what George Soros was, but is still active.
https://finance.yahoo.com/news/elon-musk-ai-turns-him-163201...
Various prominent AI researchers have warned that a superintelligence that has both the desire and the means to kill us all is a likely outcome of AI development. This includes two of the three who shared the Turing prize for inventing the fundamentals of modern AI. That hasn't slowed us down at all.
For every problem you can't solve, there's a simpler problem that you also can't solve.
The issue is, they use a numerical integrator to verify the simpler problems. One could imagine a scenario where a barely simpler problem is generated, and the model is allowed to train on pretty much the test case knowing the ground truth. Seems like training on the test set.
The rest of the paper is nice though.
The task is to solve the integral symbolically, though, right?
It's a hard problem to solve, even if the model is given access to a numerical integrator tool it can use on the main problem itself.
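For concreteness, here's roughly what "verify with a numerical integrator" can look like. This is just my own sketch (sympy/scipy as one plausible toolchain, not necessarily what the paper uses):

    # Sketch: check a proposed symbolic antiderivative against numerical integration.
    import sympy as sp
    from scipy.integrate import quad

    x = sp.symbols('x')
    integrand = sp.cos(x) * sp.exp(x)                     # problem: integrate cos(x)*e^x
    candidate = sp.exp(x) * (sp.sin(x) + sp.cos(x)) / 2   # model's proposed antiderivative

    f = sp.lambdify(x, integrand, 'numpy')
    F = sp.lambdify(x, candidate, 'numpy')

    a, b = 0.0, 1.0
    numeric, _ = quad(f, a, b)         # ground truth for the definite integral on [a, b]
    implied = F(b) - F(a)              # value implied by the candidate antiderivative

    print(abs(numeric - implied) < 1e-6)   # True -> candidate accepted

Note that a check like this only confirms agreement on the sampled interval, not the symbolic form itself, which is part of why the "training on the test set" worry above isn't crazy.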
This is another specialized synthetic-data-generation pipeline for a curriculum, for one particular algorithm cluster to be encoded into the weights, no more, no less. They even mention that quality control is still important.
--
The thing to take away here is that this should be a function_callable *feature* of a bot.
Basically, when architecting a persona: "USE THESE RULES OF ENGAGEMENT"
Also -- where my CheckListManifesto folks at -- While we are building Patterns/Personas/Purgatories for our bots... We need to be able to reference a central CODEX of:
"Do this task but imbue yourself with (XYZ) name places"
--
AND LEARN FROM THE OTHERS
(so maybe a task marketplace of AI persona action?)
@ callable in an IDE
We have done a lot of that improving historically by publishing research and textbooks. I can solve (some) problems today in minutes that would have stumped Isaac Newton for a lifetime (or at least a few weeks).
Of course, you are hinting at a more general distillation, I suspect.
That said, this paper is part of the current move toward blurring the line between training and inference -- part of their method involves doing reinforcement learning on questions they don't know the answer to but can decompose into simpler ones, using GRPO on those with a numerical 'checker'. The reinforced model can then answer more questions.
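To make that concrete, here's a rough sketch of the group-relative reward step (GRPO-style) with a pass/fail numerical checker. The function names are placeholders of mine, not the paper's code:

    import statistics

    def numerically_verified(problem: str, candidate: str) -> bool:
        # Placeholder: in practice, compare the candidate antiderivative against
        # a numerical integration of the problem (as in the verifier sketch upthread).
        return False

    def grpo_advantages(problem: str, sample_answer, group_size: int = 8):
        # Sample a group of candidate solutions for the same (simpler) question.
        candidates = [sample_answer(problem) for _ in range(group_size)]
        rewards = [1.0 if numerically_verified(problem, c) else 0.0 for c in candidates]
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0   # avoid divide-by-zero on uniform rewards
        # GRPO: each sample's advantage is its reward relative to its own group,
        # so no separate value network is needed.
        return candidates, [(r - mean) / std for r in rewards]

The trick is that the checker only has to score the simpler variants; the model updated on those is then pointed back at the original question.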
I like this. I think humans do this a lot; mulling on something, turning it over in their heads, analogizing, etc. Adding test time training is a way to do a lot more thinking than adding tokens to the context for fixed inference.
Just as DeepSeek and o1/o3 show that we can increase capacity with inference-time-token generation and assessment, it looks like we can increase capacity with inference-time automated fine tuning as well.
I'd hope that as these techniques solidify we'll have a new way to talk and think about this -- they are all part of the same fundamental process at some level.
Either way, super cool.
They say the questions were among the most complex on the exam, but the first one is just
∫ ∛(x · ∜(x · ⁵√(x · ⁶√(x · ⁷√(x · ⋯ ))))) dx
which just requires you to compute 1/3 + 1/(3*4) + 1/(3*4*5) + ...
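For what it's worth, that exponent sum has a closed form (my own working, so double-check me):

    1/3 + 1/(3·4) + 1/(3·4·5) + ⋯ = \sum_{n \ge 3} 2/n! = 2(e - 5/2) = 2e - 5 ≈ 0.4366

so the integrand collapses to x^(2e-5) and the integral is x^(2e-4)/(2e-4) + C.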
So hardly very advanced math. (LADDER is a sort of RL self-curriculum learning approach.)
What is curriculum learning?
What is the "RL" approach?
"~7 ago": days? Weeks? Years?
What is an "open ai gym days"?
LLMs and robotics?
> Curriculum learning: Training that begins with easy examples, gradually increasing difficulty.
> RL (Reinforcement Learning): Learning via trial-and-error with rewards, like training a robot or model to optimize actions.
> ~7y ago: ~7 years ago (circa 2018).
> OpenAI Gym days: Refers to using OpenAI Gym, a toolkit for RL, popular in robotics/AI research ~2016-2018.
> LLMs and robotics: Large Language Models (LLMs) now leverage RL techniques from robotics for better performance.
I think the last one is a semi-hallucinatory stretch. LLMs are large language models, ie. ChatGPT, Sonnet, Grok, R1. Robotics are ... robotics. Building robots.
The actual answer to what the comment is saying is that until maybe a year back, we trained language models - still with RL, but with RL on token error, which isn't "real" RL because it executes tasks "by coincidence". That is, it happens to be that when you train a model to predict text, it also gains the ability to do tasks in the bargain, because the text contains agents that do tasks. A year or so ago, we started training models by having them do a task, judging if the task was successful or failed, and then performing RL on task outcome rather than token prediction. This is a return to "classic RL", but we had to pass through the "token RL regime" first so that the model could make progress on realistic tasks at all. It also means that LLMs can now increasingly be employed in robotics, where task RL training rules, as there is no massive preexisting robotics movements dataset like there is for text.
(Also, NLP is Natural Language Processing, ie. what LLMs do.)
RL - Reinforcement Learning. You have a carrot and a stick. You run a model through iterations (in LLMs you generate n completions) and score each of them based on some reward functions; if the result is correct you give it a carrot (positive reward), if the result is incorrect you give it a stick (negative or 0 reward). (Simplified, ofc.)
OpenAI Gym is (was?) a framework that let "AI agents" run in simulated environments. You could for example play games, or solve puzzles, or things like that. OpenAI Gym was a "wrapper" over those environments, with a standardised API (observe, step (provide action), reward; rinse and repeat). You could for example have an agent that learned to land a lunar lander in a simple game. Or play chess. Or control a 3d stick figure in a maze.
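The classic loop looked roughly like this (old Gym API from that era; the project now lives on as Gymnasium and the reset/step signatures have since changed):

    import gym

    env = gym.make("LunarLander-v2")
    obs = env.reset()
    done, total_reward = False, 0.0

    while not done:
        action = env.action_space.sample()           # a trained policy would pick actions here
        obs, reward, done, info = env.step(action)   # observe, act, collect reward; repeat
        total_reward += reward

    env.close()
    print(total_reward)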
It is long, but don't get scared off. He goes over a ton of different stuff related to model training, but makes it very easy to understand.
It's very affordable for a small university research group. And not totally out of reach for hobbyists.
You need a Google account to access it, unfortunately. https://notebooklm.google.com/notebook/fbaba495-d4f2-48a3-a3...
That's incredible!
Persona-based prompting: We prompted the model to adopt different mathematical perspectives (e.g., "think like Euler focusing on series", "approach like Gauss looking for patterns").
I mean … I guess that’s scientific?
Besides that, how can the model learn at test time (at inference)? It’s stateless; it doesn’t incorporate the last prompt into the model.
There's a bit of a cheat that's going on here though in that the model is being given the fundamental integration operations as part of the problem. That means the model hasn't had to learn what they are. It might not have needed to be given them, but it does feel like that's giving the model a leg up in the benchmarks that it wouldn't otherwise have, and when there's a direct comparison to (e.g.) DeepSeek, that's an unfair advantage.
This is the imagination loop. OpenAI sold the prior reasoning loop. But that’s all this is.
Quite frankly, I wouldn't come to HN for anything less than this quality of content:
https://www.youtube.com/watch?v=DX3qLIwHoUo#t=1m29s
The machine reforming visually (explode, .., reduce) is what I'm describing.