It has now become fashionable to claim much, and furnish little.
It has also become fashionable to fail to understand, or at least state, the core of your proposal in as few words as possible: instead of "a genetic algorithm applied to the space of harnesses, parallelized by our infrastructure," we get "Three swaps. Same orchestrator. Same dashboard. The wiring is the thing."
We're cooked, chat.
It really shines through in pieces like this that LLMs have a severely constrained worldview and underdeveloped theory of mind. They can't imagine that a line like "A 200-line POC that goes from 0/5 to 5/5 in four proposer steps" means nothing to me as a subtitle for the page. After all, "proposer steps" and "5/5" are *right there* in its context. Surely everyone has "proposer steps" in their context, right?
Have to dig into the code, but it looks like they have sound engineering around a "self-improving" agentic coding harness. Will be fun to take the code for a spin.
> Is the word "racecar" a palindrome? Answer with exactly one lowercase word: "yes" or "no". Print only the answer.
It took a few tries to figure out what to test in the first place, since it is not obvious what the workflow should improve (prompt? guided agent ability?).
So the only meaningful test I ended up with was giving easy tasks with a deliberately misleading or incomplete prompt, then checking whether persisting deltas and observations between successive prompts meaningfully improves the meta-agent's ability to correct the imprecise prompt (what I mean by "prompt-repair discipline and audit trail") [2].
From a couple more experiments (summarized in [2]), I found that the meta-agent has no real effect on how well the guided agents perform; it simply gets better at repairing imprecise prompts.
My conclusion is that this method works to improve bad prompts; I didn't demonstrate that it improves guided agent capabilities. That said, I think it's better to work on your prompts before handing them to agents than to give bad prompts and iterate on them with a meta-agent.
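To make the setup concrete, here is a minimal sketch of what I mean by persisting deltas and observations as an audit trail. The names (PromptAttempt, build_repair_context) are mine for illustration, not from either repo:

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

AUDIT_LOG = Path("audit_log.jsonl")

@dataclass
class PromptAttempt:
    prompt: str       # the (possibly imprecise) prompt given to the guided agent
    delta: str        # what the meta-agent changed relative to the last attempt
    observation: str  # test output / failure mode observed after the run

def record(attempt: PromptAttempt) -> None:
    """Append one attempt to the persistent audit trail."""
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(asdict(attempt)) + "\n")

def history() -> list[PromptAttempt]:
    """Load prior attempts so the meta-agent sees deltas, not just the last prompt."""
    if not AUDIT_LOG.exists():
        return []
    return [PromptAttempt(**json.loads(line)) for line in AUDIT_LOG.read_text().splitlines()]

def build_repair_context(task: str) -> str:
    """Assemble the meta-agent's context: task plus full delta/observation history."""
    lines = [f"Task: {task}", "Previous attempts:"]
    for i, a in enumerate(history()):
        lines.append(f"{i}: prompt={a.prompt!r} delta={a.delta!r} observed={a.observation!r}")
    return "\n".join(lines)
```

The point of the discipline is that the repair step conditions on the whole history of changes and outcomes, not only on the latest failing prompt.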
[0]: https://github.com/ouatu-ro/skill-distillery/blob/main/skill...
[1]: https://github.com/zozo123/meta-harness-on-islo
[2]: https://github.com/ouatu-ro/skill-distillery/blob/main/repor...
One of my own insights here is that you need to collect not just execution traces but also all the human-in-the-loop nudges and steering commands. Seen in context, they are one of the purest sources of feedback on coding agents.
I agree with OP on the need to collect traces and compare them, not just scores; they are a much richer source of feedback.
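A rough sketch of what logging nudges in context could look like (illustrative only; this is not the plugin's actual API):

```python
import json
import time
from pathlib import Path

TRACE = Path("session_trace.jsonl")

def log_event(kind: str, payload: dict) -> None:
    """Append a timestamped event; kind is 'agent_step' or 'human_nudge'."""
    event = {"t": time.time(), "kind": kind, **payload}
    with TRACE.open("a") as f:
        f.write(json.dumps(event) + "\n")

# Agent actions and human corrections land in one ordered stream:
log_event("agent_step", {"tool": "edit_file", "args": {"path": "app.py"}})
log_event("human_nudge", {"text": "stop, you're editing the wrong module"})

def nudges_with_context(window: int = 3) -> list[dict]:
    """Pair each human nudge with the agent steps immediately preceding it."""
    events = [json.loads(line) for line in TRACE.read_text().splitlines()]
    return [
        {"nudge": e, "context": events[max(0, i - window):i]}
        for i, e in enumerate(events)
        if e["kind"] == "human_nudge"
    ]
```

Keeping both streams in one log is what makes the nudge interpretable: you can see exactly which agent steps provoked the correction.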
If anyone is interested I have a slide deck about my approach: https://horiacristescu.github.io/claude-playbook-plugin/docs...
How does this go above and beyond this straightforward open-source, open-weights, and relatively cheap setup? Do you just get more tokens from SOTA models? Can anyone rationally say the products of token production are high-quality and secure?
However, the problem with self-modification is the tendency towards inoperable states. Does it automatically revert when a detrimental state is reached? How does it determine that a modification worked?
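One plausible answer, sketched under the assumption of a git-backed workspace and a fixed benchmark (run_benchmark is a stand-in for whatever eval the harness actually uses; none of this is from the posted code): checkpoint, modify, re-benchmark, and hard-revert on regression.

```python
import subprocess

def run_benchmark() -> float:
    # Stand-in eval: a crude pass/fail signal from a test suite.
    # A real harness would score a fixed task set instead.
    result = subprocess.run(["pytest", "-q"], capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0

def guarded_self_modify(apply_patch) -> bool:
    """Apply a self-modification only if it does not regress the benchmark."""
    baseline = run_benchmark()
    # Checkpoint the current working state before touching anything.
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "--allow-empty",
                    "-m", "checkpoint before self-modification"], check=True)
    apply_patch()  # the harness rewrites its own prompts/tools/config here
    score = run_benchmark()
    if score < baseline:
        # Detrimental state detected: hard-revert to the checkpoint.
        subprocess.run(["git", "reset", "--hard", "HEAD"], check=True)
        return False
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit",
                    "-m", f"keep self-modification (score {score:.2f})"], check=True)
    return True
```

This only answers "did it regress on the benchmark," of course; inoperable states that the benchmark doesn't exercise would still slip through.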