I used autoresearch to improve my AGENTS.md, measured against real tasks (www.stet.sh)

7 points by bisonbear 4 hours ago | 3 comments

joshka an hour ago
If you look at the 95% CI on https://marginlab.ai/trackers/codex/ with N=50, it's still pretty huge (usually +/- 13-14%). I suspect it would be difficult to get a reasonable measure that numerically assesses whether an AGENTS.md is good. What you can observe, though, is whether the model paid attention to certain rules while editing, i.e. did the behavior you're steering away from or toward actually take place?
The hardest thing, I think, is judging whether your AGENTS.md is still good after each model release. OpenAI does release prompting guidance to help with this, however (and has added a skill to apply it to your prompts, IIRC).
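For a rough sense of where a figure like +/- 13-14% comes from, here is a minimal sketch using the normal-approximation (Wald) interval for a pass rate. The choice of interval is an assumption (the tracker may compute its CI differently), and `ci_half_width` and `runs_needed` are hypothetical helper names.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value of the standard normal


def ci_half_width(pass_rate: float, n: int, z: float = Z_95) -> float:
    """Half-width of the normal-approximation (Wald) CI for a proportion."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n)


def runs_needed(margin: float, pass_rate: float = 0.5, z: float = Z_95) -> int:
    """Smallest n whose Wald half-width is at most `margin`."""
    return math.ceil(z * z * pass_rate * (1.0 - pass_rate) / (margin * margin))


if __name__ == "__main__":
    print(f"N=50, pass rate 0.5: +/- {ci_half_width(0.5, 50):.3f}")
    print(f"runs needed for +/- 0.05: {runs_needed(0.05)}")
```

With N=50 and a pass rate near 50%, the half-width comes out around 0.139, in line with the range quoted above. Getting it down to +/- 5% would take roughly 385 runs under the same assumptions.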
fuckinpuppers an hour ago
I had a blast having all the major models figure out the best strategy for themselves inside of Cursor, with cursorrules, AGENTS.md, .cursor/rules/ .mdc files or whatever, and learned some interesting things. For example, a model won't reliably follow every instruction even when it's told to.
Seems like the progressive disclosure approach is the best for context efficiency; I wound up with a somewhat tight, generic AGENTS.md plus individual .cursor/rules files with glob matching on file names. Cursor honored those well.
I must have spent a couple hundred on the company dime having the models rephrase/rewrite instructions or change where they lived, and work out what made sense as a skill vs. a rule, trying to keep things as portable as possible. At this point the Cursor-specific files would need to be ported if we ever moved to a different agent/framework. But the content should be pretty solid.
It was an interesting (and productive) exploration for me.
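To make the glob-matching idea concrete, here is a minimal sketch of selecting which rule sets to load for a given file. It uses Python's `fnmatch` semantics as a stand-in; Cursor's actual matching may differ, and the rule names and patterns in `RULES` are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical rule sets and the glob patterns that trigger them.
# Note: in fnmatch, "*" also matches "/" so "*.ts" matches nested paths.
RULES = {
    "typescript": ["*.ts", "*.tsx"],
    "tests": ["*_test.py", "*.test.ts"],
    "sql": ["migrations/*.sql"],
}


def rules_for(path: str) -> list[str]:
    """Return the names of rule sets whose globs match `path`."""
    return [
        name
        for name, globs in RULES.items()
        if any(fnmatchcase(path, pattern) for pattern in globs)
    ]


if __name__ == "__main__":
    for path in ("src/app.test.ts", "migrations/001.sql", "README.md"):
        print(path, "->", rules_for(path))
```

Only the matching rule sets would be added to the context for a given edit, which is the context-efficiency win: the generic AGENTS.md stays small and the specifics load on demand.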
jauntywundrkind 37 minutes ago
The fine-tuning, where we run tests/experiments again and again and again on our prompts and our set-ups: I'm really looking forward to when we can start to compare our amalgamated rigs and harnesses and prompts, all these systems. We are guided by intuition, by a desire for the structure & clarity & direction we think we add. But we lack common tools to assess and compare.
And even when we do compare, the thermal values, the entropy of our systems, can alone lead us down very different paths, even when all the rigging is controlled. (Which implies we need multiple runs of each experiment to compare against.)
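One way to read "multiple experiments" is as repeated runs per setup, followed by an interval on the difference in pass rates. A minimal sketch, assuming independent pass/fail runs and the normal approximation; `diff_ci` is a hypothetical helper, and the counts in the usage lines are made-up illustration values, not measurements.

```python
import math


def diff_ci(pass_a: int, n_a: int, pass_b: int, n_b: int, z: float = 1.96):
    """Difference in pass rates (B minus A) with a normal-approximation CI."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    # Standard error of the difference of two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: setup A passes 30 of 50 runs, setup B passes 35 of 50.
diff, low, high = diff_ci(30, 50, 35, 50)
print(f"diff={diff:+.3f}, 95% CI=({low:+.3f}, {high:+.3f})")
```

With these illustrative counts the interval spans zero, so a 10-point gap at 50 runs per setup would not be enough to say one rig is better than the other.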