How to benchmark persistent repo memory for coding agents (autoloops.ai)

2 points by kushalpatil07 3 hours ago | 1 comment

kushalpatil07 3 hours ago
Greplica is a context layer for your coding agents. It stores information about your current architecture, decisions, nuances, etc., drawn from your code and sessions, and gives it to your agent before it starts exploring. This is the kind of information you would give a dev when explaining how a particular thing works. The idea is that if we can maintain this information, the agent will not need to grep through 100 files to rediscover the same thing, which saves tokens and time, and it can use prior decision history to improve the coding itself.
The benchmark is built from the SWE-Chat dataset, which consists of real coding sessions of users on open source projects.
The benchmark setup is temporal:
take prior coding-agent sessions from a repo
build memory only from those prior sessions
hold out a later session from the same repo
run the same planning task at the same pre-task commit
compare baseline vs memory-assisted agent
The held-out session is not used while building memory.
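To make the temporal setup concrete, here is a minimal sketch of the split in Python. The `Session` fields and the `temporal_split` helper are hypothetical names for illustration, not Greplica's actual data model, and the sketch assumes the latest session in a repo is the one held out.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Session:
    # Hypothetical fields, assumed for this sketch only.
    session_id: str
    started_at: datetime
    base_commit: str  # commit the repo was at before the session's task began


def temporal_split(sessions: list[Session]) -> tuple[list[Session], Session]:
    """Return (prior sessions for building memory, held-out later session)."""
    if len(sessions) < 2:
        raise ValueError("need at least one prior session and one held-out session")
    ordered = sorted(sessions, key=lambda s: s.started_at)
    held_out = ordered[-1]
    # Keep only sessions strictly earlier than the held-out one,
    # so nothing from the held-out session can leak into memory.
    prior = [s for s in ordered if s.started_at < held_out.started_at]
    return prior, held_out


# Made-up sessions, not taken from the benchmark.
sessions = [
    Session("s2", datetime(2025, 1, 12), "bbb222"),
    Session("s1", datetime(2025, 1, 5), "aaa111"),
    Session("s3", datetime(2025, 2, 1), "ccc333"),
]
prior, held_out = temporal_split(sessions)
```

Both the baseline and the memory-assisted agent would then be run at `held_out.base_commit`, with memory built only from `prior`.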
The agent only gets access to repo memory created from earlier work: architectural facts, subsystem behavior, gotchas, failed attempts, implementation notes, constraints, etc. Each memory item is tied back to evidence from files/commits/sessions.
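As a rough illustration of what "tied back to evidence" could look like, here is a small sketch of a memory record. The class names, fields, and category strings are assumptions made up for this example, not Greplica's real schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    kind: str  # assumed values: "file", "commit", or "session"
    ref: str   # a path, commit hash, or session id


@dataclass
class MemoryItem:
    category: str  # e.g. "architectural_fact", "gotcha", "failed_attempt"
    claim: str
    evidence: list[Evidence] = field(default_factory=list)

    def is_grounded(self) -> bool:
        # A claim with no evidence cannot be checked against the repo.
        return len(self.evidence) > 0


# Made-up example item; the claim and references are illustrative only.
item = MemoryItem(
    category="gotcha",
    claim="Retry logic lives in the HTTP client wrapper, not in the callers.",
    evidence=[Evidence("file", "src/http/client.py"), Evidence("session", "s1")],
)
ungrounded = MemoryItem(category="constraint", claim="Unverified note")
```

Keeping evidence on each item lets the agent verify a claim against the referenced file, commit, or session instead of trusting it blindly.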
On the selected 10 high-context planning tasks, Greplica reduced:
cost by 43%
tokens by 49%
tool calls by 36%
elapsed planning time by 26%
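For anyone reproducing numbers like these, note that "reduced by X%" can be aggregated in more than one way. The sketch below shows two options; the post does not say which aggregation was used, and the token counts are made up for illustration, not benchmark data.

```python
def percent_reduction(baseline: float, with_memory: float) -> float:
    """Percent by which with_memory is lower than baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - with_memory) / baseline


def reduction_of_totals(baseline: list[float], with_memory: list[float]) -> float:
    # Sum across tasks first, then compare: large tasks dominate.
    return percent_reduction(sum(baseline), sum(with_memory))


def mean_of_reductions(baseline: list[float], with_memory: list[float]) -> float:
    # Compare per task, then average: every task counts equally.
    per_task = [percent_reduction(b, m) for b, m in zip(baseline, with_memory)]
    return sum(per_task) / len(per_task)


# Hypothetical token counts for two tasks (not from the benchmark).
baseline_tokens = [1000.0, 100.0]
memory_tokens = [500.0, 90.0]

by_totals = reduction_of_totals(baseline_tokens, memory_tokens)
by_task = mean_of_reductions(baseline_tokens, memory_tokens)
```

With these made-up inputs the two aggregations disagree noticeably, so it is worth stating which one a headline number uses.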
We tried to benchmark on coding tasks as well, but that is difficult because coding trajectories can vary a lot: one agent might run tests every time it writes code, while another may not.
There were other interesting results as well. They are not polished yet, but I would love to share them.
Variance:
Running the same task multiple times without memory can produce very different planning traces.
Sometimes the agent finds the right subsystem quickly.
Sometimes it burns a lot of tokens exploring irrelevant files, gets anchored on the wrong abstraction, or only discovers the important context late in the run.
That makes single-run agent benchmarks pretty noisy.
Memory seems to reduce this variance because the early part of planning changes. The agent is no longer doing broad repo archaeology from zero. It starts with a smaller set of relevant claims, then uses repo exploration to verify and fill gaps.
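One simple way to quantify this is to repeat each task several times and compare the spread of a metric such as token usage. The sketch below uses the coefficient of variation (standard deviation divided by mean); that choice of statistic and the run values are assumptions for illustration, not what the benchmark reports.

```python
import statistics


def run_spread(token_counts: list[int]) -> dict[str, float]:
    """Summarize repeated runs of the same task."""
    if len(token_counts) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(token_counts)
    stdev = statistics.stdev(token_counts)  # sample standard deviation
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


# Hypothetical token counts from four repeated runs of one task.
baseline_runs = [40_000, 95_000, 52_000, 120_000]
memory_runs = [38_000, 42_000, 45_000, 40_000]

baseline_spread = run_spread(baseline_runs)
memory_spread = run_spread(memory_runs)
```

A lower `cv` for the memory-assisted runs would match the claim above; reporting it next to the mean makes a benchmark less dependent on a single lucky or unlucky run.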
Greplica vs docs-folder
The second thing we are benchmarking now is Greplica vs a docs-folder baseline.
The obvious baseline is:
“Why not just write all prior session memory into markdown files and let the agent read them?”
At small docs sizes, this actually works quite well.
Quality is similar. Token usage is also similar. There are only a few files, so the agent can cheaply scan them.
But as more sessions are ingested, the docs-folder baseline degrades badly. We saw this in cases where the number of ingested sessions grew from 3 to 11.
Greplica improves because there is more prior engineering context to retrieve from, and there is an optimized retrieval pipeline that gets you relevant stuff.
The docs folder gets worse on token usage because it slowly becomes another codebase. The agent now has to search the docs, rank relevance, detect stale notes, resolve conflicts, and decide which facts to consider.
So the bottleneck moves from storage to retrieval: keeping a docs folder useful slowly turns into a retrieval problem.
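To illustrate the difference between "read every note" and "retrieve only what is relevant", here is a deliberately naive top-k sketch based on word overlap. It is not Greplica's retrieval pipeline, which the post does not describe; the function names, notes, and query are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    # Lowercased word set; a real pipeline would use something stronger.
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def top_k(query: str, notes: list[str], k: int = 3) -> list[str]:
    """Return up to k notes sharing the most words with the query."""
    query_words = tokenize(query)
    scored = []
    for note in notes:
        overlap = len(query_words & tokenize(note))
        if overlap > 0:  # drop notes with nothing in common
            scored.append((overlap, note))
    # Stable sort: notes with equal overlap keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [note for _, note in scored[:k]]


# Made-up notes standing in for memory built from prior sessions.
notes = [
    "Auth token refresh runs in a background worker",
    "CSS build uses postcss",
    "Refresh fails if the clock skew check rejects the token",
]
relevant = top_k("why does the auth token refresh fail", notes, k=2)
```

With a docs folder, the agent effectively has to do this ranking itself on every task by reading files; a retrieval layer does it once and hands over only the top results.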
Repo: https://github.com/Autoloops/greplica