Show HN: Calx – track and compile corrections humans make with AI agents (github.com)

1 point by spenceships 3 hours ago | 1 comment

spenceships 2 hours ago
Here's some context on how this happened:
The origin was accidental. I was building a startup (AI career translation platform), not running an experiment. The correction logs were just how I managed the agents.
When the transfer failed, it honestly didn't occur to me until well afterward that I had measured it at all. I was pivoting the platform to go fully agentic and had burned through roughly 1.9B tokens in about 4 days, so I did an audit to see what had fallen through the cracks. That audit was when I began to realize what I had found. At that point writing the paper just made sense, because I hadn't seen anyone else talking about it.
What surprised me the most: architectural corrections (changing how something is structured) had zero recurrence. Process corrections ("always do X before Y") had roughly 50% persistence, with recurring failure chains. One correction chain went eight entries deep, each referencing the previous ones. The agent kept making the same category of mistake with slight variations.
HyperAgents landing the same week I was writing this up was genuinely lucky timing, and I didn't find out about it until last week. In my opinion, their imp@50 = 0.630 on math (where traditional transfer scored 0.0) is the clearest evidence that the mechanism vs documentation distinction is real and measurable.
What I'd love feedback on:
- Is the MCP server the right distribution mechanism, or do people want this as IDE plugins? I have always strongly believed in meeting people where they are when it comes to Open Source, but I'm curious what this community thinks.
- The recurrence detection uses Jaccard similarity on keyword sets. This is simple and works for my data, but I suspect it breaks on large teams. Anyone have experience with correction clustering at scale?
- The paper methodology is N=1. HyperAgents converged on the boundary but it doesn't account for everything. I know the limitations. If anyone wants to replicate with their own correction logs, the framework is designed for it and I'd actively help. I am quite eager to have people mess around with the tool and let me know their thoughts.
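For anyone who hasn't used it, here is a minimal sketch of Jaccard-based recurrence detection on keyword sets. It is illustrative only and not the Calx implementation: the tokenizer, the stopword list, the 0.5 threshold, and the names `keywords`, `jaccard`, and `find_recurrences` are all assumptions made for the example.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "in"}


def keywords(text):
    """Lowercase the text, split on non-alphanumerics, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9_]+", text.lower())
    return frozenset(t for t in tokens if t not in STOPWORDS)


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_recurrences(corrections, threshold=0.5):
    """Return (earlier_index, later_index, score) for every pair of
    corrections whose keyword sets score at or above the threshold."""
    keyed = [keywords(c) for c in corrections]
    hits = []
    for j in range(len(keyed)):
        for i in range(j):
            score = jaccard(keyed[i], keyed[j])
            if score >= threshold:
                hits.append((i, j, score))
    return hits


# Made-up correction log entries, purely for demonstration.
log = [
    "run migrations before deploying the api",
    "run the migrations before deploying api again",
    "rename the user table column",
]
recurrences = find_recurrences(log)
```

In this sketch every new correction is compared against all earlier ones, so the number of comparisons grows quadratically with the size of the log. That is one plausible place for the approach to strain on larger teams, alongside vocabulary drift between people describing the same mistake in different words.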
As a note, I am still in the process of shipping the hook and orchestration methodology to work with the MCP server, and at the time of this writing I'm about a third of the way through the build. I'm hoping to have it live and packaged by morning EST.
Happy to answer questions about the correction dynamics, the MCP architecture, or anything in the paper.