The part I care about most: every 5 minutes, a loop scores each agent response on confidence. Low-confidence ones get flagged for you to review. When you correct something, that correction goes into the agent's context for future responses. Not fine-tuning -- just feeding corrections back as structured context. After a few days, the agent stops repeating the same bad answers.
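To make "corrections as structured context" concrete, here is a minimal TypeScript sketch of how stored corrections could be rendered into a block that gets prepended to the agent's prompt. The `Correction` shape, the `buildCorrectionContext` name, and the default limit of 20 are all assumptions for illustration, not the product's actual schema.

```typescript
// Hypothetical shape of one stored correction (illustrative only).
interface Correction {
  prompt: string;      // what the agent was asked
  badResponse: string; // what the agent originally answered
  corrected: string;   // what the reviewer changed it to
}

// Render the most recent corrections as plain text that can be prepended
// to the agent's prompt. Capping at `limit` keeps the context from growing
// without bound; the default of 20 is an arbitrary assumption.
function buildCorrectionContext(corrections: Correction[], limit = 20): string {
  const recent = limit > 0 ? corrections.slice(-limit) : [];
  if (recent.length === 0) return "";
  const items = recent.map((c, i) =>
    [
      `Correction ${i + 1}:`,
      `  Asked: ${c.prompt}`,
      `  Avoid: ${c.badResponse}`,
      `  Prefer: ${c.corrected}`,
    ].join("\n")
  );
  return ["Past corrections from the reviewer:", ...items].join("\n");
}
```

Because the corrections stay as plain data, they can be inspected, trimmed, or reordered without touching the model itself.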
I'm dogfooding it with an agent called Scribe that posts to X for me. Scribe was terrible for the first ~80 interactions. Now it's mostly fine. The cold start period is real and I haven't figured out how to shorten it.
What works: Telegram responses in under 2 seconds. Swap between GPT-5.2, Claude Opus 4.6, and Gemini without reconfiguring. The feedback loop does what I wanted.
What doesn't: Discord and WhatsApp aren't hooked up. No way to export learned corrections (lock-in problem I need to solve). Observability dashboard exists but only I can see it right now.
$29/mo, $10 in AI credits included, 14-day trial. Stack is Node.js on Fly.io.
Curious what people think of the confidence-scoring approach. Anything above 0.8 auto-approves; anything below gets queued for human review. Should I give users that threshold control, or is one knob enough?
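For reference, the routing rule boils down to something like the sketch below. The names (`routeResponse`, `RoutingConfig`) are made up for illustration, and two behaviors are assumptions the description does not settle: a score of exactly 0.8 goes to review, and a malformed score is treated as low confidence.

```typescript
type Route = "auto-approve" | "human-review";

// Hypothetical config object: exposing `threshold` here is what a
// user-facing "knob" would look like.
interface RoutingConfig {
  threshold: number;
}

const DEFAULT_CONFIG: RoutingConfig = { threshold: 0.8 };

function routeResponse(
  confidence: number,
  config: RoutingConfig = DEFAULT_CONFIG
): Route {
  // Assumption: NaN, Infinity, or out-of-range scores count as low confidence.
  if (!isFinite(confidence) || confidence < 0 || confidence > 1) {
    return "human-review";
  }
  // Assumption: "above 0.8" is strict, so a score equal to the threshold
  // is queued for review rather than auto-approved.
  return confidence > config.threshold ? "auto-approve" : "human-review";
}
```

With the threshold as a config field, a per-user override is just `routeResponse(score, { threshold: 0.9 })`.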
Re: the cold start problem -- have you tried seeding with a few manually-written 'ideal response' examples instead of starting from zero? In my experience with agent management, giving agents even 5-10 reference outputs dramatically reduces the ramp-up period. Essentially turning the cold start into a warm start.
The corrections-as-structured-context approach (vs fine-tuning) is the right call for this stage. Fine-tuning is expensive and brittle. Corrections stored as structured context are portable, inspectable, and versionable. That would also address your lock-in concern -- the corrections are just data, so you can export them as JSON.
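A rough sketch of what a versioned JSON export could look like in TypeScript. The field names, the `schemaVersion` convention, and both function names are my assumptions, not anything from your product.

```typescript
// Hypothetical record shape for one stored correction.
interface StoredCorrection {
  id: string;
  original: string;  // the agent's original response
  corrected: string; // the reviewer's replacement
  createdAt: string; // ISO 8601 timestamp
}

// Wrapping the list in an envelope with a schema version lets the format
// change later without breaking older exports.
interface CorrectionExport {
  schemaVersion: number;
  exportedAt: string;
  corrections: StoredCorrection[];
}

function exportCorrections(
  corrections: StoredCorrection[],
  now: Date = new Date()
): string {
  const payload: CorrectionExport = {
    schemaVersion: 1,
    exportedAt: now.toISOString(),
    corrections,
  };
  return JSON.stringify(payload, null, 2);
}

function importCorrections(json: string): StoredCorrection[] {
  const parsed = JSON.parse(json) as Partial<CorrectionExport>;
  // Only the envelope is checked here; individual records are trusted as-is.
  if (parsed.schemaVersion !== 1 || !Array.isArray(parsed.corrections)) {
    throw new Error("Unsupported or malformed corrections export");
  }
  return parsed.corrections;
}
```

A round trip through `exportCorrections` and `importCorrections` gives users a file they can diff, commit to version control, or move to another tool.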
One suggestion: consider adding a 'correction categories' feature. After a while, you'll notice patterns (tone too formal, wrong audience assumptions, missing context). Categorizing corrections could let you surface systemic issues rather than fixing one-offs.
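As a sketch of the idea, assuming each correction carries a category tag: tallying them and sorting by count would show which kind of mistake dominates. The category list and the names below are placeholders I made up, loosely based on the patterns mentioned above.

```typescript
// Placeholder categories; a real list would come from observed patterns.
type Category = "tone" | "audience" | "missing-context" | "other";

interface CategorizedCorrection {
  category: Category;
  note: string; // free-text description of what was wrong
}

// Count corrections per category and return the non-empty categories,
// most frequent first.
function countByCategory(
  corrections: CategorizedCorrection[]
): Array<[Category, number]> {
  const counts: Record<Category, number> = {
    tone: 0,
    audience: 0,
    "missing-context": 0,
    other: 0,
  };
  for (const c of corrections) {
    counts[c.category] += 1;
  }
  const categories = Object.keys(counts) as Category[];
  return categories
    .filter((k) => counts[k] > 0)
    .map((k): [Category, number] => [k, counts[k]])
    .sort((a, b) => b[1] - a[1]);
}
```

If "tone" keeps topping that list, the fix probably belongs in the agent's base prompt rather than in yet another individual correction.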