Gitar runs multiple specialized AI agents on every code change. They review code, fix CI failures, execute custom repository rules as workflows, and respond to developer feedback in-thread. That's easily 50-100 LLM calls per PR, and complex ones can hit 500+. Over a weekend, we tried swapping Claude for Kimi K2.5 at 1/5th the price.
Three things bit us: finish_reason semantics differ between "compatible" providers, the model retried identical failing tool calls instead of adapting, and provider failover invalidated prompt caches on both sides.
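For anyone wanting to guard against the first two, here is a rough Python sketch, not our actual code. It assumes OpenAI-style response dicts (a `choice` with a `message` that may carry `tool_calls`); the names `wants_tool_execution`, `call_fingerprint`, and `RepeatedFailureGuard` are made up for illustration. The idea is to decide whether to run tools from the message contents rather than trusting `finish_reason`, and to fingerprint failed tool calls so an identical retry can be cut off and replaced with a nudge to try something different.

```python
import hashlib
import json
from collections import Counter


def wants_tool_execution(choice: dict) -> bool:
    """Decide from the message itself, not finish_reason, whether tools must run.

    Assumption: the provider returns an OpenAI-style choice dict.
    """
    message = choice.get("message") or {}
    return bool(message.get("tool_calls"))


def call_fingerprint(name: str, arguments: dict) -> str:
    """Stable hash of a tool call: same name and arguments give the same key."""
    # sort_keys makes the hash independent of argument ordering.
    canonical = json.dumps({"name": name, "arguments": arguments}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RepeatedFailureGuard:
    """Counts failures per identical tool call and flags the ones to stop retrying."""

    def __init__(self, max_identical_failures: int = 2):
        self.max_identical_failures = max_identical_failures
        self._failures = Counter()

    def record_failure(self, name: str, arguments: dict) -> None:
        self._failures[call_fingerprint(name, arguments)] += 1

    def should_block(self, name: str, arguments: dict) -> bool:
        # Counter returns 0 for fingerprints it has never seen.
        count = self._failures[call_fingerprint(name, arguments)]
        return count >= self.max_identical_failures
```

In an agent loop, you would call `should_block` before dispatching each tool call and, when it returns true, skip execution and return a tool result telling the model that the same call already failed. The threshold of 2 is arbitrary.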
Curious if others have hit similar issues.