I had no idea whether our prompting strategy was inefficient or whether everyone else was paying this much.
Built a quick benchmarking tool: https://local001.com/tokens
Submit your weekly spend + provider + use case → see your percentile + comparisons.
The dataset is early — it gets more useful the more people submit. But here's why I built this:
We're spending $1,100/week on Anthropic for a mix of coding agents and personal assistant tasks. I have no idea if that's normal or insane. Specifically:
Are we overspending by use case? Our coding agent burns ~$700/week and the assistant tasks burn ~$400. But I don't know what "good" looks like. Is $700/week for an agentic coding workflow competitive? Are teams doing similar work at $200? $2,000? There's zero public data on this.
Are we overspending on Anthropic? We're all-in on Claude right now. For coding tasks, maybe that's the right call. But for assistant/chat workflows — should we be routing half of that to GPT-4o or Gemini and cutting costs 60%? I genuinely don't know, and I haven't seen anyone publish real cost comparisons by task type, not just benchmark scores.
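To make the routing question concrete, here's the back-of-envelope math I'd want the benchmark data to validate. All prices and token volumes below are illustrative placeholders I made up for the sketch, not real list prices or our actual usage:

```ts
// Back-of-envelope weekly cost for a given model price and token volume.
// Prices are placeholders; plug in current numbers from each provider's pricing page.
interface ModelPrice {
  inputUsdPerMTok: number;  // $ per million input tokens
  outputUsdPerMTok: number; // $ per million output tokens
}

function weeklyCost(price: ModelPrice, inputMTok: number, outputMTok: number): number {
  return price.inputUsdPerMTok * inputMTok + price.outputUsdPerMTok * outputMTok;
}

// Hypothetical assistant workload: 40M input + 8M output tokens per week.
const current = weeklyCost({ inputUsdPerMTok: 3, outputUsdPerMTok: 15 }, 40, 8);  // $240
const cheaper = weeklyCost({ inputUsdPerMTok: 1.5, outputUsdPerMTok: 4.5 }, 40, 8); // $96
console.log(`savings: ${Math.round((1 - cheaper / current) * 100)}%`);              // 60%
```

The point isn't the specific numbers; it's that the answer depends entirely on your input/output token split and the price tier you're comparing against, which is exactly the data nobody publishes per task type.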
That's what this tool is for. Submit your weekly spend, provider, and use case → see where you land. If 50 teams submit data, we'll finally have a real answer to "is Anthropic worth the premium for X?"
Open questions:
Should we track tokens/$ instead of just $?
Should we separate o1-style reasoning models from base models?
How do you benchmark "efficiency" vs raw spend?
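On the tokens/$ and "efficiency" questions, one option I've been considering is normalizing each submission into cost per million tokens (or its inverse, tokens per dollar), plus an optional cost-per-unit-of-work number. A rough sketch, with made-up field names and made-up example numbers, not the current form schema:

```ts
// Normalize a weekly submission into comparable efficiency metrics.
interface Submission {
  weeklySpendUsd: number;   // e.g. 1100
  weeklyTokens?: number;    // total input + output tokens, if the team tracks it
  tasksCompleted?: number;  // optional: PRs merged, tickets resolved, etc.
}

function efficiencyMetrics(s: Submission) {
  return {
    // $ per million tokens: comparable across providers and model tiers
    usdPerMTok: s.weeklyTokens ? (s.weeklySpendUsd / s.weeklyTokens) * 1_000_000 : null,
    // tokens per dollar: the inverse framing of the same metric
    tokensPerUsd: s.weeklyTokens ? s.weeklyTokens / s.weeklySpendUsd : null,
    // $ per unit of work: closer to "efficiency" than raw spend, but harder to collect
    usdPerTask: s.tasksCompleted ? s.weeklySpendUsd / s.tasksCompleted : null,
  };
}

// Example with invented numbers: $1,100/week over 180M tokens ≈ $6.1 per MTok
console.log(efficiencyMetrics({ weeklySpendUsd: 1100, weeklyTokens: 180_000_000 }));
```

The catch is that most teams know their dollar spend but not their token counts, so I'm unsure whether asking for tokens would tank submission rates.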
Built with Next.js + Cloudflare Workers + D1. Submissions are anonymous (just hashed IPs).
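For anyone curious what the submission path looks like: roughly a Worker that hashes CF-Connecting-IP and writes a row to D1. This is a simplified sketch, not the actual code; the binding name, table schema, and field names here are invented:

```ts
// Sketch of the submission endpoint: hash the client IP, store the row in D1.
// Assumes a D1 binding named DB (type from @cloudflare/workers-types) and a
// hypothetical `submissions` table.
export interface Env {
  DB: D1Database;
}

async function sha256Hex(input: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(input));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== "POST") {
      return new Response("POST only", { status: 405 });
    }

    // Hash the connecting IP; the raw IP is never stored.
    const ip = request.headers.get("CF-Connecting-IP") ?? "unknown";
    const ipHash = await sha256Hex(ip);

    const body = (await request.json()) as {
      weeklySpendUsd: number;
      provider: string;
      useCase: string;
    };

    await env.DB.prepare(
      "INSERT INTO submissions (ip_hash, weekly_spend_usd, provider, use_case) VALUES (?1, ?2, ?3, ?4)"
    )
      .bind(ipHash, body.weeklySpendUsd, body.provider, body.useCase)
      .run();

    return Response.json({ ok: true });
  },
};
```

One thing I'm still debating is salting the hash, since the IPv4 space is small enough that a plain SHA-256 of an IP can be brute-forced.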
Long-term goal: use this data to negotiate bulk API rates with Anthropic/OpenAI/Google.
How would you improve this?