5 points by TimoKerr 3 hours ago | 1 comment
    Hi HN,

    TLDR: Cheap (and sometimes old) models perform on par with, or better than, flagship models on standard OCR tasks, at a fraction of the cost. This conclusion comes from a benchmark we ran across 18 models and more than 7,500 LLM calls. The leaderboard and benchmark repo are completely open source.

    Too many teams are either stuck on legacy OCR pipelines or badly overpaying for LLM calls by defaulting to the newest/biggest model.

    So we investigated the topic and open-sourced everything, including a free tool to check your own documents.

    We ran 18 models from OpenAI, Anthropic, Google, and Mistral on 42 real-world documents (invoices, receipts, bills of lading, transport orders). Each model ran 10 times per document to measure reliability, not just one-shot accuracy; 7,560 API calls total.

    The finding: for standard document extraction, mid-tier and older models match or beat state-of-the-art models at a fraction of the cost. In some cases the cost difference is multiple orders of magnitude for equivalent accuracy.

    We also track pass^n (how reliability degrades over repeated runs; see tau-bench), cost-per-success (not just cost-per-token), and critical-field accuracy. Full methodology and dataset are open source.
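    For anyone curious what those two metrics look like in practice, here is a minimal sketch (not the benchmark's actual code; function and variable names are illustrative). It uses the unbiased pass^n estimator popularized by tau-bench: for each document with c successes out of t total runs, C(c, n) / C(t, n) estimates the probability that n fresh runs would all succeed, averaged over documents.

    ```python
    from math import comb

    def pass_pow_n(results_per_doc, n):
        """Estimate pass^n: the probability that n independent runs on the
        same document all succeed. For each document with c successes out
        of t runs, the unbiased estimator is C(c, n) / C(t, n); we average
        that over documents."""
        vals = []
        for runs in results_per_doc:      # runs: list of bools, one per run
            t, c = len(runs), sum(runs)
            vals.append(comb(c, n) / comb(t, n))
        return sum(vals) / len(vals)

    def cost_per_success(total_cost, results_per_doc):
        """Dollars spent per successful extraction, not per token."""
        successes = sum(sum(runs) for runs in results_per_doc)
        return total_cost / successes if successes else float("inf")

    # Example: two documents, 10 runs each (True = correct extraction)
    docs = [[True] * 9 + [False], [True] * 10]
    print(pass_pow_n(docs, 5))            # 0.75: reliability drops fast
    print(cost_per_success(0.42, docs))   # hypothetical $0.42 total spend
    ```

    Note how a 95% one-shot success rate (19/20 runs above) already falls to 0.75 at pass^5, which is exactly why we report repeated-run reliability rather than single-shot accuracy.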

    Leaderboard: <https://www.arbitrhq.ai/leaderboards/>

    Dataset + framework (GitHub): <https://github.com/ArbitrHq/ocr-mini-bench>

    Or test your own documents for free: <https://app.arbitrhq.ai/benchmark-free>

    Built by two founders in Antwerp. Very curious whether others have reached similar conclusions, or whether you've seen specific edge cases where the flagship models still justify their price tag.