Running a single non-reasoning LLM call from source data (text/image/audio in your diagram) to structured JSON seems fragile with the current state of LLMs.
You're essentially asking the model to do two tasks in one pass: parse the input and then format the output. It's amazing it works a lot of the time, but reasonable to assume it won't all of the time.
(As a human, when I'm filling out a complex form, I'll often jump around the document.)
Curious how the benchmarks change when you add an intermediary representation, either via reasoning or an additional LLM call. I'd also love to see a comparison with BAML[1].
[0] In my experience we were using structured outputs as part of an agentic state machine, where the JSON contained code snippets (html/js/py/etc.). In the cases where we first prompted the model for the code, and then wrapped it in JSON, we saw much higher quality/success than asking for JSON straightaway.
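To make the two-step idea concrete, here is a minimal sketch. It assumes the wrapping step is done in plain code rather than by a second model call (the comment above doesn't say which), and `generate_snippet_json` and `fake_llm` are hypothetical names standing in for a real client.

```python
import json
from typing import Callable


def generate_snippet_json(task: str, language: str, llm: Callable[[str], str]) -> str:
    """Two-step generation: get raw code first, then wrap it in JSON."""
    # Step 1: ask the model for code only, with no JSON formatting burden.
    code = llm(f"Write {language} code for this task. Reply with code only.\n\nTask: {task}")
    # Step 2 (assumption: done in plain code): json.dumps handles all the
    # quoting and newline escaping, so the model never has to.
    return json.dumps({"language": language, "code": code})


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return 'print("hello")\n'


payload = generate_snippet_json("print a greeting", "py", fake_llm)
parsed = json.loads(payload)
```

The point of the split is that escaping quotes and newlines inside a JSON string is exactly the kind of thing a serializer gets right every time, so the model only has to get the code right.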
Why no Opus 4.7? Why is Gemini 3.1 Pro missing?
If there is some other criterion (e.g. models within a certain time or budget), great - just make it explicit.
When I see "Top 5 at a glance" and it's missing key frontier models, I am (at best) confused.
Gemini 3.1 and GLM 5 came out around the same time as Sonnet 4.6 (~Feb 2026) so it's strange that they are missing, but Gemini 2.5 Flash, Gemini 3 Flash, and GLM 4.7 are there.
For the benchmark, the setup was kept consistent across all models, and typically Opus and 3.1 Pro would be overkill and expensive even with reasoning off.
Good point tho, will add this point in the blog too :)
Also the benchmark is open source, so anyone can run a model on it and create a PR too, the leaderboard is dynamic and will automatically add that in.
If you want to avoid using Opus 4.7, then why GPT-5.4 (unless with a disclaimer that it is on a low reasoning setting, or a check that on medium its price is comparable with Haiku/Flash)?
Also, usually it is good to look at the newest model. Gemini 2.5 Flash is quite dated. Gemini 3.1 Flash Lite is the new one (https://openrouter.ai/google/gemini-3.1-flash-lite-preview).
While most models were great at producing JSON schema, they were pretty bad at producing accurate values.
In the graph you'll see there is almost a 20-30% drop between the JSON schema pass rate and the value accuracy.
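For anyone unsure how the two numbers can diverge, here is a rough sketch of scoring both on the same outputs. The benchmark's real scoring isn't described in this thread, so the exact-match rule, the `schema_ok` and `score` helpers, and the sample data are all assumptions for illustration.

```python
import json


def schema_ok(obj, required_types):
    """True if obj is a dict with every required key holding the expected type."""
    return isinstance(obj, dict) and all(
        key in obj and isinstance(obj[key], typ) for key, typ in required_types.items()
    )


def score(raw_outputs, expected, required_types):
    """Return (schema pass rate, value accuracy) over all samples."""
    schema_pass = 0
    values_correct = 0
    for raw, want in zip(raw_outputs, expected):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not even parseable JSON
        if not schema_ok(obj, required_types):
            continue
        schema_pass += 1
        if obj == want:  # assumption: exact match counts as accurate
            values_correct += 1
    n = len(expected)
    return schema_pass / n, values_correct / n


required = {"name": str, "total": float}
outputs = [
    '{"name": "ACME", "total": 12.5}',
    '{"name": "ACME", "total": 21.5}',  # right shape, wrong value
    '{"name": "Initech", "total": 3.0}',
    "not json",
]
expected = [
    {"name": "ACME", "total": 12.5},
    {"name": "ACME", "total": 12.5},
    {"name": "Initech", "total": 3.0},
    {"name": "Globex", "total": 1.0},
]
schema_rate, value_rate = score(outputs, expected, required)
```

In this made-up sample three of four outputs pass the schema check but only two carry the right values, which is the same kind of gap the graph shows.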
Check out the paper section "6.3 Structured Decoding Ablation"
Paper: https://arxiv.org/pdf/2604.25359
We ran the comparison and saw no difference. Since some models don't support structured decoding, we used greedy decoding on all models to keep the bench consistent.
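As a toy illustration of the difference being discussed: plain greedy decoding takes the highest-scoring next token, while structured (constrained) decoding first masks out tokens the schema would not allow at that position. The scores, tokens, and `greedy_pick` helper below are invented for the example and are not taken from the paper.

```python
def greedy_pick(scores, allowed=None):
    """Pick the highest-scoring token, optionally restricted to an allowed set."""
    if allowed is None:
        candidates = scores
    else:
        # Structured decoding: drop tokens the grammar/schema forbids here.
        candidates = {tok: s for tok, s in scores.items() if tok in allowed}
    return max(candidates, key=candidates.get)


# Hypothetical next-token scores after the prefix '{"age": '
scores = {'"thirty"': 2.1, "30": 1.9, "}": 0.3}
# Hypothetical set of tokens a number-typed field would permit at this point.
number_tokens = {"30"}

unconstrained = greedy_pick(scores)
constrained = greedy_pick(scores, allowed=number_tokens)
```

Here the unconstrained pick is the string token and the constrained pick is the number. Note the constraint only guarantees a valid shape, it says nothing about whether 30 is the right value.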
> Our goal is to be the best general model for deterministic tasks
I'm sorry but this simply doesn't make sense. If you want a deterministic output don't use an LLM.
I am hopeful deterministic output will return, though; DeepSeek v4 claims to have implemented "bitwise batch-invariant and deterministic kernels," though I haven't tested it myself.
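One commonly cited source of run-to-run differences is that floating-point addition is not associative, so summing the same numbers in a different order can change the low bits. Whether this is the specific issue those kernels address is an assumption on my part; the snippet only shows the underlying arithmetic effect.

```python
a, b, c = 0.1, 0.2, 0.3

# Same three numbers, two different summation orders.
left_to_right = (a + b) + c
right_to_left = a + (b + c)

# The results differ in the last bits even though the math is "the same".
orders_agree = left_to_right == right_to_left
```

If a reduction inside a kernel is split differently depending on batch size, the same effect can show up as slightly different logits for the same prompt.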
Reproducible does not mean deterministic. You cannot determine in advance what a prompt will give as output, even with a temperature of 0 and a fixed seed, therefore they are not deterministic.
Many developer workflows use LLMs to produce structured artifacts because of their flexibility in consuming unstructured inputs.
> "don't use an LLM"
Partially agree, that's what we're building towards at interfaze.ai: a hybrid between transformers (LLMs) and traditional CNN/DNN architectures to solve this problem of "deterministic" output. This gives devs the flexibility of custom schema definitions and unstructured input while still getting high-quality structured output like you would get from a CNN model like EasyOCR.
The industry is moving toward using LLMs for more and more deterministic tasks, so this benchmark allows us to measure it now.