Interfaze: A new model architecture built for high accuracy at scale(interfaze.ai)

92 pointsby yoeven6 hours ago11 comments

schanz2 hours ago
Amazing!
I just tried the OCR capabilities with a photo of a DIN A4 page which was written with a typewriter. The image isn't the easiest to interpret. The text perspective is distorted because the page is part of a book and the page margin toward the spine of the book is very small. There are also many inline corrections due to typing errors while the page was written (backspace couldn't erase characters back then, and arrow keys couldn't be used to add text in between existing words). Over the past months I've tried to use several LLMs on this very same image already (1 out of 200 pages that seek digitization). The result is by far the most accurate so far. Only some very minor errors (which are also non-trivial for human translators) were made.
This page induced costs of about 25 cent. I assume I could tweak the input image a little more to consume less input tokens. OCR-ing all 200 pages would otherwise cost a juicy 50$ - although there is a generous 20$ of free credits.
Induced cost: 108.8k Input tokens => 16,32 cent 24.5k Output tokens => 8,58 cent
// Edit: I just re-tried the same task utilizing a capability of the API to only run a specific part of the model (e.g. _only_ OCR). This cuts cost by 3x (to ~8c/page) but significantly worsens the result. The result is missing entire lines of the original document. There are also many error in the text that was recognized.
- AnthonyRan hour ago
  Have you tried this task using an actual OCR model like Google Cloud Vision AI? I am not sure if this is what Gemini uses under the hood but multi-modal LLMs are not designed to extract text like this so it should be no surprise it's not good at it?
euroderf4 hours ago
Potentially stupid question: Does that mean we can chain them together line UNIX command line programs ? That would be so, so intuitive.
wood_spirit5 hours ago
> These are deep neural network architectures that are task-specific for things like OCR, translation, or GUI detection. The way they consume and see data is trained to be task specific, which makes them up to 100x more accurate at their specific task. They also produce useful metadata like bounding boxes and confidence scores, letting developers build predictable workflows they can rely on.
Does code extraction and manipulation fit in that? Would interfaze be the agent that a coding agent uses?
- yoeven3 hours ago
  Code extraction maybe, not something we have tested or built for but you could give it a try.
  Code manipulation probably not since it's a lot smaller of a model compared to a Claude Opus which is SOTA for code generation/manipulation.
  Generally code generation is a non-deterministic task by nature and general LLMs tend to be better at them.
  - wood_spirit3 hours ago
    The idea of what to change is perhaps an llm task but the job of doing the find replace and that kind of tooling is something LLMs actually struggle with and have all kinds or crutches and try retry loops to paste over in coding agents etc.
andai3 hours ago
This is very cool, though I don't understand exactly what they've done here. Is it some kind of LLM with convolutional layers added?
The graph doesn't exactly make it clear but it describes a pipeline that goes beyond the LLM, so the CNN could be a separate model there.
- tomsyourunclean hour ago
  Here’s the academic paper behind it: https://arxiv.org/abs/2602.04101
fraywing3 hours ago
So is this basically a task-specific MoA transformer arch with a DNN that helps make routing decisions? Trying to understand this.
florians2 hours ago
What I want are precise and tight bounding boxes. Why is this so difficult?
- philipkglass2 hours ago
  The PP-DocLayoutV3 [1] bounding boxes are pretty good in my experience, if you want boxes around individual document headings or paragraphs. If you want boxes around individual words, similar to what's shown in the Interfaze screen shot [2], Apple has a LiveText "token" model that's proprietary but free/bundled with macOS and iOS. There are easy to use Python bindings here: https://github.com/straussmaximilian/ocrmac
  I presume that some otherwise-great OCR models (like Chandra) have terrible bounding boxes because generating good bounding boxes just wasn't a training priority. A lot of people are using OCR models to bulk-process documents without a lot of care for how the layout is preserved. It matters a lot if (e.g.) you want to be able to update and re-print old documents, but it doesn't matter if you are just transcribing whole documents for indexing/chunking/translation.
  [1] https://huggingface.co/PaddlePaddle/PP-DocLayoutV3
  [2] https://r2public.jigsawstack.com/interfaze/examples/dense_te...
sareiodata5 hours ago
Smaller models really arent great at structured output. If this works it would be great for a local model that might not be as good but as long as it respects structured output will be vastly more useful.
- yoeven3 hours ago
  We have a full benchmark breakdown specifically on structured output that you can take a look at https://interfaze.ai/leaderboards/structured-output-benchmar...
- OutOfHere4 hours ago
  > Smaller models really arent great at structured output.
  That doesn't seem to hold true. Consider gpt-5.4-nano which supports structured output just fine.
  https://developers.openai.com/api/docs/models/gpt-5.4-nano
  It seems like a concern that's orthogonal to the model size.
  - nosyke4 hours ago
    I genuinely doubt that they are just lying though lol
sweaterkokuro5 hours ago
This is cool, Id love to be able to fine tune on this architecture. Is this something on the roadmap ever?
- yoeven3 hours ago
  It isn't on our roadmap right now since in most cases it should work out of the box and if it doesn't we'll work with you to train that into the model generally.
  However, if we see enough people who has something super niche that our model can't handle, we might start considering a fine tuning service
icemaze3 hours ago
Great in the benchmarks but not as good in the real world, sorry to say. Just gave it a try in my STT bot, it's worse than whisper
redwood3 hours ago
Similar to a large action model?
a7om_com5 hours ago
[flagged]