Across 375 samples with an LLM as a judge, Mistral scores 4.32 and marker 4.41. Marker can run inference at 20 to 120 pages per second on an H100.
You can see the samples here - https://huggingface.co/datasets/datalab-to/marker_comparison... .
The code for the benchmark is here - https://github.com/VikParuchuri/marker/tree/master/benchmark... . Will run a full benchmark soon.
Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.
For anyone else interested, prompt is here [0]. The model used was gemini-2.0-flash-001.
Benchmarks are hard, and I understand the appeal of having something that seems vaguely deterministic rather than having a human in the loop, but I have a very hard time accepting any LLM-judged benchmarks at face value. This is doubly true when we're talking about something like OCR which, as you say, is a very hard problem for computers of any sort.
I'm assuming you've given this some thought—how did you arrive at using an LLM to benchmark OCR vs other LLMs? What limitations with your benchmark have you seen/are you aware of?
[0] https://github.com/VikParuchuri/marker/blob/master/benchmark...
- Every document has ground truth text, a JSON schema, and the ground truth JSON.
- Run OCR on each document and pass the result to GPT-4o along with the JSON Schema
- Compare the predicted JSON against the ground truth JSON for accuracy.
In our benchmark, the ground truth text => gpt-4o was 99.7%+ accuracy. Meaning whenever gpt-4o was given the correct text, it could extract the structured JSON values ~100% of the time. So if we pass in the OCR text from Mistral and it scores 70%, that means the inaccuracies are isolated to OCR errors.
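A minimal sketch of that loop, assuming the OpenAI Python SDK (the helper names here are mine, not the benchmark's):

```
import json
from openai import OpenAI

client = OpenAI()

def extract_json(ocr_text: str, schema: dict) -> dict:
    """Ask GPT-4o to fill the schema from whatever text the OCR step produced."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract values matching this JSON schema: " + json.dumps(schema)},
            {"role": "user", "content": ocr_text},
        ],
    )
    return json.loads(resp.choices[0].message.content)

def field_accuracy(predicted: dict, truth: dict) -> float:
    """Fraction of ground-truth fields the extraction got exactly right."""
    keys = list(truth)
    return sum(1 for k in keys if predicted.get(k) == truth[k]) / len(keys)
```

The point is that the extraction step is held constant, so any drop below the ~99.7% ceiling is attributable to the OCR text itself.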
Edit - I see it on the Benchmark page now. Woof, low 70% scores in some areas!
And it happened with a lot of full documents as well. Ex: most receipts got classified as images, and so it didn't extract any text.
I think blockwise edit distance is better than full page (find the ground truth blocks, then infer each block separately and compare), but many providers only do well on full pages, so it wouldn't be a fair comparison.
There are a few different benchmark types in the marker repo:
- Heuristic (edit distance by block with an ordering score)
- LLM judging against a rubric
- LLM win rate (compare two samples from different providers)
None of these are perfect, but LLM judging against a rubric has matched visual inspection the best so far. I'll continue to iterate on the benchmarks. It may be possible to do a TEDS-like metric for markdown. Training a model on the output and then benchmarking it could also be interesting, but that gets away from measuring pure extraction quality (the model benchmarking better is only somewhat correlated with better parse quality). I haven't seen any great benchmarking of markdown quality, even at research labs - it's an open problem.
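For the heuristic flavor, a rough sketch of "edit distance by block with an ordering score" might look like this (my own simplification, not the actual marker benchmark code):

```
from difflib import SequenceMatcher

def block_score(gt_blocks: list[str], pred_blocks: list[str]) -> float:
    """Average per-block similarity plus a crude ordering score: each
    ground-truth block is greedily matched to its most similar prediction."""
    if not gt_blocks or not pred_blocks:
        return 0.0
    sims, matched = [], []
    for gt in gt_blocks:
        ratios = [SequenceMatcher(None, gt, p).ratio() for p in pred_blocks]
        best = max(range(len(ratios)), key=ratios.__getitem__)
        sims.append(ratios[best])
        matched.append(best)
    # Fraction of consecutive matches that stay in ground-truth order.
    in_order = sum(1 for a, b in zip(matched, matched[1:]) if b >= a)
    order_score = in_order / max(len(matched) - 1, 1)
    return 0.8 * (sum(sims) / len(sims)) + 0.2 * order_score  # weights are arbitrary
```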
to extract real data from unstructured text (like that produced from an LLM) to make benchmarks slightly easier if you have a schema
Dynamic Schema API is running. See documentation for available endpoints.
Isn't that a potential issue? You are assuming the LLM judge is reliable. What evidence do you have to assure yourself and/or others that it is a reasonable assumption?
To fight hallucinations, can't we use more LLMs and pick blocks where the majority of LLMs agree?
In that way, LLMs are more human than, say, a database or a book containing agreed-upon factual information which can be directly queried on demand.
Imagine if there was just ONE human with human limitations on the entire planet who was taught everything for a long time - how reliable do you think they are with information retrieval? Even highly trained individuals (e.g. professors) can get stuff wrong on their specific topics at times. But this is not what we expect and demand from computers.
https://i.imgur.com/jcwW5AG.jpeg
For the blocks in the center, it outputs:
> Claude, duc de Saint-Simon, pair et chevalier des ordres, gouverneur de Blaye, Senlis, etc., né le 16 août 1607 , 3 mai 1693 ; ép. 1○, le 26 septembre 1644, Diane - Henriette de Budos de Portes, morte le 2 décembre 1670; 2○, le 17 octobre 1672, Charlotte de l'Aubespine, morte le 6 octobre 1725.
This is perfect! But then the next one:
> Louis, commandeur de Malte, Louis de Fay Laurent bre 1644, Diane - Henriette de Budos de Portes, de Cressonsac. du Chastelet, mortilhomme aux gardes, 2 juin 1679.
This is really bad because
1/ a portion of the text of the previous block is repeated
2/ a portion of the next block is imported here where it shouldn't be ("Cressonsac"), as is a portion of the rightmost block ("Chastelet")
3/ but worst of all, a whole word is invented, "mortilhomme", that appears nowhere in the original. (The word doesn't exist in French, so in this case it would be easier to spot; the real risk is when invented words do exist and "feel right" in the context.)
(Correct text for the second block should be:
> Louis, commandeur de Malte, capitaine aux gardes, 2 juin 1679.)
Just a nit, but I wouldn’t call it perfect when using U+25CB ○ WHITE CIRCLE instead of what should be U+00BA º MASCULINE ORDINAL INDICATOR, or alternatively a superscript “o”. These are https://fr.wikipedia.org/wiki/Adverbe_ordinal#Premiers_adver....
There’s also extra spaces after the “1607” and around the hyphen in “Diane-Henriette”.
Lastly, U+2019 instead of U+0027 would be more appropriate for the apostrophe, all the more since in the image it looks like the former and not like the latter.
Or degree symbol. Although it should be able to figure out which to use according to the context.
I don’t think you need a reasoning model for that, just better training; although conversely a reasoning model should hopefully notice the errors — though LLM tokenization might still throw a wrench into that.
A subsequent cleanup pass that fixes grammar/spelling errors, as you propose, wouldn’t be appropriate when the goal is to faithfully reproduce the original text.
And specifically for the “white circle” character, it would be difficult to correctly infer the original ordinal markers after the fact. I myself could only do so by inspecting the original image, i.e. by having my brain redo the OCR.
I suppose that depends on why it's wrong. Did the model accurately read a real typo in the image or did it incorrectly decipher a character? If a spelling & grammar pass fixes the latter, isn't it valid?
For example, from a clear image of a printed page (in a standard font), it will give me 'cornprising' instead of 'comprising', or 'niatter' instead of 'matter'. Were it not for the spell-check underline, they'd be hard to spot: with relatively tight kerning, all the errors look like the originals.
I'm surprised, as 1) I've not had these sorts of errors before, and 2) they're not words, and real words must be heavily weighted in the OCR engine (I'd have thought).
https://i.imgur.com/1uVAWx9.png
Here's the output of the first paragraph, with mistakes in brackets:
> drafts would be laid on the table, and a long discussion would ensue; whereas a Committee would be able to frame a document which, with perhaps a few verbal emundations [emendations], would be adopted; the time of the House would thus be saved, and its business expected [expedited]. With regard to the question of the comparative advantages of The-day [Tuesday]* and Friday, he should vote for the amendment, on the principle that the wishes of members from a distance should be considered on all sensations [occasions] where a principle would not be compromised or the convenience of the House interfered with. He hoped the honourable member for the Town of Christchurch would adopt the suggestion he (Mr. Forssith [Forsaith]) had thrown out and said [add] to his motion the names of a Committee.*
Some mistakes are minor (emundations/emendations or Forssith/Forsaith), but others are very bad, because they are unpredictable and don't correspond to any pattern, and therefore can be very hard to spot: sensations instead of occasions, or expected in lieu of expedited... That last one really changes the meaning of the sentence.
If a lot of scientific papers are PDFs and have hitherto had bad conversions to text/tokens, can we expect to see major gains in training and therefore better outputs?
Specifically, this allows you to associate figure references with the actual figure, which would allow me to build a UI that solves the annoying problem of looking for a referenced figure on another page, which breaks up the flow of reading.
It also allows a clean conversion to HTML, so you can add cool functionality like clicking on unfamiliar words for definitions, or inserting LLM generated checkpoint questions to verify understanding. I would like to see if I can automatically integrate Andy Matuschak's Orbit[0] SRS into any PDF.
Lots of potential here.
A tangent, but this exact issue frustrated me for a long time with PDF readers when reading science papers. Then I found sioyek, which pops up a small window when you hover over links (references, equations, and figures), and that solved it.
Granted, the PDF file must be in the right format, so OCR could make this experience better. Just saying the UI component of that already exists.
A high level summary is that while this is an impressive model, it underperforms even current SOTA VLMs on document parsing and has a tendency to hallucinate with OCR, table structure, and drop content.
For cars we need accuracy at least 99.99% and that's very hard.
I guess something like success rate for a trip (or mile) would be a more reasonable metric. Most people have a success rate far higher than 99% for average trips.
Most people who commute daily are probably doing something like 1,000 car rides a year and have minor accidents every few years. A 99% success rate would mean monthly accidents.
However IMO, there's still a large gap for businesses in going from raw OCR outputs —> document processing deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.
You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. But the future is on the horizon!
Disclaimer: I started an LLM doc processing company to help companies solve problems in this space (https://extend.app/)
PMs want to hear that an OCR solution will be fully automated out-of-the-box. My gut says that anything offering that is snake-oil, and I try to convey that the OCR solution they want is possible, but if you are unwilling to pay the tuning cost, it’s going to flop out of the gate. At that point they lose interest and move on to other priorities.
But that doesn't mean you have to abandon the effort. You can still definitely achieve production-grade accuracy! It just requires having the right tooling in place, which reduces the upfront tuning cost. We typically see folks get there on the order of days or 1-2 weeks (it doesn't necessarily need to take months).
From the comments here, it certainly seems that for general OCR it's not up to snuff yet. Luckily, I don't have great ambitions.
I can see this working for me with just a little upfront preprocessing, now that I know where it falls over. It casually skips portions of the document and misses certain lines consistently. Knowing that, I can do a bit of massaging, feed it what I know it likes, and then reassemble.
I found in testing that it failed consistently at certain parts, but where it worked, it worked extremely well in contrast to other methods/services that I've been using.
-OR- they can just use these APIs, and considering that they have a client base - which would prefer not to rewrite integrations to get the same result - they can get rid of most of their code base, replace it with LLM API calls, increase margins by 90%, and enjoy the good life.
All of the above are things companies - particularly larger ones - are happy to pay for, because OCR is just a cog in the machine, and this makes it more reliable and predictable.
On top of the above, there are auxiliary value-adds such a vendor could provide - such as being fully compliant with every EU directive and regulation that's in force, or about to be. There's plenty of those, they overlap, and no one wants to deal with it if they can outsource it to someone who already figured it out.
(And, again, will take the blame for fuckups. Being a liability sink is always a huge value-add, in any industry.)
Also, the Colab link in the article is broken; found a functional one [2] in the docs.
[1] https://github.com/opendatalab/MinerU [2] https://colab.research.google.com/github/mistralai/cookbook/...
In any case, thanks for sharing.
- Referenced Vision 2030 as Vision 2.0.
- Failed to extract the table; instead, it hallucinated and extracted the text in a different format.
- Failed to extract the number and date of the circular.
I tested the same document with ChatGPT, Claude, Grok, and Gemini. Only Claude 3.7 extracted the complete document, while all others failed badly. You can read my analysis here [2].
1. https://rulebook.sama.gov.sa/sites/default/files/en_net_file...
2. https://shekhargulati.com/2025/03/05/claude-3-7-sonnet-is-go...
Pricing: $1/1,000 pages, or $1/2,000 pages if “batched”. I’m not sure what batching means in this case: multiple PDFs? Why not split them to halve the cost?
Anyway this looks great at pdf to markdown.
Our tool, doctly.ai, is much slower and async, but much more accurate, and gets you the content itself as markdown.
for all those people that aren't just clicking on a link on their social media feed, chat group, or targeted ad
e.g. you submit multiple requests (PDFs) in one call and get back an id for the batch. You can then check on the status of that batch and get the results for everything when done.
It lets them use their available hardware to its full capacity.
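The flow is usually something like this (the endpoints here are entirely hypothetical, just to show the shape):

```
import time
import requests

API = "https://api.example.com"            # hypothetical provider
HEADERS = {"Authorization": "Bearer <key>"}

# 1. Submit several documents at once and get back a batch id.
batch = requests.post(f"{API}/batches", headers=HEADERS,
                      json={"documents": ["a.pdf", "b.pdf"]}).json()

# 2. Poll until the provider has worked through the queue at its own pace.
while True:
    status = requests.get(f"{API}/batches/{batch['id']}", headers=HEADERS).json()
    if status["state"] in ("completed", "failed"):
        break
    time.sleep(30)

# 3. Fetch all the results in one go.
results = requests.get(f"{API}/batches/{batch['id']}/results", headers=HEADERS).json()
```

You trade latency for price: the provider schedules your pages whenever its hardware would otherwise sit idle.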
Regardless, excited to see more and more competition in the space.
Wrote an article on it: https://www.sergey.fyi/articles/gemini-flash-2-tips
I wasn't there to solve that specific problem, but it was connected to what we were doing, so it was fascinating to hear that team talk through all the things they'd tried, from brute-force training on templates (didn't scale, as they had too many kinds of forms) to every vendor solution under the sun (none worked quite as advertised on their data).
I have to imagine this is a problem shared by so many companies.
(It turns out my electric meter, though analog, blasts out its reading on RF every 10 seconds, unencrypted. I got that via my RTL-SDR receiver :) )
Happiness for me is about exploring the problem within constraints and the satisfaction of building the solution. Brittleness is often of less concern than the fun factor.
And some kinds of brittleness can be managed/solved, which adds to the fun.
https://www.unix-ag.uni-kl.de/~auerswal/ssocr/
https://github.com/tesseract-ocr/tesseract
https://community.home-assistant.io/t/ocr-on-camera-image-fo...
https://www.google.com/search?q=home+assistant+ocr+integrati...
https://www.google.com/search?q=esphome+ocr+sensor
https://hackaday.com/2021/02/07/an-esp-will-read-your-meter-...
...start digging around and you'll likely find something. HA has integrations which support writing to InfluxDB (local for sure, and you can probably configure it for a remote InfluxDB).
You're looking at 1x Raspberry Pi, 1x USB webcam, 1x "power management / humidity management / waterproof electrical box" to stuff it into, and then either YOLO and DIY to shoot the readings over to your InfluxDB, or set up Home Assistant and "attach" your frankenbox as some sort of "sensor" or "integration" which spits out metrics and yada yada...
The time cost is so low that you should give it a gander. You'll be surprised how fast you can do it. Grabbing a frame every minute should suffice.
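A rough sketch of the capture-and-push loop, assuming pytesseract for the digits and InfluxDB's v2 line-protocol endpoint (you'd likely still need to crop/deskew the dial region for Tesseract to behave):

```
import time
import cv2
import pytesseract
import requests

INFLUX_URL = "http://localhost:8086/api/v2/write?org=home&bucket=utilities&precision=s"
TOKEN = "<influxdb-token>"

cap = cv2.VideoCapture(0)  # the USB webcam pointed at the meter

while True:
    ok, frame = cap.read()
    if ok:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Restrict Tesseract to digits; a real setup would also crop to the dial region.
        text = pytesseract.image_to_string(
            gray, config="--psm 7 -c tessedit_char_whitelist=0123456789")
        digits = "".join(ch for ch in text if ch.isdigit())
        if digits:
            line = f"meter reading={int(digits)} {int(time.time())}"
            requests.post(INFLUX_URL, data=line,
                          headers={"Authorization": f"Token {TOKEN}"})
    time.sleep(60)  # one sample a minute is plenty
```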
Edit: it looks like they also added a vowel mark not present in the input on the line immediately after.
Edit2: here's a picture of what I'm talking about, the before/after: https://ibb.co/v6xcPMHv
https://mistral.ai/fr/news/mistral-ocr
to
https://mistral.ai/news/mistral-ocr
The article is the same, but the site navigation is in English instead of French.
Unless it's a silent statement, of course. =)
[0] https://i.imgur.com/JiX9joY.jpeg
[1] https://chat.mistral.ai/chat/8df2c9b9-ee72-414b-81c3-843ce74...
Mistral OCR:
- 72.2% accuracy
- $1/1000 pages
- 5.42s / page
Which is a pretty far cry from the 95% accuracy they were advertising from their private benchmark. The biggest thing I noticed is how it skips anything it classifies as an image/figure. So charts, infographics, some tables, etc. all get lifted out and returned as [image](image_002), compared to the other VLMs, which are able to interpret those images into a text representation.
https://github.com/getomni-ai/benchmark
The 95% from their benchmark: "we evaluate them on our internal “text-only” test-set containing various publication papers, and PDFs from the web; below:"
Text only.
I realize it’s not a common business case; I came across it while testing how well LLMs can solve simple games. On a side note, if you bypass OCR and give models a text layout of a board, standard LLMs cannot solve Scrabble boards, but the thinking models usually can.
I'll give it a try, but I'm not holding my breath. I'm a huge AI Enthusiast and I've yet to be impressed with anything they've put out.
Q: Do LLMs specialise in "document level" recognition based on headings, paragraphs, columns tables etc? Ie: ignore words and characters for now and attempt to recognise a known document format.
A: Not most LLMs, but those with multimodal / vision capability could (e.g. DeepSeek Vision, ChatGPT-4). There are specialized models for this kind of work, like Tesseract and LayoutLM.
Q: How did OCR work "back in the day" before we had these LLMs? Are any of these methods useful now?
A: They used pattern recognition, feature extraction, rules, and templates. Newer ML-based OCR used SVMs to isolate individual characters and HMMs to predict the next character or word. Today's multimodal models process images and words together, can handle context better than the older methods, and can recognise whole words or phrases instead of having to read each character perfectly. This is why they can produce better results, but with hallucinations.
Q: Can LLMs rate their own confidence in each section, maybe outputting text with annotations that say "only 10% certain of this word", and pass the surrounding block through more filters, different LLMs, different methods to try to improve that confidence?
A: Short answer, "no". But you can try to estimate with post processing.
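One common post-processing trick is cross-model (or cross-pass) agreement: run the same block through two or three OCR systems and flag it for review wherever they disagree. A toy sketch:

```
from difflib import SequenceMatcher

def needs_review(outputs: list[str], threshold: float = 0.9) -> bool:
    """True if any pair of OCR outputs for the same block disagrees enough
    that a human (or another pass) should look at it."""
    for i in range(len(outputs)):
        for j in range(i + 1, len(outputs)):
            if SequenceMatcher(None, outputs[i], outputs[j]).ratio() < threshold:
                return True
    return False

# e.g. needs_review([tesseract_text, vlm_text, mistral_text])
```

It's a proxy rather than a true confidence score, but it's cheap and catches a lot of the hallucination cases people describe elsewhere in this thread.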
Or am I super naive, and all of those methods are already used by the big commercial OCR services like Textract etc?
What about rare glyphs in different languages using handwriting from previous centuries?
I've been dealing with OCR issues and evaluating different approaches for the past 5+ years at the national library I work at.
The usual consensus is that the widely used open-source Tesseract is subpar compared to commercial models.
That might be so without fine-tuning. However, you can perform supplemental training and build your own Tesseract models that outperform the base ones.
Case study of Kant's letters from the 18th century:
About 6 months ago, I tested OpenAI's approach to OCR on some old 18th-century letters that needed digitizing.
The results were rather good (90+% accuracy) with the usual hallucination here and there.
What was funny was that OpenAI was using base Tesseract to generate the segmentation and initial OCR.
The actual OCRed content before the last inference step was rather horrid, because the Tesseract model that OpenAI was using was not appropriate for the particular image.
When I took OpenAI off the first step and moved to my own Tesseract models, I gained significantly in "raw" OCR accuracy at the character level.
Then I performed normal LLM inference at the last step.
What was a bit shocking: my actual gains for the task (humanly readable text for general use) were not particularly significant.
That is, LLMs are fantastic at "untangling" a complete mess of tokens into something humanly readable.
For example:
P!3goattie -> prerogative (that is, given that the surrounding text is similarly garbled)
A typical OCR pipeline would be to pass the doc through a character-level OCR system, then correct errors with a statistical model like an LLM. An LLM can help correct “crodit card” to “credit card”, but it cannot correct names or numbers. It’s really bad if it replaces a 7 with a 2.
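One way to let the LLM clean up spelling while guaranteeing it cannot touch the numbers is to mask digit runs before the correction pass and restore them afterwards. A sketch of the idea (not any particular product's pipeline):

```
import re

NUM_RE = re.compile(r"\d[\d.,/-]*")

def protect_numbers(text: str):
    """Swap every digit run for a placeholder the LLM is told to leave untouched."""
    numbers: list[str] = []
    def stash(match: re.Match) -> str:
        numbers.append(match.group())
        return f"[[NUM{len(numbers) - 1}]]"
    return NUM_RE.sub(stash, text), numbers

def restore_numbers(text: str, numbers: list[str]) -> str:
    for i, n in enumerate(numbers):
        text = text.replace(f"[[NUM{i}]]", n)
    return text

masked, nums = protect_numbers("Pay $1,742.50 by 7/2 to your crodit card")
# masked == "Pay $[[NUM0]] by [[NUM1]] to your crodit card"
# ...run the LLM spelling pass on `masked`, telling it to leave [[NUM*]] alone...
corrected = restore_numbers(masked.replace("crodit", "credit"), nums)
```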
Has anyone tried this on specialized domains like medical or legal documents? The benchmarks are promising, but OCR has always faced challenges with domain-specific terminology and formatting.
Also interesting to see the pricing model ($1/1000 pages) in a landscape where many expected this functionality to eventually be bundled into base LLM offerings. This feels like a trend where previously encapsulated capabilities are being unbundled into specialized APIs with separate pricing.
I wonder if this is the beginning of the componentization of AI infrastructure - breaking monolithic models into specialized services that each do one thing extremely well.
From our last benchmark run, some of these numbers from Mistral seem a little bit optimistic. Side by side of a few models:
model  | omni | mistral
gemini | 86%  | 89%
azure  | 85%  | 89%
gpt-4o | 75%  | 89%
google | 68%  | 83%
Currently adding the Mistral API and we'll get results out today!
[1] https://github.com/getomni-ai/benchmark
[2] https://huggingface.co/datasets/getomni-ai/ocr-benchmark
We have millions and millions of pages of documents, and an off-by-1% error compounds with the AI's own error, which compounds with the documentation itself being incorrect at times, which leads it all to be not production-ready (and indeed the project has never been released), not even close.
We simply cannot afford to give our customers incorrect information.
We have set up a back-office app: when users have questions, it sends them to our workers along with the response given by our AI application, and the person can review it and ideally correct the OCR output.
Honestly, after a year of working with it, it feels like AI right now can only be useful when supervised all the time (such as when coding). Otherwise I just find LLMs still too unreliable beyond basic bogus tasks.
If nobody supervised building all the time during the process, every house would be a pile of rubbish. And even when you do, stuff still creeps in and has to be redone, often more than once.
It would almost be easier to switch everyone to a common format and spell out important entities (names, numbers) multiple times similar to how cheques do.
The utility of the system really depends on the makeup of that last 5%. If problematic documents are consistently predictable, it’s possible to do a second pass with humans. But if they’re random, then you have to do every doc with humans and it doesn’t save you any time.
I will still check it out, but given the performance I already have for my specific use case with my current system, my upfront expectation is that it probably will not make it to production.
I'm sure there are other applications for which this could be a true enabler.
I am also biased to using as little SaaS as possible. I prefer services on-prem and under my control where possible.
I do use GPT-4o for now as, again, for my use case, it significantly outperformed other local solutions I tried.
IMO there's still a large gap for businesses in going from raw OCR outputs —> document processing deployed in prod for mission-critical use cases.
e.g. you still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort.
But for RAG and other use cases where the error tolerance is higher, I do think these OCR models will get good enough to just solve that part of the problem.
Disclaimer: I started an LLM doc processing company to help companies solve problems in this space (https://extend.app/)
I’ll try it on a whole bunch of scientific papers ASAP. Quite excited about this.
Rent and Reserve NVIDIA A100 GPU 80GB - Pricing Starts from $1.35/hour
I just don't know if in 1 hour and with an A100 I can process more than 1,000 pages. I'm guessing yes.
There are about 47 characters on average in a sentence. So does this mean it gets around 2 or 3 mistakes per sentence?
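(Back-of-the-envelope, if that accuracy figure is per character: 47 × 0.05 ≈ 2.4, so yes, roughly 2-3 character errors per sentence on average - though in practice errors tend to cluster on hard regions like tables and handwriting rather than spreading out evenly.)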
Instead of one massive model trying to do everything, you'd have specialized models for OCR, code generation, image understanding, etc. Then a "router LLM" would direct queries to the appropriate specialized model and synthesize responses.
The efficiency gains could be substantial - why run a 1T parameter model when your query just needs a lightweight OCR specialist? You could dynamically load only what you need.
The challenge would be in the communication protocol between models and managing the complexity. We'd need something like a "prompt bus" for inter-model communication with standardized inputs/outputs.
Has anyone here started building infrastructure for this kind of model orchestration yet? This feels like it could be the Kubernetes moment for AI systems.
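A toy sketch of the routing layer, just to make the idea concrete (the model names and the one-word classification prompt are placeholders, not a real product):

```
from openai import OpenAI

client = OpenAI()

SPECIALISTS = {
    "ocr": "mistral-ocr-latest",   # placeholder routing table -- whatever specialists you deploy
    "code": "some-code-model",
    "general": "gpt-4o-mini",
}

def route(query: str) -> str:
    """Ask a small, cheap model which specialist should handle the request."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Answer with one word -- ocr, code, or general -- for this request:\n" + query,
        }],
    )
    label = resp.choices[0].message.content.strip().lower()
    return SPECIALISTS.get(label, SPECIALISTS["general"])
```

The hard part is the protocol around it: schemas for passing intermediate results between specialists, and deciding who pays the latency cost of the extra hop.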
It's only a matter of time before "browsing" means navigating HTTP sites via LLM prompts. Although I think it is critical that LLM input should NOT be restricted to verbal cues. Not everyone is an extrovert who longs to hear the sound of their own voice. A lot of human communication is non-verbal.
Once we get over the privacy implications (and I do believe this can only be done by worldwide legislative efforts), I can imagine looking at a "website" or video, and my expressions, mannerisms and gestures will be considered prompts.
At least that is what I imagine the tech would evolve into in 5+ years.
Same goes for "navigating HTTP sites via LLM prompts". Most LLMs have web search integration, and the "Deep Research" variants do more complex navigation.
Video chat is there partially, as well. It doesn't really pay much attention to gestures & expressions, but I'd put the "earliest possible" threshold for that a good chunk closer than 5 years.
But kidding aside - I'm not sure people want this being supported by web standards. We could be a huge step closer to that future had we decided to actually take RDF/Dublin Core/Microdata seriously. (LLMs perform a lot better with well-annotated data)
The unanimous verdict across web publishers was "looks like a lot of work, let's not". That is, ultimately, why we need to jump through all the OCR hoops. Not only did the world not annotate the data, it then proceeded to remove as many traces of machine readability as possible.
So, the likely gating factor is probably not Apple & Safari & "HTML6" (shudder!)
If I venture my best bet on what's preventing polished integration: it's really hard to do via foundational models only, and the number of people who want deep & well-informed conversations via a polished app badly enough to pay for one is low enough that it's not the hot VC space. (Yet?)
Crystal ball: Some OSS project will probably get within spitting distance of something really useful, but also probably flub the UX. Somebody else will take up these ideas while it's hot and polish it in a startup. So, 18-36 months for an integrated experience from here?
It isn't the tech that's the problem but the people that will abuse it.
It is specifically how you describe using the tech that provokes a feeling of revulsion to me.
I am saying that this type of system, that deprives the user of problem solving, is itself a problem. A detriment to the very essence of human intelligence.
If you are expecting problems to be solved for you, you are not learning, you're just consuming content.
Seems like https://aiscreenshot.app might fit the bill.
ChatGPT just inferred that I wanted the actual full names of the items (e.g. "flour" instead of "our").
Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.
The notebook link? An ACL'd doc
The examples don't even include a small text-to-markdown sample.
The before/after slider is cute, but useless - SxS is a much better way to compare.
Trying it in "Le Chat" requires a login.
It's like an example of "how can we implement maximum loss across our entire funnel". (I have no doubt the underlying tech does well, but... damn, why do you make it so hard to actually see it, Mistral?)
If anybody tried it and has shareable examples - can you post a link? Also, anybody tried it with handwriting yet?
CNNs trained specifically for OCR can run in real time on compute as small as a mobile device.
https://www.soundslice.com/sheet-music-scanner/
Definitely doesn't suck.
Does it do handwritten notes and annotations? What about meta information like highlighting? I am also curious whether LLMs will get better with more access to information, if it can be effectively extracted from PDFs.
Edit: answered in another post: https://huggingface.co/spaces/echo840/ocrbench-leaderboard
If I remember right, Gemini actually was the closest as far as accuracy of the parts where it "behaved", but it'd start to go off the rails and reword things at the end of larger paragraphs. Maybe if the image was broken up into smaller chunks. In comparison, Mistral for the most part (besides on one particular line for some reason) sticks to the same number of words, but gets a lot wrong on the specifics.
Jokes aside, PDFs really serve a good purpose, but getting data out of them is usually really hard. They should have something like an embedded Markdown version with a JSON structure describing the layout, so that machines can easily digest the data they contain.
https://www.adobe.com/uk/acrobat/resources/document-files/pd...
For example, if you print a word doc to PDF, you get the raw text in PDF form, not an image of the text.
Why jokes aside? Markdown/HTML is better suited for the web than PDF.
All OCR tools that I have tried have failed. Granted, I would get much better results if I used OpenCV to detect the label, rotate/correct it, normalize contrast, etc.
But... I have tried the then-new vision model from OpenAI and it did the trick so well it wasn't feasible to consider anything else at that point.
I checked all the S/Ns afterwards for correctness via a third-party API - and all of them were correct. Sure, sometimes I had to check variants with 0/o and i/l/1 substitutions, but I believe these kinds of mistakes are non-issues.
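That 0/o and i/l/1 check is easy to automate: generate every plausible reading over the ambiguous positions and run each through the validation API. A small sketch (the validation call is a stand-in for whatever third-party check you use):

```
from itertools import product

AMBIGUOUS = {"0": "0O", "O": "O0", "1": "1lI", "l": "l1I", "I": "I1l", "i": "i1l"}

def serial_variants(serial: str, limit: int = 64):
    """Yield plausible readings of an OCR'd serial number, swapping look-alike characters."""
    options = [AMBIGUOUS.get(ch, ch) for ch in serial]
    for n, combo in enumerate(product(*options)):
        if n >= limit:  # keep the combinatorics in check
            break
        yield "".join(combo)

# for candidate in serial_variants("SN-10O23"):
#     if validate_with_vendor_api(candidate):  # hypothetical third-party check
#         ...
```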
I'm hoping that something like this will be able to handle 3000-page Japanese car workshop manuals. Because traditional OCR really struggles with it. It has tables, graphics, text in graphics, the whole shebang.
I signed up for the API and cobbled something together from their tutorial (https://docs.mistral.ai/capabilities/document/) -- why can't they give the full script instead of little bits?
Tried uploading a TIFF; they rejected it. Tried uploading a JPG; they rejected it (even though they supposedly support images?). Tried resaving as PDF. It took that, but the output was just bad. Then tried ChatGPT on the original .tiff (not using the API), and it got it perfectly. Honestly I could barely make out the handwriting with my own eyes, but now that I see ChatGPT's version I think it's right.
The first couple of sections are for PDFs and you need to skip all that (search for "And Image files...") to find the image extraction portion. Basically it needs ImageURLChunk instead of DocumentURLChunk.
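For anyone else stuck on this, the image path looks roughly like the following - pieced together from the linked docs and the response object pasted below, so treat the exact names as approximate rather than authoritative:

```
import base64
from mistralai import Mistral

client = Mistral(api_key="<key>")

with open("page.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

# The image_url form (ImageURLChunk) instead of document_url (DocumentURLChunk),
# with a data: URL for local files.
resp = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "image_url", "image_url": f"data:image/jpeg;base64,{b64}"},
)
print(resp.pages[0].markdown)
```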
Because of that I'm stuck with crappy vision on Ollama (thanks to AMD's crappy ROCm support for vLLM).
The best products will be defined by everything "non-AI", like UX, performance and reliability at scale, and human-in-the loop feedback for domain experts.
Lots of big companies don't like change. The existing document processing companies will just silently start using this sort of service to up their game, and keep their existing relationships.
The challenge isn't so much the OCR part, but just the length. After one page the LLMs get "lazy" and just skip bits or stop entirely.
And page by page isn't trivial as header rows are repeated or missing etc.
So far my experience has definitely been that the last 2% of the content still takes the most time to accurately extract for large messy documents, and LLMs still don't seem to have a one-shot solve for that. Maybe this is it?
1. I was initially thinking this was a VLM parsing model until I saw it can extract images. Now I assume it is a pipeline of an image-extraction step and a VLM, with their results combined to give the final output.
2. In that case, benchmarking the pipeline result against an end-to-end VLM such as Gemini 2.0 Flash might not be an apples-to-apples comparison.
So bad that I think I need to enable the OCR function somehow, but couldn't find it.
A high level diagram w/ links to files: https://eraser.io/git-diagrammer?diagramId=uttKbhgCgmbmLp8OF...
Specific flow of an OCR request: https://eraser.io/git-diagrammer?diagramId=CX46d1Jy5Gsg3QDzP...
(Disclaimer - uses a tool I've been working on)
If you are working with PDF, I would suggest a hybrid process.
It is feasible to extract information with 100% accuracy from PDFs that were generated using the mappable acrofields approach. In many domains, you have a fixed set of forms you need to process and this can be leveraged to build a custom tool for extracting the data.
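A minimal sketch of that path with pypdf, assuming the form fields really are mapped (which is the whole premise):

```
from pypdf import PdfReader

reader = PdfReader("filled_form.pdf")
fields = reader.get_fields() or {}

# Each mapped AcroForm field comes back with its name and the filled-in value ("/V").
for name, field in fields.items():
    print(name, "=", field.get("/V"))
```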
Only if the PDFs are unknown or were created by way of a cellphone camera, multifunction office device, etc should you need to reach for OCR.
The moment you need to use this kind of technology you are in a completely different regime of what the business will (should) tolerate.
It's always safer to run OCR on every file. Sometimes you'll have a "clean" pdf that has a screenshot of an Excel table. Or a scanned image that has already been OCR'd by a lower-quality tool (like the built-in Adobe OCR). And if you rely on this you're going to get pretty unpredictable results.
It's way easier (and more standardized) to run OCR on every file, rather than trying to guess at the contents based on the metadata.
This is a common scenario at many banks. You can expect nearly perfect metadata for anything pushed into their document storage system within the last decade.
But we work with banks on our side, and one of the most common scenarios is customers uploading financials/bills/statements from 1000's of different providers. In which case it's impossible to know every format in advance.
I love mistral and what they do. I got really excited about this, but a little disappointed after my first few tests.
I tried a complex table that we use as a first test of any new model, and Mistral OCR decided the entire table should just be extracted as an 'image' and returned this markdown:
```  ```
I'll keep testing, but so far, very disappointing :(
The document I tried is the entire reason we created Doctly to begin with. We needed an OCR tool for regulatory documents we use, and nothing could really give us the right data.
Doctly uses a judge: it OCRs a document with multiple LLMs and decides which output to pick. It will continue to re-run the page until the judge scores above a certain threshold.
I would have loved to add this into the judge list, but might have to skip it.
> Mistral OCR capabilities are free to try on le Chat
but when asked, Le Chat responds:
> can you do ocr?
> I don't have the capability to perform Optical Character Recognition (OCR) directly. However, if you have an image with text that you need to extract, you can describe the text or provide details, and I can help you with any information or analysis related to that text. If you need OCR functionality, you might need to use a specialized tool or service designed for that purpose.
Edit: Tried anyway by attaching an image; it said it could do OCR and then output... completely random text that had absolutely nothing to do with the text in the image!... Concerning.
Tried again with a higher-definition image; it output only the first twenty words or so of the page.
Did you try using the API?
https://docs.mistral.ai/capabilities/document/
I used base64 encoding of the image of the pdf page. The output was an object that has the markdown, and coordinates for the images:
[OCRPageObject(index=0, markdown='', images=[OCRImageObject(id='img-0.jpeg', top_left_x=140, top_left_y=65, bottom_right_x=2136, bottom_right_y=1635, image_base64=None)], dimensions=OCRPageDimensions(dpi=200, height=1778, width=2300))] model='mistral-ocr-2503-completion' usage_info=OCRUsageInfo(pages_processed=1, doc_size_bytes=634209)
Feels like something is missing in the docs, or the API itself.
```  ```
I'm assuming this is partially because your use case is targeting RAG under various assumptions, but also partially because multimodal models aren't near what I would need to be successful with?
The difference with this is that it took the entire page as an image tag (it's just a table of text in my document), rather than being more selective.
I do like that they give you coordinates for the images though, we need to do something like that.
Give the actual tool a try. Would love to get your feedback for that use case. It gives you 100 free credits initially but if you email me (ali@doctly.ai), I can give you an extra 500 (goes for anyone else here also)
In our current setup, Gemini wins most often. We enter multiple generations from each model into the 'tournament'; sometimes one generation from Gemini is at the top while another from the same model is at the bottom of the same tournament.
I have a lot of "This document filed and registered in the county of ______ on ______ of _____ 2023" sort of thing.
Give it a try, no credit cards needed to try it. If you email me (ali@doctly.ai) i can give you extra free credits for testing.
Now to figure out how many millions of pages I have.
Doctly runs a tournament-style judge. It will run multiple generations across LLMs and pick the best one, outperforming a single generation from a single model.
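In general terms (not necessarily Doctly's exact implementation), a tournament like that reduces to pairwise comparisons scored by a judge model:

```
import itertools
from collections import Counter
from typing import Callable

def tournament(candidates: dict[str, str], judge: Callable[[str, str], str]) -> str:
    """candidates maps a generation id (model + attempt) to its markdown output;
    `judge` is one more LLM call that looks at the page plus two transcriptions
    and returns "A" or "B"."""
    wins = Counter({cid: 0 for cid in candidates})
    for (id_a, out_a), (id_b, out_b) in itertools.combinations(candidates.items(), 2):
        winner = id_a if judge(out_a, out_b) == "A" else id_b
        wins[winner] += 1
    return max(wins, key=wins.get)  # keep the generation that won the most head-to-heads
```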
Headers and footers are a real pain with RAG applications, as they are not required, yet most OCR or PDF parsers will return them, and there is extra work to do to remove them.
[0] https://github.com/orasik/parsevision/blob/main/example/Mult...
They test it against a bunch of different Multimodal LLMs, so why not their own?
I don't really see the purpose of the OCR form factor, when you have multimodal LLMs. Unless it's significantly cheaper.
Table 1 is referred to in section `2 Architectural details` but before `2.1 Multimodal Decoder`. In the generated markdown though it is below the latter section, as if it was in/part of that section.
Of course I am nitpicking here but just the first thing I noticed.
Not one mention of the company they have partnered with, Cerebras AI, which is the reason they have fast inference [0].
Literally no-one here is talking about them and they are about to IPO.
I'm sensing another bitter lesson coming, where domain optimized AI will hold a short term advantage but will be outdated quickly as the frontier model advances.
If I upload a small PDF to you are you able to convert it to markdown?
LeChat said yes and away we went.
So I am wondering if this is more capable. Will try definitely, but maybe someone can chime in.
The Hebrew output had no correspondence to the text whatsoever (in context, there was an English translation, and the Hebrew produced was a back-translation of that).
Their benchmark results are impressive, don't get me wrong. But I'm a little disappointed. I often read multilingual document scans in the humanities. Multilingual (and esp. bidi) OCR is challenging, and I'm always looking for a better solution for a side-project I'm working on (fixpdfs.com).
Also, I thought OCR implied that you could get bounding boxes for text (and reconstruct a text layer on a scan, for example). Am I wrong, or is this term just overloaded, now?
Disclaimer: I'm the founder.
There are a few annoying issues, but overall I am very happy with it.
Actually, my main remaining technical issue is conversion to standard Markdown for use in a data processing pipeline that has issues with the Mathpix dialect. Ideally I'd do it on a computer that is airgapped for security reasons, but I haven't found a very good way of doing that because the Python library wanted to check my API key.
A problem I have, and that is not really Mathpix's fault, is that I don't really know how to store the figure images to keep them with the text in a convenient way. I haven't found a very satisfying strategy.
Anyway, keep up the good work!
For a VLM, my understanding is that OCR corresponds to a sub-field of questions, of the type 'read exactly what's written in this document'.
(I asked Mistral if their OCR system was vulnerable to this and they said "should be robust, but curious to see if you find any fun examples" - https://twitter.com/simonw/status/1897713755741368434 and https://twitter.com/sophiamyang/status/1897719199595720722 )
I see OCR much like phonemes in speech, once you have end to end systems, they become latent constructs from the past.
And that is actually good, more code going into models instead.
I can't exactly tell if the "Mistral 7B" image is an example of this exact scenario.
https://readcoop.eu/model/the-swedish-lion-i/
https://www.transkribus.org/success-story/creating-the-swedi...
https://huggingface.co/Riksarkivet
They have also published a fairly large volume of OCR'ed texts (IIRC birth/death notices from church records) using this model online. As a beginner genealogist it's been fun to follow.
> Preserving historical and cultural heritage: Organizations and nonprofits that are custodians of heritage have been using Mistral OCR to digitize historical documents and artifacts, ensuring their preservation and making them accessible to a broader audience.
Someone coming here and saying "it does not work with my old German handwriting" doesn't say much.
Some day, when there is enough internal state and training data that they can recognize tables, images, and text, we will get a GPT-3-like moment that will make regular OCR obsolete.
But that day is very far off and everyone who I've talked with and consulted over using VLMs in their pipeline is better served doing something else specific to their use case.
An example I've used before: https://imgur.com/a/w2XKObh
The reply:
---
To extract the text contained within each box while ignoring the text in sub-boxes, we can follow these steps:
1. Identify the outermost box and extract its text.
2. Move to the next outermost box and extract its text, ignoring any text within its sub-boxes.
3. Continue this process for all boxes.
Let's apply this to the image provided:
1. The outermost box contains the text: "This should be second".
2. The next outermost box (ignoring the sub-box within it) contains the text: "First".
3. The next box contains the text: "And also this".
4. The final box contains the text: "The quick brown fox".
So, the extracted text from each box, ignoring sub-boxes, is:
1. "This should be second" 2. "First" 3. "And also this" 4. "The quick brown fox"
---
As you can plainly see it is _wildly_ wrong and gives you no way to try and recover from those errors.
Not anymore.
The same is true if you were a solicitor/lawyer.
A good model can recognize that the text is written top to bottom and then right to left and perform OCR in that direction. Apple's Live Text can do that, though it makes plenty of mistakes otherwise. Mistral is far from that.
I had to reread that a few times. I assume this means 1000pg/$1 but I'm still not sure about it.
Feels like we are almost there.
But I bet that simple ML will lead to better OCR when you are doing anything specialized, such as medical documents, invoices, etc.
For comparison, Azure Document Intelligence is $1.5/1000 pages for general OCR and $30/1000 pages for “custom extraction”.
So far Gemini is the only model I can get decent output from for a particularly hard handwriting task.
I will give this a shot
1. We don't know what the evaluation setup is. It's very possible that the ranking would be different with a bit of prompt engineering.
2. We don't know how large each dataset is (or even how the metrics are calculated/aggregated). The metrics are all reported as XY.ZW%, but it's very possible that the .ZW% -- or even Y.ZW% -- is just noise.[1]
3. We don't know how the datasets were mined or filtered. Mistral could have (even accidentally!) filtered out particular data points that their model struggled with. (E.g., imagine a well-meaning engineer testing a document with Mistral OCR first, finding it doesn't work, and deducing that it's probably bad data and removing it.)
[1] https://medium.com/towards-data-science/digit-significance-i...
2. “Explore the Mistral AI APIs” (https://docs.mistral.ai) links to all APIs except OCR.
3. The docs on the api params refer to document chunking and image chunking but no details on how their chunking works?
So much unnecessary friction smh.
I think the friction here exists outside of Mistral's control.
I don’t see it either. There might be some caching issue.
check out our blog post here! https://www.runpulse.com/blog/beyond-the-hype-real-world-tes...
Not a single example on that page is a Purchase Order, Invoice etc. Not a single example shown is relevant to industry at scale.
Bus travel, meals including dinners and snacks, etc., for which the employee has paper receipts.
Your best bet is to always convert it to an image and OCR it to extract structured data.
Source: We have large EU customers.
... hallucinating during read ...
... hallucinating during understand ...
... hallucinating during forecast ...
... highlighting a hallucination as red flag ...
... missing an actual red flag ...
... consuming water to cool myself...
Phew, being an AI is hard!
Scaled businesses do USE EDI, but they still receive hundreds of thousands of PDF documents a month.
Source: built a SaaS product that handles PDFs for a specific industry.
Receipt scanning is a business that is multiple orders of magnitude more valuable. Mistral at this point is looking for a commercial niche (like how Claude is aiming at software development).
I say this with great anger as someone who works in accessibility and has had PDF as a thorn in my side for 30 years.
PDF was created to solve the problem of being able to render a document the same way on different computers, and it mostly achieved that goal. Editable formats like .doc, .html, .rtf were unreliable -- different software would produce different results, and even if two computers have the exact same version of Microsoft Word, they might render differently because they have different fonts available. PDFs embed the fonts needed for the document, and specify exactly where each character goes, so they're fully self-contained.
After Acrobat Reader became free with version 2 in 1994, everybody with a computer ended up downloading it after running across a PDF they needed to view. As it became more common for people to be able to view PDFs, it became more convenient to produce PDFs when you needed everybody to be able to view your document consistently. Eventually, the ability to produce PDFs became free (with e.g. Office 2007 or Mac OS X's ability to print to PDF), which cemented PDF's popularity.
Notably, the original goals of PDF had nothing to do with being able to copy text out of them -- the goal was simply to produce a perfect reproduction of the document on screen/paper. That wasn't enough of an inconvenience to prevent PDF from becoming popular. (Some people saw the inability for people to easily copy text from them as a benefit -- basically a weak form of text DRM.)
Printed documents do not have any structure beyond the paper and placement of ink on them.