1756 points by littlemerman a year ago | 139 comments
  • vikpa year ago
    I ran a partial benchmark against marker - https://github.com/VikParuchuri/marker .

    Across 375 samples with an LLM as judge, Mistral scores 4.32 and Marker 4.41. Marker can run inference at between 20 and 120 pages per second on an H100.

    You can see the samples here - https://huggingface.co/datasets/datalab-to/marker_comparison... .

    The code for the benchmark is here - https://github.com/VikParuchuri/marker/tree/master/benchmark... . Will run a full benchmark soon.

    Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.

    • lolindera year ago
      > with LLM as a judge

      For anyone else interested, prompt is here [0]. The model used was gemini-2.0-flash-001.

      Benchmarks are hard, and I understand the appeal of having something that seems vaguely deterministic rather than having a human in the loop, but I have a very hard time accepting any LLM-judged benchmarks at face value. This is doubly true when we're talking about something like OCR which, as you say, is a very hard problem for computers of any sort.

      I'm assuming you've given this some thought—how did you arrive at using an LLM to benchmark OCR vs other LLMs? What limitations with your benchmark have you seen/are you aware of?

      [0] https://github.com/VikParuchuri/marker/blob/master/benchmark...

      • themanmarana year ago
        We also ran an OCR benchmark with LLM as judge using structured outputs. You can check out the full methodology on the repo [1]. But the general idea is:

        - Every document has ground truth text, a JSON schema, and the ground truth JSON.

        - Run OCR on each document and pass the result to GPT-4o along with the JSON Schema

        - Compare the predicted JSON against the ground truth JSON for accuracy.

        In our benchmark, passing the ground truth text to GPT-4o gave 99.7%+ accuracy, meaning that whenever GPT-4o was given the correct text, it could extract the structured JSON values ~100% of the time. So if we pass in the OCR text from Mistral and it scores 70%, the inaccuracies are isolated to OCR errors.

        https://github.com/getomni-ai/benchmark
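
        A minimal sketch of that comparison step (the prompt, helper names, and flat-schema assumption here are illustrative, not the actual benchmark code):

          import json
          from openai import OpenAI

          client = OpenAI()

          def extract_json(ocr_text: str, schema: dict) -> dict:
              # Ask GPT-4o to fill the schema from the OCR'd text (structured JSON output).
              resp = client.chat.completions.create(
                  model="gpt-4o",
                  response_format={"type": "json_object"},
                  messages=[
                      {"role": "system", "content": "Extract values matching this JSON schema: " + json.dumps(schema)},
                      {"role": "user", "content": ocr_text},
                  ],
              )
              return json.loads(resp.choices[0].message.content)

          def field_accuracy(predicted: dict, truth: dict) -> float:
              # Fraction of ground-truth fields reproduced exactly (flat schema assumed).
              return sum(predicted.get(k) == v for k, v in truth.items()) / len(truth)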

        • cdolana year ago
          Were you guys able to finish running the benchmark with Mistral and get a 70% score? I missed that.

          Edit - I see it on the Benchmark page now. Woof, low 70% scores in some areas!

          https://getomni.ai/ocr-benchmark

          • themanmarana year ago
            Yup, surprising results! We were able to dig in a bit more. The main culprit is the overzealous "image extraction": if Mistral classifies something as an image, it replaces the entire section with [image](image_002).

            And it happened with a lot of full documents as well. Ex: most receipts got classified as images, and so it didn't extract any text.

            • cdolana year ago
              This sounds like a real problem and hurdle for North American (US/CAN in particular) invoice and receipt processing?
            • lingjiekonga year ago
              Where did you find this: "if Mistral classifies something as an image, it replaces the entire section with [image](image_002)"?
              • culia year ago
                themanmaran works at Omni so presumably they have access to the actual resulting data from this study
        • someothherguyya year ago
          Wouldn't that just bias itself to the shape of the text extracted from the OCR against the shape of the raw text alone? It doesn't seem like it would be a great benchmark for estimating semantic accuracy?
      • vikpa year ago
        Benchmarking is hard for markdown because of the slight formatting variations between different providers. With HTML, you can use something like TEDS (although there are issues with this, too), but with markdown, you don't have a great notion of structure, so you're left with edit distance.

        I think blockwise edit distance is better than full page (find the ground truth blocks, then infer each block separately and compare), but many providers only do well on full pages, which doesn't make it fair.

        There are a few different benchmark types in the marker repo:

          - Heuristic (edit distance by block with an ordering score)
          - LLM judging against a rubric
          - LLM win rate (compare two samples from different providers)
        
        None of these are perfect, but LLM against a rubric has matched visual inspection the best so far.

        I'll continue to iterate on the benchmarks. It may be possible to do a TEDS-like metric for markdown. Training a model on the output and then benchmarking could also be interesting, but it gets away from measuring pure extraction quality (the model benchmarking better is only somewhat correlated with better parse quality). I haven't seen any great benchmarking of markdown quality, even at research labs - it's an open problem.
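
        For reference, a rough sketch of blockwise edit distance scoring (it assumes blocks are already paired with ground truth, which is the hard part; not the actual benchmark code):

          from rapidfuzz.distance import Levenshtein

          def block_score(gt_block: str, pred_block: str) -> float:
              # Normalized similarity in [0, 1]; 1.0 means an exact match.
              return 1.0 - Levenshtein.normalized_distance(gt_block, pred_block)

          def page_score(gt_blocks: list[str], pred_blocks: list[str]) -> float:
              # Average per-block similarity over pre-aligned block pairs.
              scores = [block_score(g, p) for g, p in zip(gt_blocks, pred_blocks)]
              return sum(scores) / len(scores) if scores else 0.0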

      • arthurcollea year ago
        You can use structured outputs, or something like my https://arthurcolle--dynamic-schema.modal.run/

        to extract real data from unstructured text (like that produced by an LLM) to make benchmarks slightly easier if you have a schema

        • cdolana year ago
          What is the project? It just returns a vanilla html page saying:

          Dynamic Schema API API is running. See documentation for available endpoints.

          • arthurcollea year ago
            It's just a FastAPI app with endpoints that I developed and deployed before OpenAI released structured outputs; it uses a custom grammar to enforce a pydantic-like schema for chain-of-thought rollouts / structured data extraction from unstructured text. I also use it for a video transcription knowledge base generation API.

            https://arthurcolle--dynamic-schema.modal.run/docs

    • carlgreenea year ago
      Thank you for your work on Marker. It is the best OCR for PDFs I’ve found. The markdown conversion can get wonky with tables, but it still does better than anything else I’ve tried
      • vikpa year ago
        Thanks for sharing! I'm training some models now that will hopefully improve this and more :)
    • netdevphoenixa year ago
      LLM as a judge?

      Isn't that a potential issue? You are assuming the LLM judge is reliable. What evidence do you have to assure yourself and/or others that it is a reasonable assumption?

      • bforsa year ago
        Perhaps they already evaluated their LLM judge model (with another LLM)
    • ntkrisa year ago
      This is awesome. Have you seen / heard of any benchmarks where the data is actually a structured JSON vs. markdown?
    • ChrisRoba year ago
      Thanks for the tip. Marker solved a table conversion without LLM that docling wasn't able to solve.
    • codeliona year ago
      Really interesting benchmark, thanks for sharing! It's good to see some real-world comparisons. The hallucinations issue is definitely a key concern with LLM-based OCR, and it's important to quantify that risk. Looking forward to seeing the full benchmark results.
    • DeathArrowa year ago
      >Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.

      To fight hallucinations, can't we use more LLMs and pick blocks where the majority of LLMs agree?
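
      A sketch of what per-block voting could look like, assuming the outputs have already been split into aligned blocks (which is itself the hard part):

        from collections import Counter

        def consensus_block(candidates: list[str]) -> tuple[str, float]:
            # Pick the transcription most models agree on, plus the agreement ratio.
            text, votes = Counter(candidates).most_common(1)[0]
            return text, votes / len(candidates)

        # e.g. one block as read by three different models
        text, agreement = consensus_block(["2 juin 1679.", "2 juin 1679.", "2 juim 1679."])
        if agreement < 1.0:
            print(f"flag for review (agreement {agreement:.0%}): {text!r}")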

      • boxeda year ago
        Why wouldn't hallucinations be agreed upon if they have roughly the same training data?
        • TJSomethinga year ago
          A hallucination is often an indication that the model doesn't know something. Then, the internal signal gets dominated by noise from the seeded training weights. Efforts to eliminate hallucinations with a single model have found success by asking the same question in different ways and only taking answers that agree. Logically, you could get more durable results from multiple models on the same prompt.
          • supriyo-biswasa year ago
            We had this article the other day[1] about how multiple LLMs can hallucinate about the same thing, so this is not guaranteed to remove hallucinations that are caused by poor or insufficient training data.

            [1] https://news.ycombinator.com/item?id=43222027

          • boxeda year ago
            I don't see why any of that makes logical sense. These models require such enormous training data that they pretty much MUST use the same training data to a very large degree. The training data itself is what they spit out. So "hallucinations" are just the training data you get out, which is the entire point of the models in the first place. There is no difference between an hallucination and a correct answer from the perspective of the math.
            • neuronica year ago
              Isn't it just statistical word-pattern prediction based on training data? These models likely don't "know" anything anyway and cannot verify "truth" and facts. Reasoning attempts seem to me basically just like looping until the model finds a self-satisfying equilibrium state with different output.

              In that way, LLMs are more human than, say, a database or a book containing agreed-upon factual information which can be directly queried on demand.

              Imagine if there was just ONE human with human limitations on the entire planet who was taught everything for a long time - how reliable do you think they are with information retrieval? Even highly trained individuals (e.g. professors) can get stuff wrong on their specific topics at times. But this is not what we expect and demand from computers.

    • stavrosa year ago
      I like the licensing options! Hopefully they make enough money to fund development.
  • bambaxa year ago
    It's not bad! But it still hallucinates. Here's an example of an (admittedly difficult) image:

    https://i.imgur.com/jcwW5AG.jpeg

    For the blocks in the center, it outputs:

    > Claude, duc de Saint-Simon, pair et chevalier des ordres, gouverneur de Blaye, Senlis, etc., né le 16 août 1607 , 3 mai 1693 ; ép. 1○, le 26 septembre 1644, Diane - Henriette de Budos de Portes, morte le 2 décembre 1670; 2○, le 17 octobre 1672, Charlotte de l'Aubespine, morte le 6 octobre 1725.

    This is perfect! But then the next one:

    > Louis, commandeur de Malte, Louis de Fay Laurent bre 1644, Diane - Henriette de Budos de Portes, de Cressonsac. du Chastelet, mortilhomme aux gardes, 2 juin 1679.

    This is really bad because

    1/ a portion of the text of the previous block is repeated

    2/ a portion of the next block is imported here where it shouldn't be ("Cressonsac"), and of the rightmost block ("Chastelet")

    3/ but worst of all, a whole word is invented, "mortilhomme" that appears nowhere in the original. (The word doesn't exist in French so in that case it would be easier to spot; but the risk is when words are invented, that do exist and "feel right" in the context.)

    (Correct text for the second block should be:

    > Louis, commandeur de Malte, capitaine aux gardes, 2 juin 1679.)

    • layer8a year ago
      > This is perfect!

      Just a nit, but I wouldn’t call it perfect when using U+25CB ○ WHITE CIRCLE instead of what should be U+00BA º MASCULINE ORDINAL INDICATOR, or alternatively a superscript “o”. These are https://fr.wikipedia.org/wiki/Adverbe_ordinal#Premiers_adver....

      There’s also extra spaces after the “1607” and around the hyphen in “Diane-Henriette”.

      Lastly, U+2019 instead of U+0027 would be more appropriate for the apostrophe, all the more since in the image it looks like the former and not like the latter.

      • MatthiasPortzela year ago
        Slightly unrelated, but I once used Apple’s built-in OCR feature LiveText to copy a short string out of an image. It appeared to work, but I later realized it had copied “M” as U+041C (Cyrillic Capital Letter Em), causing a regex to fail to match. OCR producing visually identical but different characters is only good enough until it’s not.
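
        One cheap guard against that class of error is to scan OCR output for characters outside the scripts you expect; a sketch:

          import unicodedata

          def unexpected_chars(text: str, allowed_scripts=("LATIN",)) -> list[str]:
              # Flag letters whose Unicode name doesn't mention an allowed script.
              flagged = []
              for ch in text:
                  if ch.isalpha():
                      name = unicodedata.name(ch, "")
                      if not any(script in name for script in allowed_scripts):
                          flagged.append(f"{ch!r} is {name}")
              return flagged

          # Catches the Cyrillic Em masquerading as a Latin M:
          print(unexpected_chars("Мodel"))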
      • jorvia year ago
        > Just a nit, but I wouldn’t call it perfect when using U+25CB ○ WHITE CIRCLE instead of what should be U+00BA º MASCULINE ORDINAL INDICATOR, or alternatively a superscript “o”

        Or degree symbol. Although it should be able to figure out which to use according to the context.

      • TeMPOraLa year ago
        This is "reasoning model" stuff even for humans :).
        • layer8a year ago
          There is OCR software that analyses which language is used, and then applies heuristics for the recognized language to steer the character recognition in terms of character sequence likelihoods and punctuation rules.

          I don’t think you need a reasoning model for that, just better training; although conversely a reasoning model should hopefully notice the errors — though LLM tokenization might still throw a wrench into that.

      • raffraffraffa year ago
        It feels like, after the OCR step there should be language and subject matter detection, with a final sweep with a spelling / grammar checker that has the right "dictionary" selected. (That, right there, is my naivety on the subject, but I would have thought that the type of problem you're describing isn't OCR but classical spelling and grammar checking?)
        • layer8a year ago
          It’s OCR because the wrong characters are being recognized. This is not about fixing spelling or punctuation mistakes present in the source image, it’s that errors are being introduced, due to a lack of accuracy of this OCR with regard to punctuation and typography. The punctuation errors are not different in principle from the case of the OCR producing a misspelled word that wasn’t misspelled in the image being OCRed.

          A subsequent cleanup pass that fixes grammar/spelling errors, as you propose, wouldn’t be appropriate when the goal is to faithfully reproduce the original text.

          And specifically for the “white circle” character, it would be difficult to correctly infer the original ordinal markers after the fact. I myself could only do so by inspecting the original image, i.e. by having my brain redo the OCR.

          • raffraffraffa year ago
            > A subsequent cleanup pass that fixes grammar/spelling errors, as you propose, wouldn’t be appropriate when the goal is to faithfully reproduce the original text

            I suppose that depends on why it's wrong. Did the model accurately read a real typo in the image or did it incorrectly decipher a character? If a spelling & grammar pass fixes the latter, isn't it valid?

        • pbhjpbhja year ago
          Not unrelated - OneNote 'copy text from image' has started producing lots of incorrect OCR results, but they're all non-words.

          For example, from a clear image of a printed page (in a standard font), it will give me 'cornprising' instead of 'comprising'; 'niatter' instead of 'matter'. Excepting the spell-check underline they'd be hard to spot as with relatively tight kerning all the errors look like the originals.

          I'm surprised as 1) I've not had these sorts of errors before, 2) they're not words, and words must be heavily weighted for in the OCR engine (I'd have thought).

    • bambaxa year ago
      Another test with a text in English, which is maybe more fair (although Mistral is a French company ;-). This image is from Parliamentary debates of the parliament of New Zealand in 1854-55:

      https://i.imgur.com/1uVAWx9.png

      Here's the output of the first paragraph, with mistakes in brackets:

      > drafts would be laid on the table, and a long discussion would ensue; whereas a Committee would be able to frame a document which, with perhaps a few verbal emundations [emendations], would be adopted; the time of the House would thus be saved, and its business expected [expedited]. With regard to the question of the comparative advantages of The-day [Tuesday]* and Friday, he should vote for the amendment, on the principle that the wishes of members from a distance should be considered on all sensations [occasions] where a principle would not be compromised or the convenience of the House interfered with. He hoped the honourable member for the Town of Christchurch would adopt the suggestion he (Mr. Forssith [Forsaith]) had thrown out and said [add] to his motion the names of a Committee.*

      Some mistakes are minor (emundations/emendations or Forssith/Forsaith), but others are very bad, because they are unpredictable and don't correspond to any pattern, and therefore can be very hard to spot: sensations instead of occasions, or expected in lieu of expedited... That last one really changes the meaning of the sentence.

    • spudlyoa year ago
      I want to rejoice that OCR is now a "solved" problem, but I feel like hallucinations are just as problematic as the kind of stuff I have to put up with from tesseract -- both require careful manual proofreading for an acceptable degree of confidence. I guess I'll have to try it and see for myself just how much better these solutions are for my public domain archive.org Latin language reader & textbook projects.
      • qingcharlesa year ago
        It depends on your use-case. For mine, I'm mining millions of scanned PDF pages to get approximate short summaries of long documents. The occasional hallucination won't damage the project. I realize I'm an outlier, and I would obviously prefer a solution that was as accurate as possible.
      • eMPee584a year ago
        possibly doing both & diffing the output to spot contested bits?
        • spudlyoa year ago
          That’s my current idea: use two different OCR models and diff the results to spot-check for errors. At these prices, why not?
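
          Something like this works as a first pass (a sketch; real documents need normalization and smarter alignment first):

            import difflib

            def contested_spans(ocr_a: str, ocr_b: str) -> list[tuple[str, str]]:
                # Word-level diff: return the regions where the two OCR outputs disagree.
                a_words, b_words = ocr_a.split(), ocr_b.split()
                matcher = difflib.SequenceMatcher(None, a_words, b_words)
                return [(" ".join(a_words[i1:i2]), " ".join(b_words[j1:j2]))
                        for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]

            # Toy example: the second model misreads one word.
            print(contested_spans("capitaine aux gardes, 2 juin 1679",
                                  "mortilhomme aux gardes, 2 juin 1679"))
            # -> [('capitaine', 'mortilhomme')]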
    • thomasfromcdnjsa year ago
      Does anyone know the correlation between our abilities to parse PDF's and the quality of our LLM's training datasets?

      If a lot of scientific papers have been pdf's and hitherto had bad conversions to text/tokens, can we expect to see major gains in our training and therefore better outputs?

    • rossanta year ago
      Your example doesn't seem that difficult to me.
    • samstavea year ago
      [flagged]
    • Kokichia year ago
      All it ever does is hallucinate
  • owenpalmera year ago
    This is incredibly exciting. I've been pondering/experimenting on a hobby project that makes reading papers and textbooks easier and more effective. Unfortunately the OCR and figure extraction technology just wasn't there yet. This is a game changer.

    Specifically, this allows you to associate figure references with the actual figure, which would allow me to build a UI that solves the annoying problem of looking for a referenced figure on another page, which breaks up the flow of reading.

    It also allows a clean conversion to HTML, so you can add cool functionality like clicking on unfamiliar words for definitions, or inserting LLM generated checkpoint questions to verify understanding. I would like to see if I can automatically integrate Andy Matuschak's Orbit[0] SRS into any PDF.

    Lots of potential here.

    [0] https://docs.withorbit.com/

    • NalNezumia year ago
      >a UI that solves the annoying problem of looking for a referenced figure on another page, which breaks up the flow of reading.

      A tangent, but this exact issue is something I was frustrated with for a long time when using PDF readers to read science papers. Then I found Sioyek, which pops up a small window when you hover over links (references, equations, and figures), and it solved it.

      Granted, the PDF file must be in the right format, so OCR could make this experience better. Just saying that the UI component of that already exists.

      https://sioyek.info/

      • PerryStylea year ago
        Zotero's PDF viewer also does this now. Being able to annotate PDFs and having a reference manager has been a life saver.
      • owenpalmera year ago
        Thanks for the link! Good to know someone is working on something similar.
    • generalizationsa year ago
      Wait does this deal with images?
      • ezfea year ago
        The output includes images from the input. You can see that on one of the examples where a logo is cropped out of the source and included in the result.
  • Asraelitea year ago
    I never thought I'd see the day where technology finally advanced far enough that we can edit a PDF.
    • randomNumber7a year ago
      I never thought driving a car is harder than editing a pdf.
      • pzoa year ago
        It's not about which is harder but about what error rate you can tolerate. Here, 99% accuracy is enough for many applications. If you have a 99% chance per trip of not crashing during self-driving, then you are very likely to be dead within a year.

        For cars we need at least 99.99% accuracy, and that's very hard.
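
        A quick back-of-envelope for why per-trip error compounds so badly (assuming one trip per day and independent trips):

          p_safe_trip = 0.99
          p_safe_year = p_safe_trip ** 365
          print(p_safe_year)  # ~0.026, i.e. roughly a 97% chance of at least one crash in a year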

        • rtsila year ago
          I doubt most people have 99% accuracy. The threshold of tolerance for error is just much lower for any self-driving system (and with good reason, because we're not familiar with them yet).
          • KeplerBoya year ago
            How do you define 99% accuracy?

            I guess something like success rate per trip (or per mile) would be a more reasonable metric. Most people have a success rate far higher than 99% for average trips.

            Most people who commute daily are probably doing something like 1,000 car rides a year and have minor accidents every few years. A 99% success rate would mean monthly accidents.

        • lynx97a year ago
          [dead]
    • toephu2a year ago
      I've been able to edit PDFs (95%+ of them) accurately for the past 10 years...
    • Apofisa year ago
      Foxit PDF exists...
  • raunakchowdhuria year ago
    We ran some benchmarks comparing against Gemini Flash 2.0. You can find the full writeup here: https://reducto.ai/blog/lvm-ocr-accuracy-mistral-gemini

    A high-level summary is that while this is an impressive model, it underperforms even current SOTA VLMs on document parsing, and it has a tendency to hallucinate OCR text and table structure, and to drop content.

    • shrisukhania year ago
      Anecdotally, we also found Gemini Flash to be better.
    • hackernewdsa year ago
      meanwhile, you're comparing it to the output of almost a trillion dollar company
      • stanna year ago
        The tagline boasts that it is "introducing the world’s best document understanding API". So, holding them to their marketing seems fair
        • neuronica year ago
          Isn't anyone who releases anything putting "the world's best blablabla" on their page nowadays? I've become entirely blind to it.
          • dwedgea year ago
            If they put it, and it's subpar, I write off the product.
      • HaZeusta year ago
        ... And? We're judging it for the merits of the technology it purports to be, not the pockets of the people that bankroll them. Probably not fair - sure, but when I pick my OCR, I want to pick SOTA. These comparisons and announcements help me find those.
      • raunakchowdhuria year ago
        comparisons to more outputs coming soon!
  • kbyatnala year ago
    We're approaching the point where OCR becomes "solved" — very exciting! Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.

    However IMO, there's still a large gap for businesses in going from raw OCR outputs —> document processing deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

    You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. But the future is on the horizon!

    Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.app/)

    • dml2135a year ago
      One problem I’ve encountered at my small startup in evaluating OCR technologies is precisely convincing stakeholders that the “human-in-the-loop” part is both unavoidable, and ultimately beneficial.

      PMs want to hear that an OCR solution will be fully automated out-of-the-box. My gut says that anything offering that is snake-oil, and I try to convey that the OCR solution they want is possible, but if you are unwilling to pay the tuning cost, it’s going to flop out of the gate. At that point they lose interest and move on to other priorities.

      • kbyatnala year ago
        Yup, definitely, and this is exactly why I built my startup. I've heard this a bunch across the startups & large enterprises that we work with. 100% automation is an impossible target, because even humans are not 100% perfect. So how can we expect LLMs to be?

        But that doesn't mean you have to abandon the effort. You can still definitely achieve production-grade accuracy! It just requires having the right tooling in place, which reduces the upfront tuning cost. We typically see folks get there on the order of days or 1-2 weeks (it doesn't necessarily need to take months).

      • golergkaa year ago
        It really depends on their fault tolerance. I think there are a ton of useful applications where OCR would be 99.9%, 99%, or even 98% reliable. A skillful product manager can keep these limitations in mind and work around them.
      • jocodaa year ago
        ... unavoidable "human in the loop" - depends imo.

        From the comments here, it certainly seems that for general OCR it's not up to snuff yet. Luckily, I don't have great ambitions.

        I can see this working for me with just a little careful upfront preprocessing, now that I know where it falls over. It casually skips portions of the document, and misses certain lines consistently. Knowing that, I can do a bit of massaging, feed it what I know it likes, and then reassemble.

        I found in testing that it failed consistently at certain parts, but where it worked, it worked extremely well in contrast to other methods/services that I've been using.

    • risyachkaa year ago
      >> Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.

      -OR- they can just use these APIs, and considering that they have a client base - which would prefer not to rewrite integrations to get the same result - they can get rid of most of their code base, replace it with an LLM API, increase margins by 90%, and enjoy the good life.

      • esafaka year ago
        They're going to become commoditized unless they add value elsewhere. Good news for customers.
        • TeMPOraLa year ago
          They are (or at least could easily be) adding value in form of SLA - charging money for giving guarantees on accuracy. This is both better for customer, who gets concrete guarantees and someone to shift liability to, and for the vendor, that can focus on creating techniques and systems for getting that extra % of reliability out of the LLM OCR process.

          All of the above are things companies - particularly larger ones - are happy to pay for, because OCR is just a cog in the machine, and this makes it more reliable and predictable.

          On top of the above, there are auxiliary value-adds such a vendor could provide - such as, being fully compliant with every EU directive and regulation that's in power, or about to be. There's plenty of those, they overlap, and no one wants to deal with it if they can outsource it to someone who already figured it out.

          (And, again, will take the blame for fuckups. Being a liability sink is always a huge value-add, in any industry.)

    • techwizrda year ago
      The challenge I have is how to get bounding boxes for the OCR, for things like redaction/de-identification.
      • dontlikeyoueitha year ago
        AWS Textract works pretty well for this and is much cheaper than running LLMs.
        • daemonologista year ago
          Textract is more expensive than this (for your first 1M pages per month at least) and significantly more than something like Gemini Flash. I agree it works pretty well though - definitely better than any of the open source pure OCR solutions I've tried.
      • kbyatnala year ago
        Yeah, that's a fun challenge — what we've seen work well is a system that forces the LLM to generate citations for all extracted data, maps those back to the original OCR content, and then generates bounding boxes that way. There are tons of edge cases for sure that we've built a suite of heuristics for over time, but overall it works really well.
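
        A stripped-down sketch of that mapping step (word boxes come from whatever OCR engine you use; the exact-match window here stands in for the fuzzy matching and heuristics a real system needs):

          from dataclasses import dataclass

          @dataclass
          class Word:
              text: str
              box: tuple[float, float, float, float]  # x0, y0, x1, y1

          def locate(citation: str, words: list[Word]):
              # Slide a window over the OCR words; return the union box of the first exact match.
              cite = citation.lower().split()
              for i in range(len(words) - len(cite) + 1):
                  window = words[i:i + len(cite)]
                  if [w.text.lower() for w in window] == cite:
                      xs0, ys0, xs1, ys1 = zip(*(w.box for w in window))
                      return min(xs0), min(ys0), max(xs1), max(ys1)
              return None  # fall back to fuzzy matching in practice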
      • yfontanaa year ago
        I'm working on a project that uses PaddleOCR to get bounding boxes. It's far from perfect, but it's open source and good enough for our requirements. And it can mostly handle a 150 MB single-page PDF (don't ask) without completely keeling over.
    • einpokluma year ago
      An LLM with billions of parameters for extracting text from a PDF (which isn't even a rasterized image) really does not "solve OCR".
    • nextworddeva year ago
      Your customer includes Checkr? Impressive. Are they referencable?
      • wahnfriedena year ago
        btw - what 'dark patterns' does portkey contain?
  • mvaca year ago
    Great progress, but unfortunately, for our use case (converting medical textbooks from PDF to MD), the results are not as good as those by MinerU/PDF-Extract-Kit [1].

    Also, the Colab link in the article is broken; I found a functional one [2] in the docs.

    [1] https://github.com/opendatalab/MinerU [2] https://colab.research.google.com/github/mistralai/cookbook/...

    • owenpalmera year ago
      I've been searching relentlessly for something like this! I wonder why it's been so hard to find... is it the Chinese?

      In any case, thanks for sharing.

    • thelittleonea year ago
      Have you had a chance to compare results from MinerU with an LLM such as Gemini 2.0 or Anthropic's native PDF tool?
      • mvaca year ago
        Yes, I have. The problem with using just an LLM is that while it reads and understands text, it cannot reproduce it accurately. Additionally, the textbooks I've mentioned have many diagrams and illustrations in them (e.g. books on anatomy or biochemistry). I don't really care about extracting text from them; I just need them extracted as images alongside the text, and no LLM does that.
  • shekhargulatia year ago
    Mistral OCR made multiple mistakes in extracting this [1] document. It is a two-page-long PDF in Arabic from the Saudi Central Bank. The following errors were observed:

    - Referenced Vision 2030 as Vision 2.0.
    - Failed to extract the table; instead, it hallucinated and extracted the text in a different format.
    - Failed to extract the number and date of the circular.

    I tested the same document with ChatGPT, Claude, Grok, and Gemini. Only Claude 3.7 extracted the complete document, while all others failed badly. You can read my analysis here [2].

    1. https://rulebook.sama.gov.sa/sites/default/files/en_net_file... 2. https://shekhargulati.com/2025/03/05/claude-3-7-sonnet-is-go...

  • vessenesa year ago
    Dang. Super fast and significantly more accurate than google, Claude and others.

    Pricing: $1 per 1,000 pages, or per 2,000 pages if "batched". I'm not sure what batching means in this case: multiple PDFs? Why not split them to halve the cost?

    Anyway this looks great at pdf to markdown.

    • sophiebitsa year ago
      Batched often means a higher latency option (minutes/hours instead of seconds), which providers can schedule more efficiently on their GPUs.
    • abirajaa year ago
      Batching likely means the response is not real-time. You set up a batch job and they send you the results later.
      • ozima year ago
        If only the business people I work with would understand that even a 100 GB transfer over the network is not going to return results immediately ;)
      • vessenesa year ago
        That makes sense. Idle time is nearly free after all.
    • kapitalxa year ago
      From my testing so far, it seems it's super fast and responded synchronously. But it decided that the entire page is an image and returned `![img-0.jpeg](img-0.jpeg)` with coordinates in the metadata for the image, which is the entire page.

      Our tool, doctly.ai, is much slower and async, but much more accurate, and it gets you the content itself as markdown.

      • raluseka year ago
        I thought we stopped -ly company names ~8 years ago?
        • kapitalxa year ago
          Haha for sure. Naming isn't just the hardest problem in computer science, it's always hard. But at some point you just have to pick something and move forward.
        • yieldcrva year ago
          if you talk to people gen-x and older, you still need .com domains

          for all those people that aren't just clicking on a link on their social media feed, chat group, or targeted ad

        • DonHopkinsa year ago
          But doctr.ai was taken.
    • Tostinoa year ago
      Usually (With OpenAI, I haven't checked Mistral yet) it means an async api rather than a sync api.

      e.g. you submit multiple requests (pdfs) in one call, and get back an id for the batch. You then can check on the status of that batch and get the results for everything when done.

      It lets them use their available hardware to its full capacity much better.
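
      With OpenAI it looks roughly like this (a sketch from memory; check the current batch API docs, and Mistral's equivalent may differ):

        from openai import OpenAI

        client = OpenAI()

        # batch_input.jsonl: one /v1/chat/completions request per line
        batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
        batch = client.batches.create(
            input_file_id=batch_file.id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
        )

        # ...poll later...
        batch = client.batches.retrieve(batch.id)
        if batch.status == "completed":
            results = client.files.content(batch.output_file_id)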

    • odiroota year ago
      May I ask as a layperson, how would you go about using this to OCR multiple hundreds of pages? I tried the chat but it pretty much stops after the 2nd page.
      • bekleina year ago
        You can check the example code in the Mistral documentation; you would _only_ have to change the value of the variable `document_url` to the URL of your uploaded PDF... and you need to change `MISTRAL_API_KEY` to the value of your specific key, which you can get from the La Plateforme webpage.

        https://docs.mistral.ai/capabilities/document/#ocr-with-pdf

      • sneaka year ago
        Submit the pages via the API.
        • odiroota year ago
          This worked indeed, although I had to cut my document into smaller chunks; 900 pages at once ended in a timeout.
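
          For anyone else hitting the same timeout: chunking the PDF first is straightforward. A sketch with pypdf, where ocr_chunk() is a placeholder for however you submit one chunk to the OCR API:

            from pypdf import PdfReader, PdfWriter

            def split_pdf(path: str, pages_per_chunk: int = 100) -> list[str]:
                # Write the source PDF out as smaller chunk files and return their paths.
                reader = PdfReader(path)
                paths = []
                for start in range(0, len(reader.pages), pages_per_chunk):
                    writer = PdfWriter()
                    for i in range(start, min(start + pages_per_chunk, len(reader.pages))):
                        writer.add_page(reader.pages[i])
                    chunk_path = f"chunk_{start // pages_per_chunk:03d}.pdf"
                    with open(chunk_path, "wb") as f:
                        writer.write(f)
                    paths.append(chunk_path)
                return paths

            # ocr_chunk(path) would wrap the OCR API call for a single chunk (placeholder).
            markdown = "\n\n".join(ocr_chunk(p) for p in split_pdf("big_document.pdf"))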
    • jacksnipea year ago
      I would assume this is 1 request containing 2k pages vs N requests whose total pages add up to 1000.
  • serjestera year ago
    This is cool! With that said, for anyone looking to use this in RAG, the downside to specialized models instead of general VLMs is that you can't easily tune them to your specific use case. So, for example, we use Gemini to add very specific alt text to images in the extracted Markdown. It's also 2-3x the cost of Gemini Flash - hopefully the increased performance is significant.

    Regardless excited to see more and more competition in the space.

    Wrote an article on it: https://www.sergey.fyi/articles/gemini-flash-2-tips

    • hyuuua year ago
      Gemini Flash is notorious for hallucinating OCR output, so be careful with it. For straightforward, semi-structured, low-page-count (under 5) documents it should perform well, but the more the context window is stretched, the more unreliable the output gets.
  • sbarrea year ago
    6 years ago I was working with a very large enterprise that was struggling to solve this problem, trying to scan millions of arbitrary forms and documents per month to clearly understand key points like account numbers, names and addresses, policy numbers, phone numbers, embedded images or scribbled notes, and also draw relationships between these values on a given form, or even across forms.

    I wasn't there to solve that specific problem but it was connected to what we were doing so it was fascinating to hear that team talk through all the things they'd tried, from brute-force training on templates (didn't scale as they had too many kinds of forms) to every vendor solution under the sun (none worked quite as advertised on their data)..

    I have to imagine this is a problem shared by so many companies.

  • opwieurposiua year ago
    Related, does anyone know of an app that can read gauges from an image and log the number to influx? I have a solar power meter in my crawlspace, it is inconvenient to go down there. I want to point an old phone at it and log it so I can check it easily. The gauge is digital and looks like this:

    https://www.pvh2o.com/solarShed/firstPower.jpg

    • ubergeek42a year ago
      This[1] is something I've come across but haven't had a chance to play with; it's designed for reading non-smart meters and might work for you. I'm not sure if there's any way to run it on an old phone though.

      [1] https://github.com/jomjol/AI-on-the-edge-device

      • jasonjayra year ago
        Wow. I was looking at hooking my water meter into Home Assistant, and was going to investigate just counting an optical pulse (it has a white portion on the gear that is in a certain spot every .1 gal). This looks like the same meter I use, and it's perfect.

        (It turns out my electric meter, though analog, blasts out its reading on RF every 10 seconds, unencrypted. I got that via my RTL-SDR receiver :) )

      • timc3a year ago
        I use this for a water meter. It works quite well as long as you have a good SD card.
    • dehrmanna year ago
      You'll be happier finding a replacement meter that has an interface to monitor it directly or a second meter. An old phone and OCR will be very brittle.
      • haswella year ago
        Not OP, but it sounds like the kind of project I’d undertake.

        Happiness for me is about exploring the problem within constraints and the satisfaction of building the solution. Brittleness is often of less concern than the fun factor.

        And some kinds of brittleness can be managed/solved, which adds to the fun.

        • arcfoura year ago
          I would posit that learning how the device works, and how to integrate with a newer digital monitoring device would be just as interesting and less brittle.
          • haswella year ago
            Possibly! But I’ve recently wanted to dabble with computer vision, so I’d be looking at a project like this as a way to scratch a specific itch. Again, not OP so I don’t know what their priorities are, but just offering one angle for why one might choose a less “optimal” approach.
    • ramses0a year ago
      https://www.home-assistant.io/integrations/seven_segments/

      https://www.unix-ag.uni-kl.de/~auerswal/ssocr/

      https://github.com/tesseract-ocr/tesseract

      https://community.home-assistant.io/t/ocr-on-camera-image-fo...

      https://www.google.com/search?q=home+assistant+ocr+integrati...

      https://www.google.com/search?q=esphome+ocr+sensor

      https://hackaday.com/2021/02/07/an-esp-will-read-your-meter-...

      ...start digging around and you'll likely find something. HA has integrations which can support writing to InfluxDB (local for sure, and you can probably configure it for a remote influxdb).

      You're looking at 1xRaspberry PI, 1xUSB Webcam, 1x"Power Management / humidity management / waterproof electrical box" to stuff it into, and then either YOLO and DIY to shoot over to your influxdb, or set up a Home Assistant and "attach" your frankenbox as some sort of "sensor" or "integration" which spits out metrics and yadayada...

    • renewiltorda year ago
      4o transcribes it perfectly. You can usually root an old Android and write this app in ~2h with LLMs if unfamiliar. The hard part will be maintaining camera lens cleanliness and alignment etc.

      The time cost is so low that you should give it a gander. You'll be surprised how fast you can do it. If you just take screenshots every minute it should suffice.

      • pavla year ago
        What software tools do you use to program the app?
        • renewiltorda year ago
          Since it's at home, you'll have WiFi access, so it's pretty much a rudimentary Kotlin app on Android. You can just grab a photo and ship it to the GPT-4o API, get the response, and then POST it somewhere.
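
          Sketched in Python rather than Kotlin for brevity, the whole loop is roughly this (the prompt, bucket names, and InfluxDB connection details are placeholders):

            import base64, time
            from openai import OpenAI
            from influxdb_client import InfluxDBClient, Point
            from influxdb_client.client.write_api import SYNCHRONOUS

            openai_client = OpenAI()
            influx = InfluxDBClient(url="http://localhost:8086", token="TOKEN", org="home")
            write_api = influx.write_api(write_options=SYNCHRONOUS)

            def read_gauge(image_path: str) -> float:
                # Send the photo to GPT-4o and ask for just the number on the meter.
                b64 = base64.b64encode(open(image_path, "rb").read()).decode()
                resp = openai_client.chat.completions.create(
                    model="gpt-4o",
                    messages=[{"role": "user", "content": [
                        {"type": "text", "text": "Read the power value on this meter. Reply with the number only."},
                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                    ]}],
                )
                return float(resp.choices[0].message.content.strip())

            while True:
                watts = read_gauge("latest_frame.jpg")  # frame captured elsewhere (e.g. cron + webcam)
                write_api.write(bucket="solar", record=Point("meter").field("watts", watts))
                time.sleep(60)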
    • BonoboIOa year ago
      Gemini Free Tier would surely work
  • evmara year ago
    I noticed on the Arabic example they lost a space after the first letter on the third to last line, can any native speakers confirm? (I only know enough Arabic to ask dumb questions like this, curious to learn more.)

    Edit: it looks like they also added a vowel mark not present in the input on the line immediately after.

    Edit2: here's a picture of what I'm talking about, the before/after: https://ibb.co/v6xcPMHv

    • resirosa year ago
      Arabic speaker here. No, it's perfect.
      • evmara year ago
        I am pretty sure it added a kasrah not present in the input on the 2nd to last line. (Not saying it's not super impressive, and also that almost certainly is the right word, but I think that still means not quite "perfect"?)
        • gl-proda year ago
          Yes, it looks like it did add a kasrah to the word ظهري
          • yoda97a year ago
            Yep, and فمِنا too; this is not just OCR, it made some post-processing corrections or "enhancements". That could be good, but it could also be trouble the 1% of the time it makes a mistake in critical documents.
      • gl-proda year ago
        He means the space between the wāw (و) and the word
        • evmara year ago
          I added a pic to the original comment, sorry for not being clear!
    • albatrosstrophya year ago
      And here I thought after reading the headline: finally a reliable Arabic OCR. I've never in my life found one that does the job decently, especially for scanned documents. Or is there something out there I don't know about?
  • lysacea year ago
    Nit: Please change the URL from

    https://mistral.ai/fr/news/mistral-ocr

    to

    https://mistral.ai/news/mistral-ocr

    The article is the same, but the site navigation is in English instead of French.

    Unless it's a silent statement, of course. =)

    • lblumea year ago
      For me, the second page redirects to the first. (And I don't live in France.)
  • porphyraa year ago
    I uploaded a picture of my Chinese mouthwash [0] and it made a ton of mistakes and hallucinated a lot. Very disappointing. For example, it says the usage instruction is to use 80 mL each time, even though the actual instruction on the bottle says to use 5-20 mL each time, three times a day, and to gargle for 1 minute.

    [0] https://i.imgur.com/JiX9joY.jpeg

    [1] https://chat.mistral.ai/chat/8df2c9b9-ee72-414b-81c3-843ce74...

  • ChemSpidera year ago
    "World's best OCR model" - that is quite a statement. Are there any well-known benchmarks for OCR software?
    • themanmarana year ago
      We published this benchmark the other week. We can update it and run with Mistral today!

      https://github.com/getomni-ai/benchmark

      • themanmarana year ago
        Update: Just ran our benchmark on the Mistral model and results are.. surprisingly bad?

        Mistral OCR:

        - 72.2% accuracy

        - $1/1000 pages

        - 5.42s / page

        Which is a pretty far cry from the 95% accuracy they were advertising from their private benchmark. The biggest thing I noticed is how it skips anything it classifies as an image/figure. So charts, infographics, some tables, etc. all get lifted out and returned as [image](image_002). Compare that to the other VLMs, which are able to interpret those images into a text representation.

        https://github.com/getomni-ai/benchmark

        https://huggingface.co/datasets/getomni-ai/ocr-benchmark

        https://getomni.ai/ocr-benchmark

        • Thaxlla year ago
          Do you benchmark the right thing though? It seems to focus a lot on image / charts etc...

          The 95% from their benchmark: "we evaluate them on our internal “text-only” test-set containing various publication papers, and PDFs from the web; below:"

          Text only.

          • themanmarana year ago
            Our goal is to benchmark on real-world data, which is often more complex than plain text. If we have to make the benchmark data easier for the model to perform better, it's not an honest assessment of reality.
      • kergonatha year ago
        Excellent. I am looking forward to it.
      • cdolana year ago
        Came here to see if you all had run a benchmark on it yet :)
    • WhitneyLanda year ago
      It’s interesting that none of the existing models can decode a Scrabble board screen shot and give an accurate grid of characters.

      I realize it’s not a common business case; I came across it while testing how well LLMs can solve simple games. On a side note, if you bypass OCR and give models a text layout of a board, standard LLMs cannot solve Scrabble boards, but the thinking models usually can.

    • xnxa year ago
      • ChemSpidera year ago
        Interesting. But no mistral on it yet?
    • resource_wastea year ago
      It's Mistral; they are the only homegrown AI company Europe has, so people pretend they are meaningful.

      I'll give it a try, but I'm not holding my breath. I'm a huge AI Enthusiast and I've yet to be impressed with anything they've put out.

  • neoma year ago
    I gave it a bunch of my wife's 18th-century English scans to transcribe; it mostly couldn't do them, and it's been doing this for 15 minutes now. Not sure why, but I find it quite amusing: https://share.zight.com/L1u2jZYl
  • SilentM68a year ago
    I would like to see how it performs with massively warped and skewed scanned text images: basically a scanned image where the text lines are wavy as opposed to straight and horizontal, where the letters are elongated, and where the line widths differ depending on the position on the scanned image. I once had to deal with such a task that somebody gave me; OCR software, Acrobat, and other tools could not decode the mess, so I had to recreate the 30 pages myself, manually. Not a fun thing to do, but that is a real use case.
    • thegabrielea year ago
      I use gemini to solve textual CAPTCHAS with those kind of distortions and more: 60% of the time it works every time.
    • ameliusa year ago
      Are you trying to build a captcha solver?
      • SilentM68a year ago
        No, not a captcha solver. When I worked in education, I was given a 90s paper document that a teacher needed OCRd, but it was completely warped. It was my job to remediate those types of documents for accessibility reasons. I had to scan and OCR it, but the result was garbage. Mind you, I had access to Windows, Linux, and macOS tools, but it was still difficult to do. I had to guess what it said, which was not impossible but was time-consuming and not doable in the time frame I was given, so I had no option but to manually retype all the information into a new document and convert it that way. Document remediation and accessibility should be a good use case for A.I. in education.
    • arcfoura year ago
      Garbage in, garbage out?
      • edude03a year ago
        "Yes" but if a human could do it "AI" should be able to do it too.
  • janalsncma year ago
    The hard ones are things like contracts, leases, and financial documents which 1) don’t have a common format 2) are filled with numbers proper nouns and addresses which it’s really important not to mess up 3) cannot be inferred from context.

    Typical OCR pipeline would be to pass the doc through a character-level OCR system then correct errors with a statistical model like an LLM. An LLM can help correct “crodit card” to “credit card” but it cannot correct names or numbers. It’s really bad if it replaces a 7 with a 2.

  • raffraffraffa year ago
    Forgive my absolute ignorance, I should probably run this through a chat bot before posting ... So I'm updating my post with answers now!

    Q: Do LLMs specialise in "document level" recognition based on headings, paragraphs, columns tables etc? Ie: ignore words and characters for now and attempt to recognise a known document format.

    A: Not most LLMs, but those with multimodal / vision capability could (eg DeepSeek Vision. ChatGPT 4). There are specialized models for this work like Tesseract, LayoutLM.

    Q: How did OCR work "back in the day" before we had these LLMs? Are any of these methods useful now?

    A: They used pattern recognition and feature extraction, rules and templates. Newer ML based OCR used SVM to isolate individual characters and HMM to predict the next character or word. Today's multimodal models process images and words, can handle context better than the older methods, and can recognise whole words or phrases instead of having to read each character perfectly. This is why they can produce better results but with hallucinations.

    Q: Can LLMs rate their own confidence in each section, maybe outputting text with annotations that say "only 10% certain of this word", and pass the surrounding block through more filters, different LLMs, different methods to try to improve that confidence?

    A: Short answer, "no". But you can try to estimate with post processing.

    Or am I super naive, and all of those methods are already used by the big commercial OCR services like Textract etc?
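
    On the confidence question: classical OCR engines do expose per-word confidences even if LLMs don't, so one practical post-processing step is to flag low-confidence words for a second pass or human review. A sketch with Tesseract via pytesseract:

      import pytesseract
      from PIL import Image

      def low_confidence_words(image_path: str, threshold: float = 60.0) -> list[tuple[str, float]]:
          # Tesseract reports a 0-100 confidence per recognized word (-1 for non-words).
          data = pytesseract.image_to_data(Image.open(image_path),
                                           output_type=pytesseract.Output.DICT)
          return [(word, float(conf))
                  for word, conf in zip(data["text"], data["conf"])
                  if word.strip() and 0 <= float(conf) < threshold]

      # Words below the threshold could be re-run through a different model or shown to a human.
      print(low_confidence_words("scan.png"))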

  • sireata year ago
    Intriguing announcement, however the examples on the mistral.ai page seem rather "easy".

    What about rare glyphs in different languages using handwriting from previous centuries?

    I've been dealing with OCR issues and evaluating different approaches for the past 5+ years at the national library where I work.

    The usual consensus is that the widely used open-source Tesseract is subpar compared to commercial models.

    That might be so without fine-tuning. However, one can perform supplemental training and build custom Tesseract models that outperform the base ones.

    Case study: Kant's letters from the 18th century.

    About 6 months ago, I tested OpenAI's approach to OCR on some old 18th-century letters that needed digitizing.

    The results were rather good (90+% accuracy) with the usual hallucination here and there.

    What was funny was that OpenAI was using base Tesseract to generate the segmentation and initial OCR.

    The actual OCRed content before the last inference step was rather horrid, because the Tesseract model OpenAI was using was not appropriate for the particular image.

    When I took OpenAI off the first step and moved to my own Tesseract models, I gained significantly in "raw" OCR accuracy at the character level.

    Then I performed normal LLM inference at the last step.

    What was a bit shocking: My actual gains for the task (humanly readable text for general use) were not particularly significant.

    That is, LLMs are fantastic at "untangling" a complete mess of tokens into something humanly readable.

    For example:

    P!3goattie -> prerogative (that is, given that the surrounding text is similarly garbled)
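
    The two-step pipeline above (custom Tesseract model, then an LLM cleanup pass) is only a few lines once the fine-tuned model is installed. A sketch, with "kant_letters" standing in for a hypothetical custom traineddata file:

      import pytesseract
      from PIL import Image
      from openai import OpenAI

      client = OpenAI()

      def transcribe(image_path: str) -> str:
          # Step 1: raw OCR with a supplementally trained Tesseract model.
          raw = pytesseract.image_to_string(Image.open(image_path), lang="kant_letters")
          # Step 2: LLM cleanup pass that untangles garbled tokens without inventing content.
          resp = client.chat.completions.create(
              model="gpt-4o",
              messages=[
                  {"role": "system", "content": "Correct obvious OCR errors in this 18th-century letter. Do not add or remove content."},
                  {"role": "user", "content": raw},
              ],
          )
          return resp.choices[0].message.content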

  • cxiea year ago
    The new Mistral OCR release looks impressive - 94.89% overall accuracy and significantly better multilingual support than competitors. As someone who's built document processing systems at scale, I'm curious about the real-world implications.

    Has anyone tried this on specialized domains like medical or legal documents? The benchmarks are promising, but OCR has always faced challenges with domain-specific terminology and formatting.

    Also interesting to see the pricing model ($1/1000 pages) in a landscape where many expected this functionality to eventually be bundled into base LLM offerings. This feels like a trend where previously encapsulated capabilities are being unbundled into specialized APIs with separate pricing.

    I wonder if this is the beginning of the componentization of AI infrastructure - breaking monolithic models into specialized services that each do one thing extremely well.

    • themanmarana year ago
      Excited to test this out on our side as well. We recently built an OCR benchmarking framework specifically for VLMs[1][2], so we'll do a test run today.

      From our last benchmark run, some of these numbers from Mistral seem a little bit optimistic. Side by side of a few models:

        model  | omni | mistral
        gemini |  86% |     89%
        azure  |  85% |     89%
        gpt-4o |  75% |     89%
        google |  68% |     83%

      Currently adding the Mistral API and we'll get results out today!

      [1] https://github.com/getomni-ai/benchmark

      [2] https://huggingface.co/datasets/getomni-ai/ocr-benchmark

      • themanmarana year ago
        Update: Just ran our benchmark on the Mistral model and results are.. surprisingly bad?

        Mistral OCR:

        - 72.2% accuracy

        - $1/1000 pages

        - 5.42s / page

        Which is a pretty far cry from the 95% accuracy they were advertising from their private benchmark. The biggest thing I noticed is how it skips anything it classifies as an image/figure. So charts, infographics, some tables, etc. all get lifted out and returned as [image](image_002). Compare that to the other VLMs, which are able to interpret those images into a text representation.

        https://github.com/getomni-ai/benchmark

        https://huggingface.co/datasets/getomni-ai/ocr-benchmark

        https://getomni.ai/ocr-benchmark

      • jaggsa year ago
        By optimistic, do you mean 'tweaked'? :)
    • epolanskia year ago
      At my client (a home-building business), we want to provide an AI that can retrieve relevant information from documentation (documents detailing how to install a solar panel, a shower, etc.), and we've set up an entire system with benchmarks, agents, etc., yet the bottleneck is OCR!

      We have millions and millions of pages of documents, and an off-by-1% error compounds with the AI's own error, which compounds with the documentation itself being incorrect at times, which leaves it all not production ready (and indeed the project has never been released), not even close.

      We simply cannot afford to give our customers incorrect information.

      We have set up a back-office app: when users have questions, it sends them to our workers along with the response given by our AI application, and the person can review it and ideally correct the OCR output.

      Honestly, after a year of working on this, it feels like AI right now can only be useful when supervised all the time (such as when coding). Otherwise I just find LLMs still too unreliable for anything beyond basic throwaway tasks.

      • PeterStuera year ago
        As someone who has had a home built, and nearly all my friends and acquaintances report the same thing, having a 1% error on information in this business would mean not a 10x but a 50x improvement over the current practice in the field.

        If nobody supervised building documents all the time during the process, every house would be a pile of rubbish. And even when you do, stuff still creeps in and has to be redone, often more than once.

    • janalsncma year ago
      I have done OCR on leases. It’s hard. You have to be accurate and they all have bespoke formatting.

      It would almost be easier to switch everyone to a common format and spell out important entities (names, numbers) multiple times similar to how cheques do.

      The utility of the system really depends on the makeup of that last 5%. If problematic documents are consistently predictable, it’s possible to do a second pass with humans. But if they’re random, then you have to do every doc with humans and it doesn’t save you any time.

    • PeterStuera year ago
      I'd love to try it for my domain (regulation), but $1/1000 pages is significantly more expensive than my current local Docling based setup that already does a great job of processing PDF's for my needs.
      • yawnxyza year ago
        I think for regulated fields / high impact fields $1/1000 is well-worth the price; if the accuracy is close to 100% this is way better than using people, who are still error-prone
        • PeterStuera year ago
          It could be very well worth the price, but it still needs to justify the price increase over an already locally running solution that is nearly free in operation.

          I will still check it out, but given the performance I already have for my specific use case with my current system, my upfront expectation is that it probably will not make it to production.

          I'm sure there are other applications for which this could be a true enabler.

          I am also biased to using as little SaaS as possible. I prefer services on-prem and under my control where possible.

          I do use GPT-4o for now as, again, for my use case, it significantly outperformed other local solutions I tried.

    • kbyatnala year ago
      re: real world implications, LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise (especially in domains like medical or legal).

      IMO there's still a large gap for businesses in going from raw OCR outputs —> document processing deployed in prod for mission-critical use cases.

      e.g. you still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort.
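
      For a sense of what that can look like end to end, here's a minimal sketch of the classify -> split -> extract loop with a confidence gate in front of human review. Every function is a hypothetical stub for illustration, not any particular vendor's API:

      ```python
      CONFIDENCE_THRESHOLD = 0.85

      def classify(page_text: str) -> str:
          return "invoice"                     # stub: would call a classifier model

      def split(pages: list[str], doc_type: str) -> list[list[str]]:
          return [pages]                       # stub: would split a multi-document PDF

      def extract(section: list[str], doc_type: str) -> tuple[dict, float]:
          return {"total": "42.00"}, 0.61      # stub: would call an extraction model

      def human_review(section: list[str], draft: dict) -> dict:
          return draft                         # stub: would queue for a reviewer

      def process(pages: list[str]) -> list[dict]:
          doc_type = classify(pages[0])
          results = []
          for section in split(pages, doc_type):
              fields, confidence = extract(section, doc_type)
              if confidence < CONFIDENCE_THRESHOLD:
                  fields = human_review(section, fields)   # human-in-the-loop
              results.append(fields)
          return results

      print(process(["page one text", "page two text"]))
      ```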

      But for RAG and other use cases where the error tolerance is higher, I do think these OCR models will get good enough to just solve that part of the problem.

      Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.app/)

    • kergonatha year ago
      > Has anyone tried this on specialized domains like medical or legal documents?

      I’ll try it on a whole bunch of scientific papers ASAP. Quite excited about this.

    • janis1234a year ago
      $1 for 1000 pages seems high to me. Doing a Google search:

      Rent and Reserve NVIDIA A100 GPU 80GB - Pricing Starts from $1.35/hour

      I just don't know if in 1 hour and with an A100 I can process more than 1000 pages. I'm guessing yes.

      • blackoila year ago
        Is the model open source / open weight? Otherwise the cost is for the model, not the GPU.
    • salynchnewa year ago
      Also interesting to see that parts of the infrastructure used to create frontier models are themselves being monetized.
    • stavrosa year ago
      What do you mean by "free"? Using the OpenAI vision API, for example, for OCR is quite a bit more expensive than $1/1k pages.
    • ameliusa year ago
      > 94.89% overall accuracy

      There are about 47 characters on average in a sentence. So does this mean it gets around 2 or 3 mistakes per sentence?

    • unboxingelfa year ago
      We’ll just stick LLM Gateway LLM in front of all the specialized LLMs. MicroLLMs Architecture.
      • cxiea year ago
        I actually think you're onto something there. The "MicroLLMs Architecture" could mirror how microservices revolutionized web architecture.

        Instead of one massive model trying to do everything, you'd have specialized models for OCR, code generation, image understanding, etc. Then a "router LLM" would direct queries to the appropriate specialized model and synthesize responses.

        The efficiency gains could be substantial - why run a 1T parameter model when your query just needs a lightweight OCR specialist? You could dynamically load only what you need.

        The challenge would be in the communication protocol between models and managing the complexity. We'd need something like a "prompt bus" for inter-model communication with standardized inputs/outputs.

        Has anyone here started building infrastructure for this kind of model orchestration yet? This feels like it could be the Kubernetes moment for AI systems.
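
        To make the routing idea concrete, here's a toy sketch; the registry, the keyword heuristic standing in for a router-LLM call, and the handler are all invented for illustration:

        ```python
        SPECIALISTS = {
            "ocr": "lightweight-ocr-model",
            "code": "code-generation-model",
            "chat": "general-purpose-model",
        }

        def route(query: str) -> str:
            # A real router would be a cheap LLM call returning a label;
            # a keyword heuristic stands in for it here.
            if "scan" in query or "extract text" in query:
                return "ocr"
            if "write a function" in query:
                return "code"
            return "chat"

        def handle(query: str) -> str:
            specialist = SPECIALISTS[route(query)]
            # call_model(specialist, query) would go here in a real system
            return f"[{specialist}] would answer: {query}"

        print(handle("extract text from this scanned receipt"))
        ```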

        • arcfoura year ago
          This is already done with agents. Some agents only have tools and the one model; some agents will orchestrate with other LLMs to handle more advanced use cases. It's a pretty obvious solution when you think about how to get good performance out of a model on a complex task when useful context length is limited: just run multiple models with their own context and give them a supervisor model—just like how humans organize themselves in real life.
        • fnordpigleta year ago
          I’m doing this personally for my own project: essentially building an agent graph that starts with the image output, orients and cleans it, does a first pass with the Tesseract LSTM "best" models to create PDF/hOCR/ALTO, then passes things to other LLMs and models based on their strengths to further refine towards Markdown and LaTeX. My goal is less about RAG database population and more about preserving, in a form that isn't manually typeset, the structure and data and analysis. There seems to be pretty limited tooling out there, since the goal generally seems to be the obviously and immediately commercial one of producing RAG-amenable output that defers the "heavy" side of chart/graphic/tabular reproduction to a future time.
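
          For anyone curious, the first stage of that graph can be as small as the sketch below; it assumes a local tesseract binary plus the pytesseract and Pillow packages, and the later LLM refinement step is only hinted at in a comment.

          ```python
          from PIL import Image
          import pytesseract

          def first_pass(image_path: str) -> bytes:
              """Tesseract LSTM pass that emits hOCR (text + word boxes)."""
              img = Image.open(image_path)
              # "--oem 1" selects the LSTM engine; extension="hocr" returns hOCR XML
              return pytesseract.image_to_pdf_or_hocr(img, extension="hocr", config="--oem 1")

          hocr = first_pass("scanned_page.png")
          with open("scanned_page.hocr", "wb") as f:
              f.write(hocr)
          # A later stage would feed this hOCR plus the original image to an LLM
          # to refine the output towards Markdown/LaTeX.
          ```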
        • unboxingelfa year ago
          Take a look at MCP, Model Context Protocol.
  • notepad0x90a year ago
    I was just watching a science-related video containing math equations. I wondered how soon I'll be able to ask the video player "What am I looking at here, describe the equations" and have it OCR the frames, analyze them, and explain them to me.

    It's only a matter of time before "browsing" means navigating HTTP sites via LLM prompts. Although I think it is critical that LLM input should NOT be restricted to verbal cues. Not everyone is an extrovert that longs to hear the sound of their own voice. A lot of human communication is non-verbal.

    Once we get over the privacy implications (and I do believe this can only be done by worldwide legislative efforts), I can imagine looking at a "website" or video, and my expressions, mannerisms and gestures will be considered prompts.

    At least that is what I imagine the tech would evolve into in 5+ years.

    • groby_ba year ago
      Now? OK, you need to screencap and upload to LLM, but that's well established tech by now. (Where by "well established", I mean at least 9 months old ;)

      Same goes for "navigating HTTP sites via LLM prompts". Most LLMs have web search integration, and the "Deep Research" variants do more complex navigation.

      Video chat is there partially, as well. It doesn't really pay much attention to gestures & expressions, but I'd put the "earliest possible" threshold for that a good chunk closer than 5 years.

      • notepad0x90a year ago
        Yeah, all these things are possible today, but getting them well polished and integrated is another story. Imagine all this being supported by "HTML6" lol. When apple gets around to making this part of safari, then we know it's ready.
        • groby_ba year ago
          That's a great upper-bound estimator ;)

          But kidding aside - I'm not sure people want this being supported by web standards. We could be a huge step closer to that future had we decided to actually take RDF/Dublin Core/Microdata seriously. (LLMs perform a lot better with well-annotated data)

          The unanimous verdict across web publishers was "looks like a lot of work, let's not". That is, ultimately, why we need to jump through all the OCR hoops. Not only did the world not annotate the data, it then proceeded to remove as many traces of machine readability as possible.

          So, the likely gating factor is probably not Apple & Safari & "HTML6" (shudder!)

          If I venture my best guess at what's preventing polished integration: it's really hard to do via foundation models alone, and the number of people who want deep, well-informed conversations badly enough to pay for a polished app that delivers them is low enough that it's not the hot VC space. (Yet?)

          Crystal ball: Some OSS project will probably get within spitting distance of something really useful, but also probably flub the UX. Somebody else will take up these ideas while it's hot and polish it in a startup. So, 18-36 months for an integrated experience from here?

    • devmora year ago
      Good lord, I dearly hope not. That sounds like a coddled hellscape world, something you'd see made fun of in Disney's Wall-E.
      • notepad0x90a year ago
        hence my comment about privacy and need for legislation :)

        It isn't the tech that's the problem but the people that will abuse it.

        • devmora year ago
          While those are concerns, my point was that having everything on the internet navigated to, digested and explained to me sounds unpleasant and overall a drain on my ability to think and reason for myself.

          It is specifically how you describe using the tech that provokes a feeling of revulsion to me.

          • notepad0x90a year ago
            Then I think you misunderstand. The ML system would know when you want things digested for you and when you don't. Right now companies assume you do and force LLM interaction on you, but when properly done the system would know, based on your behavior or explicit prompts, what you want, and provide the service. If you're staring at a paragraph intently and confused, it might start highlighting common phrases or parts of the text/picture that might be hard to grasp, and based on your reaction to that it might start describing things via audio, tooltips, a side pane, etc. In other words, if you don't like how and when you're interacting with the LLM ecosystem, that is an immature and failing ecosystem; in my vision this would be a largely solved problem, like how we interact with keyboards, mice and touchscreens today.
            • devmora year ago
              No, I fully understand.

              I am saying that this type of system, that deprives the user of problem solving, is itself a problem. A detriment to the very essence of human intelligence.

              • notepad0x90a year ago
                I just look at it as allowing the user to focus on problems that aren't already easily solved. Like using a calculator instead of calculating manually on paper.
                • devmora year ago
                  But the scenario you described is one in which you need an equation explained to you. That is exactly the kind of scenario where it's important to do the calculation yourself to understand it.

                  If you are expecting problems to be solved for you, you are not learning, you're just consuming content.

    • abrichra year ago
      > I wondered how soon will I be able to ask the video player "What am I looking at here, describe the equations" and it will OCR the frames, analyze them and explain them to me.

      Seems like https://aiscreenshot.app might fit the bill.

  • groby_ba year ago
    Perusing the website, it's depressing how far behind Mistral is on the basic "how can I make this a compelling hook for customers" work for this page.

    The notebook link? An ACL'd doc

    The examples don't even include a small text-to-markdown sample.

    The before/after slider is cute, but useless - SxS is a much better way to compare.

    Trying it in "Le Chat" requires a login.

    It's like an example of "how can we implement maximum loss across our entire funnel". (I have no doubt the underlying tech does well, but... damn, why do you make it so hard to actually see it, Mistral?)

    If anybody tried it and has shareable examples - can you post a link? Also, anybody tried it with handwriting yet?

  • TriangleEdgea year ago
    One of my hobby projects while in university was to do OCR on book scans. Doing character recognition was solved, but finding the relationship between characters was very difficult. I tried "primitive" neural nets, but edge cases would often break what I built. Super cool to see such an order-of-magnitude improvement here.

    Does it do handwritten notes and annotations? What about meta information like highlighting? I am also curious whether LLMs will get better thanks to having more information to access, if it can be effectively extracted from PDFs.

    • jcuenoda year ago
      * Character recognition on monolingual text in a narrow domain is solved
  • michaelbuckbeea year ago
    I'd mentioned this on HN last month, but I took a picture of a grocery list and then pasted it into ChatGPT to have it written out and it worked flawlessly...until I discovered that I'd messed up the picture when I took it at an angle and had accidentally cut off the first character or two of the bottom half of the list.

    ChatGPT just inferred that I wanted the actual full names of the items (aka "flour" instead of "our").

    Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.

  • s4ia year ago
    I wonder how good it would be to convert sheet music to MusicXML. All the current tools more or less suck with this task, or maybe I’m just ignorant and don’t know what lego bricks to put together.
  • z2a year ago
    Is there a reliable handwriting OCR benchmark out there (updated, not a blog post)? Despite the gains claimed for printed text, I found (anecdotally) that trying to use Mistral OCR on my messy cursive handwriting to be much less accurate than GPT-4o, in the ballpark of 30% wrong vs closer to 5% wrong for GPT-4o.

    Edit: answered in another post: https://huggingface.co/spaces/echo840/ocrbench-leaderboard

  • oystervillea year ago
    Dupe of an hour previous post https://news.ycombinator.com/item?id=43282489
  • qwertoxa year ago
    We developers seem to really dislike PDFs, to a degree that we'll build LLMs and have them translate it into Markdown.

    Jokes aside, PDFs really serve a good purpose, but getting data out of them is usually really hard. They should have something like an embedded Markdown version with a JSON structure describing the layout, so that machines can easily digest the data they contain.

    • jgalt212a year ago
      I think you might be looking for PDF/A.

      https://www.adobe.com/uk/acrobat/resources/document-files/pd...

      For example, if you print a word doc to PDF, you get the raw text in PDF form, not an image of the text.

      • gpvosa year ago
        PDF/A doesn't require preserving the document structure, only that any text is extractable.
    • siva7a year ago
      > We developers seem to really dislike PDFs, to a degree that we'll build LLMs and have them translate it into Markdown.

      Why jokes aside? Markdown/HTML is better suited for the web than PDF.

  • climb_stealtha year ago
    Does this support Japanese? They list a table of language comparisons against other approaches but I can't tell if it is exhaustive.

    I'm hoping that something like this will be able to handle 3000-page Japanese car workshop manuals. Because traditional OCR really struggles with it. It has tables, graphics, text in graphics, the whole shebang.

  • protonboba year ago
    Wow this basically "solves" DRM for books as well as opening up the door for digitizing old texts more accurately.
  • bsnnkva year ago
    Someone working there has good taste to include a Nizar Qabbani poem.
  • andoandoa year ago
    Bit unrelated, but is there anything that can help with really low-resolution text? My neighbor was the victim of a hit-and-run the other day, for example, and I've been trying every tool I can to make out some of the letters/numbers on the plate.

    https://ibb.co/mr8QSYnj

    • zinglersena year ago
      Finding the right subreddit and asking there is probably a better approach if you want to maximize the chances of getting the plate 'decrypted'.
    • rvnxa year ago
      If it’s a video, sharing a few frames can help as well
    • deweya year ago
      To even get started on this you'd also need to share some contextual information like continent, country etc. I'd say.
      • andoandoa year ago
        It's in CA, looks like paper plates, which follow a specific format, and the last two characters seem to be the numbers '64'. Police should be able to search for a temp tag with a partial match and match the make/model. Was curious to see if any software could help though.
    • busymom0a year ago
      There are photo enhancers online. But your picture is way too pixelated to get any useful info from it.
      • tjoffa year ago
        If you know the font in advance (which you often do in these cases) you can do insane reconstructions. Also keep in mind that it doesn't have to be a perfect match, with the help of the color and other facts (such as likely location) about the car you can narrow it down significantly.
      • zellyna year ago
        Maybe if you had multiple frames, and used something very clever?
    • flutasa year ago
      Looks like a paper temp tag. Other than that, I'm not sure much can be had from it.
  • jacoopera year ago
    Pretty cool, would love to use this with paperless, but I just can't bring myself to send a photo of all my documents to a third party, especially legal and sensitive documents, which is what I use Paperless for.

    Because of that I'm stuck with crappy vision models on Ollama (thanks to AMD's crappy ROCm support for vLLM).

  • yoevena year ago
    I ran Mistral AI OCR against JigsawStack OCR, and JigsawStack beat Mistral's model in every category. Full breakdown here: https://jigsawstack.com/blog/mistral-ocr-vs-jigsawstack-vocr
    • 27theoa year ago
      Just a small fyi, as viewed on an iPhone in Safari your tables don’t allow horizontal scrolling, cutting off the right column
  • InvidFlowera year ago
    While it is nice to have more options, it still definitely isn't at a human level yet for hard to read text. Still haven't seen anything that can deal with something like this very well: https://i.imgur.com/n2sBFdJ.jpeg

    If I remember right, Gemini actually was the closest as far as accuracy of the parts where it "behaved", but it'd start to go off the rails and reword things at the end of larger paragraphs. Maybe if the image was broken up into smaller chunks. In comparison, Mistral for the most part (besides on one particular line for some reason) sticks to the same number of words, but gets a lot wrong on the specifics.

  • hubraumhugoa year ago
    It will be interesting to see how all the companies in the document processing space adapt as OCR becomes a commodity.

    The best products will be defined by everything "non-AI", like UX, performance and reliability at scale, and human-in-the loop feedback for domain experts.

    • trollieda year ago
      They will offer integrations into enterprise systems, just like they do today.

      Lots of big companies don't like change. The existing document processing companies will just silently start using this sort of service to up their game, and keep their existing relationships.

    • hyuuua year ago
      I 100% agree with this. I think you can even extend it to AI in general: in the end, IMO, as LLMs become more commoditized, the surface through which the value is delivered will matter more.
  • hdjrudnia year ago
    Still terrible at handwriting.

    I signed up for the API, cobbled together from their tutorial (https://docs.mistral.ai/capabilities/document/) -- why can't they give the full script instead of little bits?

    Tried uploading a TIFF; they rejected it. Tried uploading a JPG; they rejected it (even though they supposedly support images?). Tried resaving as a PDF. It took that, but the output was just bad. Then tried ChatGPT on the original .tiff (not using the API), and it got it perfectly. Honestly I could barely make out the handwriting with my own eyes, but now that I see ChatGPT's version I think it's right.

    • InvidFlowera year ago
      It is confusing, but they have diff calls for pdfs vs images. In their example google colab: https://colab.research.google.com/drive/11NdqWVwC_TtJyKT6cmu...

      The first couple of sections are for pdfs and you need to skip all that (search for "And Image files...") to find the image extraction portion. Basically it needs ImageURLChunk instead of DocumentURLChunk.
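
      Roughly, based on that colab, the image path looks something like the sketch below; treat it as a starting point, since parameter names can differ between SDK versions:

      ```python
      import base64
      from mistralai import Mistral, ImageURLChunk

      client = Mistral(api_key="...")

      # Encode a local image as a data URL and send it to the OCR endpoint
      encoded = base64.b64encode(open("receipt.jpg", "rb").read()).decode()
      resp = client.ocr.process(
          model="mistral-ocr-latest",
          document=ImageURLChunk(image_url=f"data:image/jpeg;base64,{encoded}"),
      )
      print(resp.pages[0].markdown)
      ```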

  • sixhobbitsa year ago
    Nice demos but I wonder how well it does on longer files. I've been experimenting with passing some fairly neat PDFs to various LLMs for data extraction. They're created from Excel exports and some of the data is cut off or badly laid out, but it's all digitally extractable.

    The challenge isn't so much the OCR part, but just the length. After one page the LLMs get "lazy" and just skip bits or stop entirely.

    And page by page isn't trivial as header rows are repeated or missing etc.

    So far my experience has definitely been that the last 2% of the content still takes the most time to accurately extract for large messy documents, and LLMs still don't seem to have a one-shot solve for that. Maybe this is it?

    • hack_mla year ago
      You will have to send one page at a time; most of this work has to be done via RAG. Adding a large context (like a whole PDF) still does not work that well in my experience.
  • dotnetkowa year ago
    Congrats to the Mistral team for launching! A general-purpose OCR model is useful, of course. However, more purpose-built solutions are a must to convert business documents reliably. AI models pre-trained on specific document types perform better and are more accurate. Coming soon from the ABBYY team, we're shipping a new OCR API designed to be consistent, reliable, and hallucination-free. Check it out if you're looking for best-in-class DX: https://digital.abbyy.com/code-extract-automate-your-new-mus...
  • egorfinea year ago
    I had a need to scan serial numbers from Apple's product boxes out of pictures taken by a random person on their phone.

    All OCR tools that I have tried have failed. Granted, I would get much better results if I used OpenCV to detect the label, rotate/correct it, normalize contrast, etc.

    But... I have tried the then-new vision model from OpenAI, and it did the trick so well that it wasn't feasible to consider anything else at that point.

    I have checked all the S/Ns afterwards for correctness via a third-party API, and all of them were correct. Sure, sometimes I had to check versions with 0/o and i/l/1 substitutions, but I believe these kinds of mistakes are non-issues.

  • pqdbra year ago
    I tried with both PDFs and PNGs in Le Chat and the results were the worst I've ever seen when compared to any other model (Claude, ChatGPT, Gemini).

    So bad that I think I need to enable the OCR function somehow, but couldn't find it.

    • troyvita year ago
      It worked perfectly for me with a simple 2-page PDF that contained no graphics or formatting beyond headers and list items. Since it was so small I had the time to proof-read it, and there were no errors. It added some formatting, such as bolding headers in list items and putting backticks around file and function names. I won't complain.
    • computergerta year ago
      I'm experiencing the same. Maybe the sentence "Mistral OCR capabilities are free to try on le Chat." was a hallucination.
  • bob1029a year ago
    > It takes images and PDFs as input

    If you are working with PDF, I would suggest a hybrid process.

    It is feasible to extract information with 100% accuracy from PDFs that were generated using the mappable acrofields approach. In many domains, you have a fixed set of forms you need to process and this can be leveraged to build a custom tool for extracting the data.
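
    As a rough illustration of that "known form" path, pypdf can read AcroForm field values directly with no OCR involved (the field names below are invented):

    ```python
    from pypdf import PdfReader

    reader = PdfReader("known_form.pdf")
    fields = reader.get_fields() or {}

    # Each entry maps a form field name to its filled-in value ("/V")
    record = {name: field.get("/V") for name, field in fields.items()}
    print(record.get("applicant_name"), record.get("loan_amount"))
    ```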

    Only if the PDFs are unknown or were created by way of a cellphone camera, multifunction office device, etc should you need to reach for OCR.

    The moment you need to use this kind of technology you are in a completely different regime of what the business will (should) tolerate.

    • themanmarana year ago
      > Only if the PDFs are unknown or were created by way of a cellphone camera, multifunction office device, etc should you need to reach for OCR.

      It's always safer to OCR on every file. Sometimes you'll have a "clean" pdf that has a screenshot of an Excel table. Or a scanned image that has already been OCR'd by a lower quality tool (like the built in Adobe OCR). And if you rely on this you're going to get pretty unpredictable results.

      It's way easier (and more standardized) to run OCR on every file, rather than trying to guess at the contents based on the metadata.

      • bob1029a year ago
        It's not guessing if the form is known and you can read the information directly.

        This is a common scenario at many banks. You can expect nearly perfect metadata for anything pushed into their document storage system within the last decade.

        • themanmarana year ago
          Oh yeah, if the form is known and standardized, everything is a lot easier.

          But we work with banks on our side, and one of the most common scenarios is customers uploading financials/bills/statements from 1000s of different providers. In which case it's impossible to know every format in advance.

  • kapitalxa year ago
    Co-founder of doctly.ai here (OCR tool)

    I love mistral and what they do. I got really excited about this, but a little disappointed after my first few tests.

    I tried a complex table that we use as a first test of any new model, and Mistral OCR decided the entire table should just be extracted as an 'image' and returned this markdown:

    ``` ![img-0.jpeg](img-0.jpeg) ```

    I'll keep testing, but so far, very disappointing :(

    This document I try is the entire reason we created Doctly to begin with. We needed an OCR tool for regulatory documents we use and nothing could really give us the right data.

    Doctly uses a judge: it OCRs a document with multiple LLMs and decides which output to pick. It will keep re-running the page until the judge scores above a certain threshold.
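
    For anyone wondering what that loop looks like in outline, here's a very rough sketch; ocr_with() and judge_score() are placeholder stubs, not our actual implementation:

    ```python
    import random

    MODELS = ["model-a", "model-b", "model-c"]
    THRESHOLD = 0.9

    def ocr_with(model: str, page: bytes) -> str:
        return f"markdown from {model}"   # stub: would call the model's OCR API

    def judge_score(page: bytes, candidate: str) -> float:
        return random.random()            # stub: an LLM judge would score this

    def best_ocr(page: bytes, max_rounds: int = 3) -> str:
        best, best_score = "", 0.0
        for _ in range(max_rounds):
            for model in MODELS:
                candidate = ocr_with(model, page)
                score = judge_score(page, candidate)
                if score > best_score:
                    best, best_score = candidate, score
            if best_score >= THRESHOLD:   # good enough, stop re-running the page
                break
        return best
    ```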

    I would have loved to add this into the judge list, but might have to skip it.

    • bambaxa year ago
      Where did you test it? At the end of the post they say:

      > Mistral OCR capabilities are free to try on le Chat

      but when asked, Le Chat responds:

      > can you do ocr?

      > I don't have the capability to perform Optical Character Recognition (OCR) directly. However, if you have an image with text that you need to extract, you can describe the text or provide details, and I can help you with any information or analysis related to that text. If you need OCR functionality, you might need to use a specialized tool or service designed for that purpose.

      Edit: Tried anyway by attaching an image; it said it could do OCR and then output... completely random text that had absolutely nothing to do with the text in the image!... Concerning.

      Tried again with a better definition image, output only the first twenty words or so of the page.

      Did you try using the API?

      • kapitalxa year ago
        Yes I used the API. They have examples here:

        https://docs.mistral.ai/capabilities/document/

        I used base64 encoding of the image of the pdf page. The output was an object that has the markdown, and coordinates for the images:

        [OCRPageObject(index=0, markdown='![img-0.jpeg](img-0.jpeg)', images=[OCRImageObject(id='img-0.jpeg', top_left_x=140, top_left_y=65, bottom_right_x=2136, bottom_right_y=1635, image_base64=None)], dimensions=OCRPageDimensions(dpi=200, height=1778, width=2300))] model='mistral-ocr-2503-completion' usage_info=OCRUsageInfo(pages_processed=1, doc_size_bytes=634209)

        • sadcraba year ago
          Any luck with this? I'm trying to process photos of paperwork (.pdf, .png) and got the same results as you.

          Feels like something is missing in the docs, or the API itself.

          https://imgur.com/a/1J9bkml

    • fnordpigleta year ago
      Interestingly I’m currently going through and scanning the hundreds of journal papers my grandfather authored in medicine and thinking through what to do about graphs. I was expecting to do some form of multiphase agent based generation of LaTeX or SVG rather than a verbal summary of the graphs. At least in his generation of authorship his papers clearly explained the graphs already. I was pretty excited to see your post naturally but when I looked at the examples what I saw was, effectively, a more verbose form of

      ``` ![img-0.jpeg](img-0.jpeg) ```

      I’m assuming this is partially because your use case is targeting RAG under various assumptions, but also partially because multimodal models aren't near what I would need to be successful with?

      • kapitalxa year ago
        We need to update the examples on the front page. Currently for things that are considered charts/graphs/figures we convert to a description. For things like logos or images we do an image tag. You can also choose to exclude them.

          The difference with this is that it took the entire page as an image tag (it's just a table of text in my document), rather than being more selective.

        I do like that they give you coordinates for the images though, we need to do something like that.

        Give the actual tool a try. Would love to get your feedback for that use case. It gives you 100 free credits initially but if you email me (ali@doctly.ai), I can give you an extra 500 (goes for anyone else here also)

    • niwtsola year ago
      If you have a judge system, and Mistral performs well on other tests, wouldn't you want to include it so if it scores the highest by your judges ranking it would select the most accurate result? Or are you saying that mistral's image markdown would score higher on your judge score?
      • kapitalxa year ago
        We'll definitely be doing more tests, but the results I got on the complex tests would result in a lower score and might not be worth the extra cost of the judgement itself.

        In our current setup Gemini wins most often. We enter multiple generations from each model into the 'tournament'; sometimes one generation from Gemini can be at the top while another is at the bottom of the same tournament.

    • Grosvenora year ago
      Does doctly do handwritten forms like dates?

      I have a lot of "This document filed and registered in the county of ______ on ______ of _____ 2023" sort of thing.

      • kapitalxa year ago
        We've been getting great results with those as well. But of course there is always some chance of not getting it perfect, especially with different handwriting.

        Give it a try, no credit card needed. If you email me (ali@doctly.ai) I can give you extra free credits for testing.

        • Grosvenora year ago
          Just tried it. Got all the dates correct and even extracted signatures really well.

          Now to figure out how many millions of pages I have.

    • scottydeltaa year ago
      How do you stay competitive with $2/100 pages pricing as compared to mistral and others offering 1000 pages for $1 approx?
      • kapitalxa year ago
        Customers are willing to pay for accuracy compared to existing solutions out there. We started out in need of an accurate solution for a RAG product we were building, but none of the solutions we tried were providing the accuracy we needed.
    • infectoa year ago
      Why pay more for doctly than an AWS Textract?
      • nnurmanova year ago
        I did not try Doctly, but AWS Textract does not support Russian, which is what I need, so the output is completely useless.
      • kapitalxa year ago
        Great question. The language models are definitely beating the old tools. Take a look at Gemini for example.

        Doctly runs a tournament style judge. It will run multiple generations across LLMs and pick the best one. Outperforming single generation and single model.

    • the_mitsuhikoa year ago
      Would love to see the test file.
      • Starlord2048a year ago
        would be glad to see benchmarking results
        • kapitalxa year ago
          This is a good idea. We should publish a benchmark results/comparison.
  • jervanta year ago
    I wonder how it compares to USPS workers at deciphering illegible handwriting.
  • Orasa year ago
    I feel this is created for RAG. I tried a document [0] that I tested with OCR; it got all the table values correctly, but the page's footer was missing.

    Headers and footers are a real pain with RAG applications: they are not required, most OCR or PDF parsers will return them, and there is extra work to do to remove them.

    [0] https://github.com/orasik/parsevision/blob/main/example/Mult...

  • jojogha year ago
    High accuracy is the goal! But the multimodal approach introduces some complexities that can impact real-world performance. We break it down in our review: https://undatas.io/blog/posts/in-depth-review-of-mistral-ocr... As for use cases, it really depends on how well it handles edge cases…
  • yoelhacksa year ago
    I was curious about Mistral so I made a few visualizations.

    A high level diagram w/ links to files: https://eraser.io/git-diagrammer?diagramId=uttKbhgCgmbmLp8OF...

    Specific flow of an OCR request: https://eraser.io/git-diagrammer?diagramId=CX46d1Jy5Gsg3QDzP...

    (Disclaimer - uses a tool I've been working on)

  • lingjiekonga year ago
    Curious whether people have found more details about the architecture of this "mistral-ocr-latest". I have two questions:

    1. I was initially thinking this was a VLM parsing model, until I saw it can extract images. Now I assume it is a pipeline of an image-extraction step and a VLM, whose results are combined to give the final output.

    2. In that case, benchmarking the pipeline result vs an end-to-end VLM such as Gemini 2.0 Flash might not be an apples-to-apples comparison.

  • paweldudaa year ago
    It outperforms the competition significantly AND can extract embedded images from the text. I really like LLMs for OCR more and more. Gemini was already pretty good at it
  • coolspota year ago
    This is $1 per 1000 pages. For comparison, Azure Document Intelligence is $1.5/1000 pages for general OCR and $30/1000 pages for “custom extraction”.
    • 0cf8612b2e1ea year ago
      Given the wide variety of pricing on all of these providers, I keep wondering how the economics work. Do they have fantastic margin on some of these products or is it a matter of subsidizing the costs, hoping to capture the market? Last I heard, OpenAI is still losing money.
  • srinathkrishnaa year ago
    Given the fact that multi-modal LLMs are getting so good at OCR these days, isn't it a shame that we can't do local OCR with high accuracy in the near term?
  • strangescripta year ago
    I think it's interesting that they left out Gemini 2.0 Pro in the benchmarks, which I find to be markedly better than Flash if you don't mind the spend.
  • sureglymopa year ago
    Looks good but in the first hover/slider demo one can see how it could lead to confusion when handling side by side content.

    Table 1 is referred to in section `2 Architectural details` but before `2.1 Multimodal Decoder`. In the generated markdown though it is below the latter section, as if it was in/part of that section.

    Of course I am nitpicking here but just the first thing I noticed.

    • 0cf8612b2e1ea year ago
      Does anything handle dual columns well? Despite being the academic standard, it seemingly throws off every generic tool.
  • peterburkimshera year ago
    Does it work for video subtitles? And in Chinese? I’m looking to transcribe subtitles of live music recordings from ANHOP and KHOP.
  • th0ma5a year ago
    A great question for people wanting to use OCR in business is... Which digits in monetary amounts can you tolerate being incorrect?
  • rvza year ago
    > "Fastest in its category"

    Not one mention of the company they have partnered with, Cerebras AI, which is the reason they have fast inference [0].

    Literally no-one here is talking about them and they are about to IPO.

    [0] https://cerebras.ai/blog/mistral-le-chat

  • soyyoa year ago
    I understand that it's juicier to get information from graphs, figures and so on, as every domain uses those, but I really hope to eventually see these models able to work out music notation. I have tried the best-known apps and all of them fail to capture important details such as guitar performance symbols for bends or legato.
  • robobena year ago
    Le chat doesn’t seem to know about this change despite the blog post stating it. Can anyone explain how to use it in Le Chat?
  • aperriena year ago
    Is this model open source?
    • daemonologista year ago
      No (nor is it open-weights).
  • low_tech_punka year ago
    This might be a contrarian take: the improvement against gpt-4o and gemini-1.5 flash, both of which are general-purpose multi-modal models, seems to be underwhelming.

    I'm sensing another bitter lesson coming, where domain optimized AI will hold a short term advantage but will be outdated quickly as the frontier model advances.

  • submetaa year ago
    Is this able to convert PDF flowcharts into YAML or JSON representations of them? I have been experimenting with Claude 3.5. It has been very good at reading / understanding / converting flow charts into representations.

    So I am wondering if this is more capable. Will try definitely, but maybe someone can chime in.

  • simonwa year ago
    I built a CLI script for feeding PDFs into this API - notes on that and my explorations of Mistral OCR here: https://simonwillison.net/2025/Mar/7/mistral-ocr/
  • gatienboqueta year ago
    I feel like I can't create an agent with their OCR model yet? Is that planned, or is it API-only?
  • constantinuma year ago
    I see a lot of comments on hallucination risk and the accumulation of non-traceable rotten data. If you are curious to try a better non-LLM-based OCR, try LLMWhisperer: https://pg.llmwhisperer.unstract.com/
  • jcuenoda year ago
    Just tested with a multilingual (bidi) English/Hebrew document.

    The Hebrew output had no correspondence to the text whatsoever (in context, there was an English translation, and the Hebrew produced was a back-translation of that).

    Their benchmark results are impressive, don't get me wrong. But I'm a little disappointed. I often read multilingual document scans in the humanities. Multilingual (and esp. bidi) OCR is challenging, and I'm always looking for a better solution for a side-project I'm working on (fixpdfs.com).

    Also, I thought OCR implied that you could get bounding boxes for text (and reconstruct a text layer on a scan, for example). Am I wrong, or is this term just overloaded, now?

    • nicodjimeneza year ago
      You can get bounding boxes from our pdf api at Mathpix.com

      Disclaimer, I’m the founder

      • kergonatha year ago
        Mathpix is ace. That’s the best results I got so far for scientific papers and reports. It understands the layout of complex documents very well, it’s quite impressive. Equations are perfect, figures extraction works well.

        There are a few annoying issues, but overall I am very happy with it.

        • nicodjimeneza year ago
          Thanks for the kind words. What are some of the annoying issues?
          • kergonatha year ago
            I had a billing issue at the beginning. It was resolved very nicely but I try to be careful and I monitor the bill a bit more than I would like.

            Actually my main remaining technical issue is conversion to standard Markdown for use in a data processing pipeline that has issues with the Mathpix dialect. Ideally I'd do it on a computer that is air-gapped for security reasons, but I haven't found a very good way of doing it because the Python library wanted to check my API key.

            A problem I have and that is not really Mathpix’s fault is that I don’t really know how to store the figures pictures to keep them with the text in a convenient way. I haven’t found a very satisfying strategy.

            Anyway, keep up the good work!

  • piloocha year ago
    But what's the need exactly for OCR when you have multimodal LLMs that can read the same info and directly answer any questions about it?

    For a VLM, my understanding is that OCR corresponds to a sub-field of questions, of the type 'read exactly what's written in this document'.

    • simonwa year ago
      The biggest risk of vision LLMs for OCR is that they might accidentally follow instructions in the text that they are meant to be processing.

      (I asked Mistral if their OCR system was vulnerable to this and they said "should be robust, but curious to see if you find any fun examples" - https://twitter.com/simonw/status/1897713755741368434 and https://twitter.com/sophiamyang/status/1897719199595720722 )

      • piloocha year ago
        Fun, but LLMs would follow them post OCR anyways ;)

        I see OCR much like phonemes in speech: once you have end-to-end systems, they become latent constructs from the past.

        And that is actually good, more code going into models instead.

    • troyvita year ago
      Getting PDFs into #$@ Confluence apparently. Just had to do this and Mistral saved me a ton of hassle compared to this: https://community.atlassian.com/forums/Confluence-questions/...
    • daemonologista year ago
      It's useful to have the plain text down the line for operations not involving a language model (e.g. search). Also if you have a bunch of prompts you want to run it's potentially cheaper, although perhaps less accurate, to run the OCR once and save yourself some tokens or even use a smaller model for subsequent prompts.
    • ks2048a year ago
      Tons of uses: Storage (text instead of images), search (user typing in a text box and you want instant retrieval from a dataset), etc. And costs: run on images once - then the rest of your queries will only need to run on text.
  • albertha year ago
    Curious to see how this performs against more real-world usage of someone taking a photo of text (where the text then becomes slightly blurred) and performing OCR on it.

    I can't exactly tell if the "Mistral 7B" image is an example of this exact scenario.

  • thomasahlea year ago
    I'm surprised they didn't benchmark it against Pixtral.

    They test it against a bunch of different Multimodal LLMs, so why not their own?

    I don't really see the purpose of the OCR form factor, when you have multimodal LLMs. Unless it's significantly cheaper.

  • 101008a year ago
    Is this free in LeChat? I uploaded a handwritten text and it stopped after the 4th word.
  • ein0pa year ago
    Could anyone suggest a tool which would take a bunch of PDFs (already OCR-d with Finereader), and replace the OCR overlay on all of them, maintaining the positions? I would like to have more accurate search over my document archive.
  • riffica year ago
    It'd be great if this could be tested against genealogical documents written in cursive like oh most of the documents on microfilm stored by the LDS on familysearch, or eastern european archival projects etc.
  • lokla year ago
    Tried with a few historical handwritten German documents, accuracy was abysmal.
    • lysacea year ago
      Semi-OT (similar language): The national archives in Sweden and Finland published a model for OCRing handwritten Swedish text from the 1600s to the 1800s with what to me seems like a very high level of accuracy given the source material (4% character error rate).

      https://readcoop.eu/model/the-swedish-lion-i/

      https://www.transkribus.org/success-story/creating-the-swedi...

      https://huggingface.co/Riksarkivet

      They have also published a fairly large volume of OCRed texts (IIRC birth/death notices from church records) using this model online. As a beginner genealogist it's been fun to follow.

    • Thaxlla year ago
      HTR (Handwritten Text Recognition) is a completely different space from OCR. What were you expecting exactly?
      • riquitoa year ago
        It fits the "use cases" mentioned in the article

        > Preserving historical and cultural heritage: Organizations and nonprofits that are custodians of heritage have been using Mistral OCR to digitize historical documents and artifacts, ensuring their preservation and making them accessible to a broader audience.

        • Thaxlla year ago
          There is a difference between a historical document and "my doctor's prescription".

          Someone coming here and saying it does not work with their old German handwriting doesn't say much.

          • riquitoa year ago
            You're making a strawman, the parent specifically mentioned "historical handwritten documents"
    • butovchenkoya year ago
      For this task, general models will always perform poorly. My company trains custom gen ai models for document understanding. We recently trained a VLM for the German government to recognize documents written in old German handwriting, and it performed with exceptionally high accuracy.
    • rvnxa year ago
      Probably they are overfitting the benchmarks, since other users also complain of the low accuracy
    • thadta year ago
      Also working with historical handwritten German documents. So far Gemini seems to be the least wrong of the ones I've tried - any recommendations?
      • butovchenkoya year ago
        my recommendation is to train a custom model
    • anothermathbozoa year ago
      Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) are different tasks
  • dwedgea year ago
    Benchmarks look good. I tried this with a PDF that already has accurate text embedded, just with newlines that make pdftotext fail, and it was accurate for the text it found, but missed entire pages.
  • Gnana year ago
    Is there an OCR model with this kind of accuracy that can run on a mobile device? I'm looking for OCR that can detect text with high accuracy in real time, so the option of using cloud OCR is not viable.
  • monkeydusta year ago
    Spent time working on OCR problem many years ago for a mobile app. We found at the time that the preprocessing was so critical to the outcome (quality of image, angle, colour/greyscale)
  • joeevans1000a year ago
    I've found that the stunning OCR results so far were because the models were trained on the example file category. Is that the case here? Or can this recognize various documents?
    • noosphra year ago
      After four years of "This VLM will solve OCR forever!" posts I've firmly put VLMs in the "useless until 100t parameters" category.

      Some day when there is enough internal state and training data that they can recognize tables, images and text we will get a gpt3 like moment that will make regular OCR obsolete.

      But that day is very far off and everyone who I've talked with and consulted over using VLMs in their pipeline is better served doing something else specific to their use case.

      An example I've used before: https://imgur.com/a/w2XKObh

      The reply:

      ---

      To extract the text contained within each box while ignoring the text in sub-boxes, we can follow these steps:

      1. Identify the outermost box and extract its text.
      2. Move to the next outermost box and extract its text, ignoring any text within its sub-boxes.
      3. Continue this process for all boxes.

      Let's apply this to the image provided:

      1. The outermost box contains the text: "This should be second".
      2. The next outermost box (ignoring the sub-box within it) contains the text: "First".
      3. The next box contains the text: "And also this".
      4. The final box contains the text: "The quick brown fox".

      So, the extracted text from each box, ignoring sub-boxes, is:

      1. "This should be second" 2. "First" 3. "And also this" 4. "The quick brown fox"

      ---

      As you can plainly see it is _wildly_ wrong and gives you no way to try and recover from those errors.

  • atemereva year ago
    So, the only thing that stopped AI from learning from all our science and taking over the world was the difficulty of converting PDFs of academic papers to more computer readable formats.

    Not anymore.

  • newfocogia year ago
    They say: "releasing the API mistral-ocr-latest at 1000 pages / $"

    I had to reread that a few times. I assume this means 1000pg/$1 but I'm still not sure about it.

    • dgfla year ago
      Great example of how information is sometimes compartmentalized arbitrarily in the brain: I imagine you have never been confused by sentences such as “I’m running at 10 km/h”.
      • mkla year ago
        Dollar signs go before the number, not after it like units. It needs to be 1000 pages/$1 to make sense, whereas 10km and 10h and 10/h all make sense so 10km/h does. I imagine you would be confused by km/h 10 but not $10.
        • ekwava year ago
          In the EU we put the € symbol after the number so it feels more natural to do that for $ as well.
    • svachaleka year ago
      Yeah you can read it as "pages per dollar" or as a unit "pages/$", it all comes out the same meaning.
    • ameliusa year ago
      Hmm, can it read small print? ;)
    • bredrena year ago
      Ya, presumably it is missing the number `1.00`.
      • groby_ba year ago
        Not really. When you go 60 mph (or km/h) you don't specify the 1.00 for the hours either. pages/$ is the unit, 1000 is the value.
        • bredrena year ago
          But you do for virtually all other cloud pricing pages.
  • polytelya year ago
    I don't need AGI just give me superhuman OCR so we can turn all existing pdfs into text* and cheaply host it.

    Feels like we are almost there.

    *: https://annas-archive.org/blog/critical-window.html

  • deadbabea year ago
    LLM based OCR is a disaster, great potential for hallucinations and no estimate of confidence. Results might seem promising but you’ll always be wondering.
    • utkarshphirkea year ago
      Absolutely right - we tried estimating LLM confidence and the results are not great. Any process that requires reliability will struggle with LLM OCR.

      https://news.ycombinator.com/item?id=43350816

    • menaerusa year ago
      CNN-based OCR also has "hallucinations", and Transformers aren't that much different in that respect. This is a problem solved with domain-specific post-processing.
    • leumona year ago
      Well, already in 2013, OCR systems used in Xerox scanners (turned on by default!) randomly altered numbers, so it's not an issue occurring only in LLMs.
  • shmoogya year ago
    What's the general time for something like this to hit openrouter? I really hate having accounts everywhere when I'm trying to test new things.
  • thiago_fma year ago
    For general use this will be good.

    But I bet that simple ML will lead to better OCRs when you are doing anything specialized, such as, medical documents, invoices etc.

  • kccqzya year ago
    I have an actually hard OCR exercise for an AI model: I take this image of Chinese text on one of the memorial stones on the Washington Monument https://www.nps.gov/articles/american-mission-ningpo-china-2... and ask the model to do OCR. Not a single model I've seen can OCR this correctly. Mistral is especially bad here: it gets stuck in an endless loop of nonsensical hallucinated text. Insofar as Mistral is designed for "preserving historical and cultural heritage", it couldn't do that very well yet.

    A good model can recognize that the text is written top to bottom and then right to left and perform OCR in that direction. Apple's Live Text can do that, though it makes plenty of mistakes otherwise. Mistral is far from that.


  • kinntha year ago
    This looks like a massive win if you were the NHS and had to scan and process old case notes.

    The same is true if you were a solicitor/lawyer.

  • jslezaka year ago
    Has anyone tried it for handwriting?

    So far Gemini is the only model I can get decent output from for a particular hard handwriting task

  • applgo443a year ago
    What's the simple explanation for why these VLM OCRs hallucinate but previous generations of OCR don't?
    • prats226a year ago
      Traditional OCR systems usually have a detection + recognition pipeline: they detect every word and then predict the text for each word. Errors can happen in both parts, e.g. some words not being detected and therefore missing from the output, or a word being recognized incorrectly, which is also common and more comparable to hallucination. However, given that recognition is trained to work on only a small patch, accuracy is often higher. Compare this to VLMs, which look at the entire image/context and auto-regressively generate tokens/text; that brings in a lot of language bias, hence hallucinations.
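
      That word-level granularity is also what gives traditional pipelines usable confidence scores, which VLMs don't expose. A quick sketch with pytesseract (assumes a local tesseract install):

      ```python
      from PIL import Image
      import pytesseract

      data = pytesseract.image_to_data(
          Image.open("scan.png"), output_type=pytesseract.Output.DICT
      )

      # Flag low-confidence words instead of silently accepting them
      for text, conf in zip(data["text"], data["conf"]):
          if text.strip() and float(conf) < 60:   # tesseract confidences are 0-100
              print(f"low-confidence word: {text!r} ({conf})")
      ```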
  • thegabrielea year ago
    I'm using Gemini to solve textual CAPTCHAs with some good results (better than untrained OCR).

    I will give this a shot

  • dehrmanna year ago
    Is this burying the lede? OCR is a solved problem, but structuring document data from scans isn't.
  • nyeaha year ago
    It's not fair to call it a "Mistrial" just because it hallucinates a little bit.
    • wendyshua year ago
      Who called it that?
      • nyeaha year ago
        Well ... nobody. I read it wrong at first glance.
  • Zufriedenheita year ago
    How can I use these new OCR tools to make PDF files searchable by embedding the text layer?
  • anovicka year ago
    How does one use it to identify bounding rectangles of images/diagrams in the PDF?
  • OrvalWintermutea year ago
    I'm happy to see this development after being underwhelmed with Chatgpt OCR!
  • beebaweena year ago
    Wonder how it does with table data in pdfs / page-long tabular data?
  • cavisnea year ago
    It's funny how Gemini consistently beats Google's dedicated document API.
    • jjicea year ago
      I'm not surprised honestly - it's just the newer better things vs their older offering
  • jhatemyjoba year ago
    As far as open source OCRs go, Tesseract is still the best, right?
  • d_llona year ago
    It's disappointing to see that the benchmark results are so opaque. I hope we see reproducible results soon, and hopefully from Mistral themselves.

    1. We don't know what the evaluation setup is. It's very possible that the ranking would be different with a bit of prompt engineering.

    2. We don't know how large each dataset is (or even how the metrics are calculated/aggregated). The metrics are all reported as XY.ZW%, but it's very possible that the .ZW% -- or even Y.ZW% -- is just noise.[1]

    3. We don't know how the datasets were mined or filtered. Mistral could have (even accidentally!) filtered out precisely the data points that their model struggled with. (E.g., imagine a well-meaning engineer testing a document with Mistral OCR first, finding it doesn't work, and deducing that it's probably bad data and removing it.)

    [1] https://medium.com/towards-data-science/digit-significance-i...

  • jbverschoora year ago
    Ohhh. Gonna test it out with some 100+ year old scribbles :)
    • jbverschoora year ago
      It did better than any other solution out there. However, I can only validate by the logic of the text. It's a recipe book.
  • WhitneyLanda year ago
    1. There’s no simple page / sandbox to upload images and try it. Fine, I’ll code it up.

    2. “Explore the Mistral AI APIs” (https://docs.mistral.ai) links to all apis except OCR.

    3. The docs on the api params refer to document chunking and image chunking but no details on how their chunking works?

    So much unnecessary friction smh.

    • cooperaustinja year ago
      There is an OCR page on the link you provided. It includes a very, very simple curl command (like most of their docs).

      I think the friction here exists outside of Mistral's control.

      • WhitneyLanda year ago
        How is it out of their control to document what they mean by chunking in their parameters?
      • kergonatha year ago
        > There is an OCR page on the link you provided.

        I don’t see it either. There might be some caching issue.

  • noloza year ago
    Are there any open source projects with the same goal?
  • t_seaa year ago
    They really went for it with the hieroglyphs opening.
  • ritvikpandey21a year ago
    as builders in this space, we decided to put it to the test on complex nested tables, pie charts, etc. to see if the same VLM hallucination issues persist, and to what degree. while results were promising, we found several critical failure nodes across two document domains.

    check out our blog post here! https://www.runpulse.com/blog/beyond-the-hype-real-world-tes...

  • linklater12a year ago
    Document processing is where b2b SAAS is at.
  • revskilla year ago
    The Next.js error is still not caught correctly.
  • jwra year ago
    Alas, I can't run it locally. So it still doesn't solve the problem of OCR for my PDF archive containing my private data...
  • maCDzPa year ago
    Oh - on premise solution - awesome!
  • zelcona year ago
    Release the weights or buy an ad
  • sashank_1509a year ago
    Really cool, thanks Mistral!
  • rjurneya year ago
    What about tables in PDFs?
  • Zopieuxa year ago
    Saving you a click: no, it cannot be self-hosted (unless you have a few million dollars lying around).
  • bugglebeetlea year ago
    Congrats to Mistral for yet again releasing another closed source thing that costs more than running an open source equivalent:

    https://github.com/DS4SD/docling

    • Squarexa year ago
      I am all for open source, but where do you see benchmarks that conclude that it's just equivalent?
      • bugglebeetlea year ago
        Where do you see open source benchmark results that confirm Mistral’s performance?
    • anonymousd3vila year ago
      Back in my day, Mistral used to torrent models.
  • joeevans1000a year ago
    Can someone give me a tl;dr on how to start using this? Is it available if one signs up for a regular Mistral account?
  • kiratpa year ago
    It's shocking how much our industry fails to see past its own nose.

    Not a single example on that page is a Purchase Order, Invoice etc. Not a single example shown is relevant to industry at scale.

    • merba year ago
      Mistral is based in Europe, where invoices are more or less sent digitally in like 95% of cases anyway. Some are even structured e-invoices, which will at some point be mandatory in the EU. For orders there are proposals for that, too. And basically, invoice data extraction is a different beast.
      • codetrottera year ago
        One use case is digitising receipts from business-related travel: expenses that employees paid out of their own pocket and for which they submit pictures to the business for reimbursement.

        Bus tickets, meals including dinners and snacks, etc., for which the employee has paper receipts.

        • _bc2za year ago
          Yeah, digitizing receipts is still a huge challenge for most companies, especially for expense reimbursements. Even though invoices are increasingly digital, employees still end up with physical receipts for work-related expenses. From what I've seen, there are some interesting contenders like Klippa that seem to solve exactly this problem [1].

          Curious to know if anyone has heard of or used their OCR or a similar tool. Apparently it's not an LLM in disguise but an actual AI trained on gazillions of documents, so the risk of hallucination might be lower than with LLM OCR solutions like Mistral's.

          [1] https://www.klippa.com/en/ocr/ocr-api/

        • merba year ago
          Receipts are different, and they are harder to OCR. Thermal prints are often awful in quality, and you usually need to correct some things when dealing with them. I doubt this tech changes that significantly.
      • revnodea year ago
        So an invoice attached to an email as a PDF is "sent digitally"... Those unfamiliar with PDF will then assume text and data extraction is trivial, but this isn't true. You can have a fully digital, non-image PDF that is vector-based and has what looks like text, yet doesn't contain a single piece of extractable text. It's all about how the PDF was generated. Tables can be formatted in a million ways, etc.

        Your best bet is to always convert it to an image and OCR it to extract structured data.
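
        A minimal sketch of that rasterize-then-OCR route, assuming the pdf2image and pytesseract Python packages (which wrap the poppler and tesseract binaries) and a hypothetical "invoice.pdf":

          # Render each PDF page to an image, then OCR the image.
          from pdf2image import convert_from_path
          import pytesseract

          pages = convert_from_path("invoice.pdf", dpi=300)  # one PIL image per page
          for i, page in enumerate(pages, start=1):
              text = pytesseract.image_to_string(page)
              print(f"--- page {i} ---")
              print(text)

        The structured extraction (line items, totals, etc.) then happens on top of that raw text, with whatever rules or model you prefer.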

        • merba year ago
          This is simply not true. Maybe it's easier and you don't need 100% precision, but it is actually possible to extract the text and layout of digital PDFs; otherwise it would be impossible to display them. Of course some people still add image fragments to a PDF, but that practice is basically dying. I did not see a single PDF last year where it was impossible to extract the layout.
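
          For born-digital PDFs, something like the PyMuPDF sketch below (imported as fitz; "report.pdf" is a hypothetical input) pulls out both the words and their coordinates, which is the kind of layout extraction meant here:

            import fitz  # PyMuPDF

            doc = fitz.open("report.pdf")
            for page in doc:
                # Each tuple: (x0, y0, x1, y1, word, block_no, line_no, word_no)
                for x0, y0, x1, y1, word, *_ in page.get_text("words"):
                    print(f"{word!r} at ({x0:.0f}, {y0:.0f})")

            # Scanned or image-only pages simply yield no words here.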
      • wolfi1a year ago
        Even in Europe this is still a thing; I know of systems which are still unable to read items spanning more than one line (costing a sh*tload of money).
      • kiratpa year ago
        This isn't even close to true.

        Source: We have large EU customers.

        • merba year ago
          So your EU customers send you invoices via letters? Wow. There are some companies that still deal with printed invoices, but they are most often smaller companies that deal with health-related things.
          • kiratpa year ago
            Our EU customers use our technology to deal with all the invoices etc. they get sent as PDFs.
      • napoluxa year ago
        Can confirm; in Italy electronic invoicing has been mandatory since 2019.
    • simpaticodera year ago
      Another good example would be contracts of any kind. Imagine photographing a contract (like a car loan) and on the spot getting an AI to read it, understand it, forecast scenarios, highlight red flags, and do some comparison shopping for you.
      • JBiserkova year ago
        ... imagining ...

        ... hallucinating during read ...

        ... hallucinating during understand ...

        ... hallucinating during forecast ...

        ... highlighting a hallucination as red flag ...

        ... missing an actual red flag ...

        ... consuming water to cool myself...

        Phew, being an AI is hard!

        • simpaticodera year ago
          Your points are well-taken, but I think contracts are a small enough domain, and well enough represented in the corpus, that results can actually be pretty solid. This is especially true with good prompting and some sort of feedback loop.
    • kashnotea year ago
      Fwiw, they have an example of a parking receipt in a cookbook: https://colab.research.google.com/github/mistralai/cookbook/...
    • sha16a year ago
      I wanted to apply OCR to my company's invoicing, since they basically did purchasing for a bunch of other large companies, but the variability in the conversion was not tolerable. Even rounding something differently could catch an accountant's eye, let alone detecting an "8" as a "0", or worse.
    • guiomiea year ago
      Agreed. In general I've had such bad performance with complex, table-based invoice parsing that every few months I try the latest models to see if they're better. The announcement does claim 96.12 on the top-tier benchmark under the Table category, though.
    • arpinuma year ago
      Businesses at scale use EDI to handle purchase orders and invoices, no OCR needed.
      • cdolana year ago
        That's simply not a factual statement.

        Scaled businesses do use EDI, but they still receive hundreds of thousands of PDF documents a month.

        Source: built a SaaS product that handles PDFs for a specific industry.

    • dotnetkowa year ago
      Agreed, though in this case, they are going for general-purpose OCR. That's fine in some cases, but purpose-built models trained on receipts, invoices, tax documents, etc., definitely perform better. We've got a similar API solution coming out soon (https://digital.abbyy.com/code-extract-automate-your-new-mus...) that should work better for businesses automating their docs at scale.
    • mtillmana year ago
      We find CV models to be better (higher midpoint on an ROC curve) for the types of docs you mention.
    • mentalgeara year ago
      To be fair: Reading the blog post, the main objective seems to have been to enable information extraction with high confidence for the academic sector (e.g. unlocking all these paper pdfs), and not necessarily to be another receipt scanner.
      • kiratpa year ago
        It's hilarious that the academic sector 1. publishes as PDF, 2. spends all this energy on how to extract that info back out of PDF, and 3. publishes that research as PDF as well.

        Receipt scanning is a business that's multiple orders of magnitude more valuable. Mistral at this point is looking for a commercial niche (like how Claude is aiming at software development).

  • bondoloa year ago
    Such a shame that PDF doesn’t just, like, include the semantic structure of the document by default. It is brilliant that we standardized on an archival document format that doesn’t include direct access to the document text or structure as a core intrinsic default feature.

    I say this with great anger as someone who works in accessibility and has had PDF as a thorn in my side for 30 years.

    • NeutralForesta year ago
      I agree with this so much. I've sometimes tried to push friends and family to use text formats (or at least send them something like Markdown), which are very easy to render in the browser anyway. But often you have to fall back to PDF, which I dislike very much. There's so much content like books and papers that is in PDF as well. Why did we pick a binary blob as our shareable format again?
      • meatmaneka year ago
        > Why did we pick a binary blob as shareable format again?

        PDF was created to solve the problem of being able to render a document the same way on different computers, and it mostly achieved that goal. Editable formats like .doc, .html, .rtf were unreliable -- different software would produce different results, and even if two computers have the exact same version of Microsoft Word, they might render differently because they have different fonts available. PDFs embed the fonts needed for the document, and specify exactly where each character goes, so they're fully self-contained.

        After Acrobat Reader became free with version 2 in 1994, everybody with a computer ended up downloading it after running across a PDF they needed to view. As it became more common for people to be able to view PDFs, it became more convenient to produce PDFs when you needed everybody to be able to view your document consistently. Eventually, the ability to produce PDFs became free (with e.g. Office 2007 or Mac OS X's ability to print to PDF), which cemented PDF's popularity.

        Notably, the original goals of PDF had nothing to do with being able to copy text out of them -- the goal was simply to produce a perfect reproduction of the document on screen/paper. That wasn't enough of an inconvenience to prevent PDF from becoming popular. (Some people saw the inability for people to easily copy text from them as a benefit -- basically a weak form of text DRM.)

        • NeutralForesta year ago
          Thanks for the explanation! I was vaguely aware of those issues but not in depth. It all makes sense of course and now PDF is so deeply entrenched it's very difficult to push other formats. It's interesting that the contention between content and layout is still such an issue. I don't know what the fix is, maybe just the web?
    • cess11a year ago
      PDF is pretty strictly modeled on printed documents and their mainstream typography at the time of invention of Postscript and so on.

      Printed documents do not have any structure beyond the paper and placement of ink on them.

    • lukasba year ago
      Even assuming you could get people to do the work (probably the real issue here), could a single schema syntax capture the semantics of the universe of documents that exist as PDFs? PDFs succeeded because they could reproduce anything.
    • andaia year ago
      Tables? I regularly run into PDFs where even the body text is mangled!
  • blackeyeblitzara year ago
    A similar but different product that was discussed on HN is OlmOCR from AI2, which is open source:

    https://news.ycombinator.com/item?id=43174298

  • sunami-aia year ago
    Making Transformers the same cost as CNNs (which are used in character-level OCR, as opposed to image-patch-level) is a good thing. The problem with CNN-based character-level OCR is not the recognition models but the detection models. In a former life, I found a way to increase detection accuracy, and therefore overall OCR accuracy, and used that as an enhancement on top of Amazon and Google OCR. It worked really well. But the transformer approach is more powerful, and if it can be done for $1 per 1000 pages, that is a game changer, IMO, at least for incumbents offering traditional character-level OCR.
    • menaerusa year ago
      It certainly isn't the same cost if expressed as the non-subsidized $$$ one needs for Transformer compute, i.e. the infrastructure.

      CNNs trained specifically for OCR can run in real time on compute as small as a mobile device.

      • anon373839a year ago
        A bit of a tangent, but aren’t CNNs still dominating over ViTs among computer vision competition winners?
        • menaerusa year ago
          I haven't watched that space very closely, but IMO ViTs have great potential since, in comparison to CNNs, they allow the model to learn and understand complex relations in the data. Where this matters, I expect it to matter a lot. OCR, I think, is not the greatest such example - while it helps to understand the surrounding context, I don't think it's that critical for performance.
  • ChrisArchitecta year ago
    [flagged]
    • vessenesa year ago
      No comments there yet - this is at the top of the home page, so let's use this one.
  • hyuuua year ago
    It's weird timing, because I just launched https://dochq.io - AI document extraction where you can define what you need to get out of your documents in plain English. I legitimately thought this was going to be such a niche product, but there has been a very rapid rise in AI-based OCR lately; an article/tweet about using Gemini for OCR even went viral about 2 weeks ago, I think. Fun times.