troysk
In my experience, this works well but doesn't scale to all kinds of documents. For scientific papers, it can't render formulas; Meta's Nougat is the best model for that. For invoices and records, Donut works better. Both of these models fail in some cases, so you end up running an LLM to fix the issues. Even then, the LLM won't be able to do tables and charts justice, because the details (bold/italic/other nuances) were lost during the OCR process. I feel these might also be "classical" methods. I have found vision models to be much better, since they work from the original document/image. Clear prompts help, but you still won't get 100% results, as the models tend to venture off on their own paths. I believe that can be fixed with fine-tuning, but no good vision model offers fine-tuning on images; Google Gemini seems to have the feature, but I haven't tried it. Few-shot prompting helps keep the LLM from hallucinating, resists prompt injection, and helps it adhere to the requested format.
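
For example, a minimal few-shot sketch against an OpenAI-style chat API (the example pairs, model name, and input text are made up for illustration):

    from openai import OpenAI

    client = OpenAI()

    ocr_text = "Rent sha11 be payab1e monthlv in advence."  # raw OCR output to correct

    messages = [
        {"role": "system", "content": "Fix OCR errors. Output only the corrected text, "
                                      "preserving the original wording and formatting."},
        # Few-shot examples anchor the model to the task and the output format.
        {"role": "user", "content": "Tota1 amouni due: $1,2OO.00"},
        {"role": "assistant", "content": "Total amount due: $1,200.00"},
        {"role": "user", "content": "Th1s Agreernent is rnade on the 5th day of March"},
        {"role": "assistant", "content": "This Agreement is made on the 5th day of March"},
        # The actual OCR output to correct goes last.
        {"role": "user", "content": ocr_text},
    ]

    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(response.choices[0].message.content)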
kelsey98765431
Fantastic work is emerging in this field, and with the new release of the Schnell model of the Flux series we will have the downstream captioning datasets we need to produce a new SOTA vision model, which has been the last straggler among the various open LLM augmentations. Most vision models are still based on ancient CLIP/BLIP captioning, and even with something like LLaVA or the remarkable Phi-LLaVA, we are still held back by the pretrained vision components, which have been needing love for some months now.

Tessy plus an LLM is a good pipe; it's likely what produced Schnell, and the configuration will soon be reversed, used for testing and checking while the LLM does the bulk of transcription via vision-modality adaptation. The fun part is that multilingual models will be able to read and translate, opening up new work for scholars searching through digitized works. I have already had success in this area with no development at all, and after we get our next SOTA vision models I am expecting a massive jump in quality. I expect English vision-model adapters to show up using the LLaVA architecture first; this may put some other Latin-script languages into the readable category depending on the adapted model, but we could see a leapfrog of scripts becoming readable all at once. LLaVA-Phi3 already seems able to transcribe tiny pieces of Hebrew with relative consistency. It also has horrible hallucinations, so there is very much an unknown limiting factor here currently. I was planning some segmentation experiments, but Schnell knocked that out of my hands like a bar of soap in a prison shower, so I will be waiting for a distilled captioning SOTA before I re-evaluate this area.

Exciting times!

jonathanyc
It's a very interesting idea, but the potential for hallucinations reminds me of JBIG2, a compression format which would sometimes substitute digits in faxed documents: https://en.wikipedia.org/wiki/JBIG2#Character_substitution_e...

> In 2013, various substitutions (including replacing "6" with "8") were reported to happen on many Xerox Workcentre photocopier and printer machines. Numbers printed on scanned (but not OCR-ed) documents had potentially been altered. This has been demonstrated on construction blueprints and some tables of numbers; the potential impact of such substitution errors in documents such as medical prescriptions was briefly mentioned.

> In Germany the Federal Office for Information Security has issued a technical guideline that says the JBIG2 encoding "MUST NOT be used" for "replacement scanning".

I think the issue is that even if your compression explicitly notes that it's lossy, or your OCR explicitly states that it uses an LLM to fix up errors, if the output looks like it could have been created by a non-lossy algorithm, users will just assume that it was. So in some sense it's better to have obvious OCR errors when there's any uncertainty.

geraldog
This is a wonderful idea, but while I appreciate the venerable Tesseract I also think it's time to move on.

I personally use PaddlePaddle and get way better results to correct with LLMs.

With PPOCRv3 I wrote a custom Python implementation to cut books at the word level by playing with whitespace thresholds. It works great for the kind of typesetting generally found in books, with predictable whitespace between words. This is all needed because PPOCRv3 is restricted to 320 x 240 pixels, if I recall correctly, and produces garbage if you downsample a big image and run a single pass.
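
A rough sketch of the whitespace-threshold idea for a single text-line image (illustrative only, not my actual implementation; the threshold values are arbitrary and depend on the typesetting):

    import numpy as np
    from PIL import Image

    def cut_words(line_image_path, gap_threshold=8, ink_threshold=128):
        """Split a single text-line image into word crops based on column whitespace."""
        img = Image.open(line_image_path).convert("L")
        arr = np.array(img)
        # A column counts as "ink" if any pixel in it is darker than ink_threshold.
        has_ink = (arr < ink_threshold).any(axis=0)

        words, start, gap = [], None, 0
        for x, ink in enumerate(has_ink):
            if ink:
                if start is None:
                    start = x
                gap = 0
            elif start is not None:
                gap += 1
                # A run of blank columns wider than gap_threshold ends the current word.
                if gap > gap_threshold:
                    words.append(img.crop((start, 0, x - gap + 1, img.height)))
                    start, gap = None, 0
        if start is not None:
            words.append(img.crop((start, 0, img.width, img.height)))
        return words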

Later on I converted the Python code to C to work with the Rockchip RK3399Pro NPU. It works wonderfully. I used PaddleOCR2Pytorch to convert the models for the rknn-api first and wrote the C implementation that cuts words on top of the rknn-api.

But with PPOCRv4 I don't think this is even needed: it's a newer architecture, and I don't think it is bound by the pixel-size restriction. That is, it should work "out of the box", so to speak. With the caveat that PPOCRv3 detection always worked better for me; the PPOCRv4 detection model gave me big headaches.

janalsncm
Having tried this in the past, it can work pretty well 90% of the time. However, there are still some areas where it will struggle.

Imagine you are trying to read a lease contract. The two areas where the LLM may be useless are numbers and names (names of people, places, or addresses). There's no way for your LLM to accurately know what the rent should be, or to know the name of a specific person.

anonymoushn
Have you tried using other OCR packages? I had to give up on Tesseract after every mode and model I tried read a quite plain image of "77" as "7" (and interestingly the javascript port reads it as "11"). Pic related: https://i.postimg.cc/W3QkkhCK/speed-roi-thresh.png
kbyatnal
"real improvements came from adjusting the prompts to make things clearer for the model, and not asking the model to do too much in a single pass"

This is spot on, and it's the same as how humans behave. If you give a human too many instructions at once, they won't follow all of them accurately.

I spend a lot of time thinking about LLMs + documents, and in my opinion, as the models get better, OCR is soon going to be a fully solved problem. The challenge then becomes explaining the ambiguity and intricacies of complex documents to AI models in an effective way, rather than the OCR capability itself.

disclaimer: I run an LLM document processing company called Extend (https://www.extend.app/).

Oras
If anyone is looking to compare results visually, I have created an open-source OCR visualiser to help identify missing elements (especially in tables).

https://github.com/orasik/parsevision

jesprenj
> My original project had all sorts of complex stuff for detecting hallucinations and incorrect, spurious additions to the text (like "Here is the corrected text" preambles

> asks it to correct OCR errors

So, if I understand correctly, you add some prompt like "fix this text" and then the broken text?

Why not do it differently: instead of a chat model, use a completion model, feed the broken OCR'd text into the model token by token, read the next-token probabilities, and select the token that best matches the original document, maybe looking 3-5 tokens ahead?
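
Something like this rough sketch of what I mean, using a Hugging Face causal model to score each token of the OCR text against the model's next-token distribution (model choice and threshold are just placeholders; this only flags unlikely tokens rather than doing full constrained decoding):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ocr_text = "The patient shou1d take 5Omg twice daily."
    ids = tok(ocr_text, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(ids).logits                      # [1, seq_len, vocab]

    probs = torch.softmax(logits[0, :-1], dim=-1)       # distribution over the next token
    positions = torch.arange(ids.shape[1] - 1)
    token_probs = probs[positions, ids[0, 1:]]          # probability given to each actual token

    for token_id, p in zip(ids[0, 1:], token_probs):
        flag = "  <-- suspicious" if p < 1e-3 else ""
        print(f"{tok.decode(int(token_id))!r:>12}  p={p.item():.4g}{flag}")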

Wouldn't this greatly decrease "hallucinations"?

I'm not trying to insult your approach, I'm just asking for your opinion.

123yawaworht456
when I was working with Tesseract, a particular issue I had was its tendency to parse a leading "+" as "4" about half the time. e.g. "+40% ROI" would get parsed as "440% ROI".

the font was perfectly fine, the screenshots were crispy PNGs.

An LLM can't really correct that. I appreciate that Tesseract exists, and it's mostly fine for non-serious things, but I wouldn't let it anywhere near critical data.

jmeyer2k
Love the idea! We're doing something similar to parse rubrics and student submissions at https://automark.io - great to see an open source library exploring the space more! Like you said, I think iteratively adding explicit layers of LLM understanding to the raw extraction will allow a lot more control over what information gets extracted. Also interested to see an integration with GPT-4V as an additional aid. I'd love to chat sometime if you have time - my email is in my bio.
aliosm
I'm working on Arabic OCR for a massive collection of books and pages (over 13 million pages so far). I've tried multiple open-source models and projects, including Tesseract, Surya, and a Nougat small model fine-tuned for Arabic. However, none of them matched the latency and accuracy of Google OCR.

As a result, I developed a Python package called tahweel (https://github.com/ieasybooks/tahweel), which leverages Google Cloud Platform's Service Accounts to run OCR and provides page-level output. With the default settings, it can process a page per second. Although the underlying Google OCR isn't open-source, this setup outperforms the other solutions by a significant margin.

For example, OCRing a PDF file using Surya on a machine with a 3060 GPU takes about the same amount of time as the tool I mentioned, but it consumes more power and hardware resources while delivering worse results. That has been my experience with Arabic OCR specifically; I'm not sure whether English OCR faces the same challenges.

vunderba
I did something similar about a decade ago because I was using tesseract to OCR Chinese.

Part of the problem is that if you use Tesseract to recognize English text, it's much easier to clean up afterwards: if it makes a mistake, it's usually in only a single character, and you can use Levenshtein distance to spellcheck and fix it, which helps a lot with accuracy.

Logographic languages such as Chinese present a particular challenge to "conventional post-processing": many words are represented by two characters, and quite a few words by a single glyph. This is particularly difficult because if the OCR gets that glyph wrong, there's no obvious way to detect the error.

The solution was to use ImageMagick to "munge" the image (scale, normalize, threshold, etc.), send each of these variations to Tesseract, and then use a Chinese-corpus-based Markov model to score the statistical likelihood of each recognized sentence and vote on a winner.

It made a significant improvement in accuracy.
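
A heavily simplified sketch of that kind of pipeline (illustrative only; the real version used ImageMagick and a corpus-trained Markov model rather than the toy bigram scorer shown here, and it assumes Tesseract's chi_sim data is installed):

    import pytesseract
    from PIL import Image, ImageOps

    def variants(img):
        yield img
        yield ImageOps.autocontrast(img)                      # normalize contrast
        yield img.resize((img.width * 2, img.height * 2))     # upscale
        yield img.point(lambda p: 255 if p > 140 else 0)      # hard threshold

    def bigram_score(text, bigram_logprobs, floor=-12.0):
        """Sum of log-probabilities of adjacent character pairs (unknown pairs get a floor)."""
        return sum(bigram_logprobs.get(pair, floor) for pair in zip(text, text[1:]))

    def best_reading(path, bigram_logprobs):
        """OCR several preprocessed variants and keep the most corpus-plausible result."""
        img = Image.open(path).convert("L")
        candidates = [
            pytesseract.image_to_string(v, lang="chi_sim").strip() for v in variants(img)
        ]
        return max(candidates, key=lambda t: bigram_score(t, bigram_logprobs))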

katzinsky
Vision transformers are good enough that you can use them alone even on cursive handwriting. I've had amazing results with Microsoft's models and have my own little piece of wrapper software I use to transcribe blog posts I write in my notebook.
foota
I wonder if you could feed the results from an LLM back into the OCR model to get it to make better decisions. E.g., if it's trying to distinguish a 1 from an I, the LLM could provide a probability distribution.
__jl__
I think Gemini Flash 1.5 is the best closed-source model for this. Very cheap, particularly compared to GPT-4o-mini, which is priced the same as GPT-4 for image input tokens. Performance and speed are excellent. I convert each PDF page to an image and send one request per page to Flash (asynchronously). The prompt asks for Markdown output with specific formatting guidelines. For my application (mainly PDF slideshows with less text), the output is better than any of the dedicated tools I tested, particularly for equations and tables.
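
Roughly this kind of flow, as a sketch (assuming the google-generativeai SDK and pdf2image; the prompt here is abbreviated and the file name is a placeholder):

    import asyncio
    import google.generativeai as genai
    from pdf2image import convert_from_path

    genai.configure(api_key="...")  # placeholder
    model = genai.GenerativeModel("gemini-1.5-flash")

    PROMPT = "Transcribe this page to Markdown. Preserve headings, tables, and equations."

    async def ocr_page(image):
        response = await model.generate_content_async([PROMPT, image])
        return response.text

    async def ocr_pdf(path):
        pages = convert_from_path(path, dpi=200)   # one PIL image per page
        return await asyncio.gather(*(ocr_page(p) for p in pages))

    markdown_pages = asyncio.run(ocr_pdf("slides.pdf"))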
x-yl
I'm curious whether a multimodal model would be better at the OCR step than Tesseract. It would probably increase the cost, but I wonder if that would be offset by needing less post-processing.
rasz
This is how you end up with "Xerox scanners/photocopiers randomly alter numbers in scanned documents" https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

I don't want hallucinations in places where the OCR loses the plot. I want either better OCR or an error message telling me to repeat the scan.

Zambyte
Very cool! I have a hotkey that grabs a region, pipes the screenshot through Tesseract, and then pipes that into my clipboard. I'll have to extend it to pipe the text through Ollama too :)
constantinum
Most document-processing automation projects at the enterprise level require parsing complex documents with tables, forms, handwriting, checkboxes, and scans. Examples include ACORD insurance forms, IRS tax forms, and bank statements. I'm not even getting into how different each document can be, even when they are of the same type.

For anyone curious about automating document processing end-to-end by leveraging LLMs, do try Unstract. It is open source.

https://github.com/Zipstack/unstract

Unstract also has a commercial version of its document-agnostic parser, which you can channel into any RAG project.

https://unstract.com/llmwhisperer/

jdthedisciple
Very recently we had Zerox [0] (PDF -> Image -> GPT-4o-mini based OCR), and I found it to work fantastically well.

Would be curious about comparisons between these.

[0] https://github.com/getomni-ai/zerox

itsadok
In your assess_output_quality function, you ask the LLM to give a score first, then an explanation. I haven't been following the latest research on LLMs, but I thought you usually want the explanation first, to get the model to "think out loud" before committing to the final answer. Otherwise, it might commit semi-randomly to some score and then write whatever explanation it can come up with to justify that score.
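
I.e. something along these lines, with the explanation requested before the score (field names and scale are just an example, not the project's actual prompt):

    # Illustrative only: ask for the explanation before the score so the model
    # "thinks out loud" first and the score follows from the reasoning.
    ASSESSMENT_PROMPT = """Compare the corrected text to the raw OCR output.
    Respond as JSON with the keys in this exact order:
      "explanation": a short justification, written before deciding on a score
      "score": an integer from 0 to 100 for the quality of the correction
    """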
scw
Exciting concept! Note that the LLM-corrected version does drop a full paragraph from the output at the bottom of the second page (starting with an asterisk and "My views regarding inflationary possibilities"). I'm not sure if there is a simple way to mitigate this risk, but it would be nice to fall back on the uncorrected text if the LLM can't produce valid results for some region of the document.
jodosha
Thanks for sharing the info.

> where each chunk can go through a multi-stage process, in which the output of the first stage is passed into another prompt for the next stage

Is this made possible by your custom code, or is it something OpenAI now offers off the shelf via their API?

If the latter, that would partially replace LangChain for simple pipelines.

m-louis030195
That's cool! I use Tesseract, Whisper, and now Apple & Windows native OCR here:

https://github.com/louis030195/screen-pipe

And we also add LLM layers to fix errors

pennomi
I keep hoping someone at YouTube will do this for their autogenerated Closed Captioning. Nice work!
collinmcnulty
Does anyone have a solution that works well for handwriting? I have 10 years of handwritten notes that I’d love to make searchable but all OCR I’ve tried has been quite poor. These solutions seem focused on typeset documents.
TZubiri
Anyone remember that story about a bug in a scanner that scanned blueprints and, due to overzealous compression, got the measurements wrong while still producing crisp, high-definition output?
axpy906
Interesting. I'd be curious whether someone has solved this at scale at a reasonable cost. The double call seems expensive to me when alternatives can do it in one pass, though those are still quite costly.
reissbaker
Even simpler, you can convert each PDF page to a PNG and ask gpt4 to simply transcribe the image. In my experience it's extremely accurate, more so than Tesseract or classic OCR.
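
For example, something along these lines with the OpenAI API (a rough sketch; the model name, prompt, and file name are just placeholders):

    import base64
    from openai import OpenAI

    client = OpenAI()

    # Send a rasterized PDF page as a base64 PNG and ask for a verbatim transcription.
    with open("page_001.png", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe all text in this image verbatim."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)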
echoangle
This assumes that the input text actually is well formed, right? If I scan a page containing bogus text or typos, this will actually "correct" those mistakes in the output, right?
akamor
Does this work well on key-value pairs and tables? That is where I typically have the most trouble with Tesseract, and where the cloud providers' OCR systems really shine.
rafram
Cool stuff! I noticed that it threw away the footnote beginning with "My views regarding inflationary possibilities" in the example text, though.
dr_dshiv
I use Google Lens to OCR 15th-century Latin books, then paste the result into ChatGPT and ask it to correct OCR errors. Spot-checking, it is very reliable.

Then translation can occur

anothername12
We tried this. It's no good for details like names, places, amounts, and the other interesting things. It will, however, fill in the gaps with made-up stuff, which was rather infuriating.
esafak
I'd suggest measuring the word and character error rates with and without the LLM. That would let people quickly see how well it works.
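
For example, a quick sketch with the jiwer package, assuming hand-checked reference transcriptions are available (file names are placeholders):

    import jiwer

    # Compare OCR output against a ground-truth transcription, with and without
    # the LLM correction pass.
    reference = open("page_001_ground_truth.txt").read()
    raw_ocr   = open("page_001_tesseract.txt").read()
    corrected = open("page_001_llm_corrected.txt").read()

    for name, hypothesis in [("raw OCR", raw_ocr), ("LLM-corrected", corrected)]:
        print(f"{name}: WER={jiwer.wer(reference, hypothesis):.3f} "
              f"CER={jiwer.cer(reference, hypothesis):.3f}")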
simonw
Something that makes me nervous about this general approach is the risk of safety filters or accidental (or deliberate) instruction following interfering with the results.

I want to be able to run OCR against things like police incident reports without worrying that a safety filter in the LLM will refuse to process the document because it takes exception to a description of violence or foul language.

If a scanned document says "let's ignore all of that and talk about this instead" I want to be confident the LLM won't treat those as instructions and discard the first half of the text.

I'm always worried about prompt injection - what if a scanned document deliberately includes instructions to an LLM telling it to do something else?

Have you encountered anything like this? Do you have any measures in place that might prevent it from happening?

snats
Is there any good model for OCR on handwritten text? I feel like most models are currently kind of trash.
wantsanagent
How does this compare in terms of speed, quality, and price to sending images to VLMs like GPT-4o or Claude 3.5?
sannysanoff
What are examples of local LLMs that accept images, as mentioned in the README?
yding
Very cool!
localfirst
Unfortunately, an LLM thrown at OCR doesn't work well on documents large enough to be useful, from what I've been told.

Nothing I've seen here offers anything new over what has already been attempted.

localfirst
paging Jason Huggins (https://news.ycombinator.com/user?id=hugs) to add his two cents to this discussion
nottorp
Hmm, I know someone adding NN-based OCR to number plate recognition, in production. Why bring LLMs into this? Because all you have is a hammer?
fsndz
That's super useful. It might be a perfect fit for a RAG app with PostgreSQL and pgvector: https://www.lycee.ai/courses/91b8b189-729a-471a-8ae1-717033c...