Tesseract plus an LLM is a good pipeline; it's likely what produced SCHNELL, and the configuration will soon be reversed, with Tesseract used for testing and checking while the LLM does the bulk of transcription via vision-modality adaptation. The fun part is that multilingual models will be able to read and translate, opening up new work for scholars searching through digitized works. I have already had success in this area with no development at all, and once the next SOTA vision models arrive I'm expecting a massive jump in quality. I expect English vision-model adapters using the LLaVA architecture to show up first; this may put some other Latin-script languages into the readable category depending on the adapted model, but we could see a leapfrog of scripts becoming readable all at once. LLaVA-Phi-3 already seems able to transcribe tiny pieces of Hebrew with relative consistency. It also hallucinates horribly, so there is very much an unknown limiting factor here at the moment. I was planning some segmentation experiments, but Schnell knocked that out of my hands like a bar of soap in a prison shower, so I'll wait for a distilled captioning SOTA before I re-evaluate this area.
Exciting times!
> In 2013, various substitutions (including replacing "6" with "8") were reported to happen on many Xerox Workcentre photocopier and printer machines. Numbers printed on scanned (but not OCR-ed) documents had potentially been altered. This has been demonstrated on construction blueprints and some tables of numbers; the potential impact of such substitution errors in documents such as medical prescriptions was briefly mentioned.
> In Germany the Federal Office for Information Security has issued a technical guideline that says the JBIG2 encoding "MUST NOT be used" for "replacement scanning".
I think the issue is that even if your compression explicitly notes that it's lossy, or if your OCR explicitly states that it uses an LLM to fix up errors, if the output looks like it could have been created by a non-lossy algorithm, users will just assume it was. So in some sense it's better to have obvious OCR errors when there's any uncertainty.
I personally use PaddlePaddle and get much better results to then correct with LLMs.
With PPOCRv3 I wrote a custom Python implementation that cuts books at word level by playing with whitespace thresholds. It works great for the kind of typesetting generally found in books, with a predictable whitespace gap between words. This is all needed because PPOCRv3 is restricted to 320 x 240 pixels, if I recall correctly, and produces garbage if you downsample a big image and run a single pass.
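For illustration, a rough sketch of that kind of word-level cutting: binarize a horizontal text-line image, look at its column-wise ink profile, and split wherever a run of blank columns exceeds a whitespace threshold. This assumes OpenCV and a BGR line image; the threshold value is an illustrative guess, not the parameters actually used.

```python
import cv2
import numpy as np

def cut_words(line_img: np.ndarray, gap_threshold: int = 12) -> list[np.ndarray]:
    # Assumes a 3-channel BGR crop of a single text line.
    gray = cv2.cvtColor(line_img, cv2.COLOR_BGR2GRAY)
    # Invert so ink is non-zero on a black background; Otsu picks the threshold.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    ink_per_column = binary.sum(axis=0)  # column-wise ink profile

    words, start, blank_run = [], None, 0
    for x, ink in enumerate(ink_per_column):
        if ink > 0:
            if start is None:
                start = x  # first inked column of a new word
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= gap_threshold:  # gap wide enough to be a word boundary
                words.append(line_img[:, start:x - blank_run + 1])
                start, blank_run = None, 0
    if start is not None:
        words.append(line_img[:, start:])  # trailing word
    return words
```

Each returned crop is then small enough to feed to the recognition model on its own.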
Later on I ported the Python code to C to run on the Rockchip RK3399Pro NPU. It works wonderfully. I first used PaddleOCR2Pytorch to convert the models for the rknn-api, then wrote the C implementation that cuts words on top of the rknn-api.
But with PPOCRv4 I think this isn't even needed: it's a newer architecture and I don't think it's bound by the pixel-size restriction, so it should work "out of the box", so to speak. With the caveat that PPOCRv3's detection always worked better for me; the PPOCRv4 detection model gave me big headaches.
Imagine you are trying to read a lease contract. The two areas where the LLM may be useless are numbers and names (names of people or places/addresses). There's no way for your LLM to accurately know what the rent should be, or the name of a specific person.
This is spot on, and it's the same as how humans behave. If you give a human too many instructions at once, they won't follow all of them accurately.
I spend a lot of time thinking about LLMs + documents, and in my opinion, as the models get better, OCR is soon going to be a fully solved problem. The challenge then becomes explaining the ambiguity and intricacies of complex documents to AI models in an effective way, less so the OCR capabilities themselves.
Disclaimer: I run an LLM document processing company called Extend (https://www.extend.app/).
> asks it to correct OCR errors
So, if I understand correctly, you add some prompt like "fix this text" and then the broken text?
Why not do it differently: instead of a chat model, use a completion model, feed the broken OCR'd text into it token by token, get the next-token probabilities, and select the token that best matches the original document, maybe looking 3-5 tokens ahead?
Wouldn't this greatly decrease "hallucinations"?
I'm not trying to insult your approach, I'm just asking for your opinion.
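For what it's worth, a minimal sketch of that idea could look something like this, assuming a HuggingFace causal LM (gpt2 purely as a placeholder) and a greedy, per-token match against the OCR'd text; the probability bonus and lookahead length are illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def edit_distance(a: str, b: str) -> int:
    # Plain Levenshtein distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def correct_ocr(ocr_text: str, model_name: str = "gpt2",
                top_k: int = 20, lookahead_chars: int = 12) -> str:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    start_id = tok.bos_token_id if tok.bos_token_id is not None else tok.eos_token_id
    input_ids = torch.tensor([[start_id]])
    corrected, remaining = "", ocr_text

    with torch.no_grad():
        while remaining:
            logits = model(input_ids).logits[0, -1]
            probs, cand = torch.topk(logits.softmax(-1), top_k)
            target = remaining[:lookahead_chars]
            best_id, best_piece, best_cost = None, "", None
            for p, tid in zip(probs.tolist(), cand.tolist()):
                piece = tok.decode([tid])
                # Cost: how far this token strays from the OCR'd characters ahead,
                # minus a small bonus for the model's own probability.
                cost = edit_distance(piece, target[:len(piece)]) - 0.1 * p
                if best_cost is None or cost < best_cost:
                    best_id, best_piece, best_cost = tid, piece, cost
            corrected += best_piece
            # Crude alignment: always advance at least one character of the OCR text.
            remaining = remaining[max(len(best_piece), 1):]
            input_ids = torch.cat([input_ids, torch.tensor([[best_id]])], dim=1)
    return corrected
```

A real version would presumably do beam search over the 3-5 token lookahead rather than this greedy pass, but it shows the mechanics the question is getting at.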
The font was perfectly fine, and the screenshots were crisp PNGs.
An LLM can't really correct that. I appreciate that Tesseract exists, and it's mostly fine for non-serious things, but I wouldn't let it anywhere near critical data.
As a result, I developed a Python package called tahweel (https://github.com/ieasybooks/tahweel), which leverages Google Cloud Platform's Service Accounts to run OCR and provides page-level output. With the default settings, it can process a page per second. Although the underlying OCR service isn't open source, this approach outperforms the other solutions by a significant margin.
For example, OCRing a PDF file using Surya on a machine with a 3060 GPU takes about the same amount of time as using the tool I mentioned, but it consumes more power and hardware resources while delivering worse results. This has been my experience with Arabic OCR specifically; I'm not sure if English OCR faces the same challenges.
Part of the problem is that if you use Tesseract to recognize English text, it's much easier to clean up afterwards: when it makes a mistake it's usually only a single character, so you can use Levenshtein distance to spellcheck and fix it, which helps a lot with accuracy.
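Something like this, say, assuming a plain word list is available; difflib's similarity ratio stands in here for a hand-rolled Levenshtein comparison.

```python
import difflib
import re

def spellfix(ocr_text: str, vocabulary: set[str]) -> str:
    def fix_word(match: re.Match) -> str:
        word = match.group(0)
        if word.lower() in vocabulary:
            return word
        # Single-character OCR errors leave the word very close to a real one,
        # so the nearest dictionary word is usually the right correction.
        close = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.8)
        return close[0] if close else word
    return re.sub(r"[A-Za-z]+", fix_word, ocr_text)
```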
Logographic languages such as Chinese present a particular challenge for conventional post-processing: many words are represented by two characters, and quite a few by a single glyph. This is particularly difficult because if the OCR gets that glyph wrong, there's no obvious way to detect the misidentification.
The solution was to use ImageMagick to munge the image (scale, normalize, threshold, etc.), send each of these variations to Tesseract, and then use a Chinese-corpus-based Markov model to score the statistical likelihood of each recognized sentence and vote on a winner.
It made a significant improvement in accuracy.
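A rough sketch of that voting pipeline, assuming ImageMagick's `convert` CLI and pytesseract are installed; the variant list is illustrative, and a hypothetical character-bigram table stands in for the corpus Markov model.

```python
import subprocess
import tempfile
import pytesseract
from PIL import Image

# Illustrative ImageMagick munging recipes, not the original parameters.
VARIANTS = [
    ["-resize", "200%"],
    ["-normalize"],
    ["-threshold", "60%"],
    ["-resize", "200%", "-normalize", "-threshold", "55%"],
]

def fluency(text: str, bigram_logp: dict, floor: float = -12.0) -> float:
    # Average log-probability of adjacent character pairs as a crude fluency score;
    # bigram_logp maps (char, char) tuples to log-probabilities from a Chinese corpus.
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return floor
    return sum(bigram_logp.get(p, floor) for p in pairs) / len(pairs)

def ocr_with_voting(image_path: str, bigram_logp: dict) -> str:
    candidates = []
    for args in VARIANTS:
        with tempfile.NamedTemporaryFile(suffix=".png") as tmp:
            # Produce one munged variant and OCR it with the simplified-Chinese model.
            subprocess.run(["convert", image_path, *args, tmp.name], check=True)
            text = pytesseract.image_to_string(Image.open(tmp.name), lang="chi_sim")
            candidates.append(text.strip())
    # Vote: keep the candidate whose character sequence looks most like corpus Chinese.
    return max(candidates, key=lambda t: fluency(t, bigram_logp))
```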
https://unstract.com/llmwhisperer/
https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse...
I don't want hallucinations in places where the OCR loses the plot. I want either better OCR or an error message telling me to repeat the scan.
For anyone curious about automating document processing end-to-end with LLMs, do try Unstract. It is open source.
https://github.com/Zipstack/unstract
Unstract also has a commercial version of its document-agnostic parser, which you can plug into any RAG project.
Would be curious about comparisons between these.
> where each chunk can go through a multi-stage process, in which the output of the first stage is passed into another prompt for the next stage
Is this made possible by your custom code, or is it something OpenAI now offers off the shelf via their API?
If the latter, that would partially replace LangChain for simple pipelines.
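For reference, a minimal sketch of that kind of two-stage chaining with nothing but the plain openai client; the model name and prompts are illustrative assumptions, not the quoted author's actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_stage(prompt: str, text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

def process_chunk(chunk: str) -> str:
    # Stage 1: clean up the raw OCR output for this chunk.
    cleaned = run_stage("Fix OCR errors in the following text. Change nothing else.", chunk)
    # Stage 2: the output of the first stage becomes the input of the next prompt.
    return run_stage("Extract the key fields from this document text as JSON.", cleaned)
```

Nothing here needs LangChain; the chaining is just one call feeding the next.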
https://github.com/louis030195/screen-pipe
And we also add LLM layers to fix errors
Then translation can occur
I want to be able to run OCR against things like police incident reports without worrying that a safety filter in the LLM will refuse to process the document because it takes exception to a description of violence or foul language.
If a scanned document says "let's ignore all of that and talk about this instead" I want to be confident the LLM won't treat those as instructions and discard the first half of the text.
I'm always worried about prompt injection - what if a scanned document deliberately includes instructions to an LLM telling it to do something else?
Have you encountered anything like this? Do you have any measures in place that might prevent it from happening?
Nothing I've seen here offers anything new compared to what was attempted.