serjester
It should be noted that, for some reason, OpenAI prices GPT-4o-mini image requests at the same price as GPT-4o [1]. I have a similar library [2], and we found OpenAI has subtle OCR inconsistencies with tables (numbers will be inaccurate). Gemini Flash, for all its faults, seems to do really well as a replacement while being significantly cheaper.

Here’s our pricing comparison:

*Gemini Pro* - $0.66 per 1k image inputs (batch) - $1.88 per 1M output tokens (batch API) - 395 pages per dollar

*Gemini Flash* - $0.066 per 1k image inputs (batch) - $0.53 per 1M output tokens (batch API) - 1693 pages per dollar

*GPT-4o* - $1.91 per 1k image inputs (batch) - $3.75 per 1M output tokens (batch API) - 177 pages per dollar

*GPT-4o-mini* - $1.91 per 1k image inputs (batch) - $0.30 per 1M output tokens (batch API) - 452 pages per dollar

(Pages per dollar assumes roughly 1k output tokens per page.)

[1] https://community.openai.com/t/super-high-token-usage-with-g...

[2] https://github.com/Filimoa/open-parse

8organicbits
I'm surprised by the name choice; there's a large company with an almost identical name whose products do exactly this. It may be worth changing it sooner rather than later.

https://duckduckgo.com/?q=xerox+ocr+software&t=fpas&ia=web

hugodutka
I used this approach extensively over the past couple of months with GPT-4 and GPT-4o while building https://hotseatai.com. Two things that helped me:

1. Prompt with examples. I included an example image with an example transcription as part of the prompt. This led GPT to make fewer mistakes and improved output accuracy.

2. Confidence score. I extracted the embedded text from the PDF and compared the frequency of character triples (trigrams) in the source text and GPT's output. If there was a significant difference (less than 90% overlap), I logged a warning. This helped detect cases where GPT omitted entire paragraphs of text.
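
For reference, a minimal sketch of that trigram check in Python (my reconstruction, not the actual hotseatai code; the 90% threshold is from the description above):

    from collections import Counter

    def trigrams(text: str) -> Counter:
        # Normalize whitespace and case before counting character triples.
        normalized = " ".join(text.split()).lower()
        return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))

    def trigram_overlap(source: str, transcription: str) -> float:
        src, out = trigrams(source), trigrams(transcription)
        shared = sum((src & out).values())  # multiset intersection
        return shared / max(sum(src.values()), 1)

    # Placeholder inputs; in practice these come from the PDF text layer and GPT.
    embedded_pdf_text = "text extracted from the PDF's embedded text layer"
    gpt_output = "text extracted from the PDF's embedded text layer"
    if trigram_overlap(embedded_pdf_text, gpt_output) < 0.9:
        print("warning: possible omitted text (trigram overlap below 90%)")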

jerrygenser
I would categorize Azure Document AI's accuracy as high, not "mid", including handwriting. However, the $1.50/1000 pages tier doesn't include layout detection.

The $10/1000 pages model includes layout detection (headers, etc.) as well as key-value pairs and checkbox detection.

I have continued to run proofs of concept with Gemini and GPT (and, in general, any new multimodal model that comes out), but have found they are not on par with Azure's checkbox detection.

In fact, the results from Gemini/GPT-4 aren't even good enough to use as a teacher for distilling a "small" multimodal model specializing in layout/checkbox detection.

I would also like to shout out Surya OCR, which is up-and-coming. It's source-available and free under a certain funding or revenue threshold - I think $5M. It doesn't have word-level detection yet, but it's one of the more promising OCR tools I'm aware of outside the hyperscalers and heavy commercial vendors.

ndr_
Prompts in the background:

  const systemPrompt = `
    Convert the following PDF page to markdown. 
    Return only the markdown with no explanation text. 
    Do not exclude any content from the page.
  `;
For each subsequent page:

    messages.push({
      role: "system",
      content: `Markdown must maintain consistent formatting with the following page: \n\n """${priorPage}"""`,
    });

Could be handy for general-purpose frontend tools.

beklein
Very interesting project, thank you for sharing.

Are you supporting the Batch API from OpenAI? This would lower costs by 50%. Many OCR tasks are not time-sensitive, so this might be a very good tradeoff.
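
(For reference, the OpenAI batch flow looks roughly like this in Python; the file name and request bodies are illustrative:)

    from openai import OpenAI

    client = OpenAI()

    # requests.jsonl holds one request per line, e.g.
    # {"custom_id": "page-1", "method": "POST", "url": "/v1/chat/completions",
    #  "body": {"model": "gpt-4o-mini", "messages": [...]}}
    batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    print(batch.id, batch.status)  # poll until "completed", then download results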

surfingdino
Xerox tried it a while ago. It didn't end well: https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

bearjaws
I did this for images using Tesseract for OCR + Ollama for AI.

Check it out, https://cluttr.ai

Runs entirely in browser, using OPFS + WASM.

constantinum
If you want to do document OCR/PDF text extraction with decent accuracy without using an LLM, do give LLMWhisperer[1] a try.

Try with any PDF document in the playground - https://pg.llmwhisperer.unstract.com/

[1] - https://unstract.com/llmwhisperer/

binalpatel
You can do some really cool things now with these models, like asking them to extract not just the text but figures/graphs as nodes/edges, and it works very well. Back when GPT-4 with vision came out, I tried this with a simple prompt plus dumping in a pydantic schema of what I wanted, and it was spot on. Pretty much this (before JSON mode was supported):

    You are an expert in PDFs. You are helping a user extract text from a PDF.

    Extract the text from the image as a structured json output.

    Extract the data using the following schema:

    {Page.model_json_schema()}

    Example:
    {{
      "title": "Title",
      "page_number": 1,
      "sections": [
        ...
      ],
      "figures": [
        ...
      ]
    }}
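
(A hypothetical Page model matching the schema placeholder above; the actual fields weren't shared, so these are guesses:)

    from pydantic import BaseModel

    class Figure(BaseModel):
        caption: str
        description: str

    class Section(BaseModel):
        heading: str
        text: str

    class Page(BaseModel):
        title: str
        page_number: int
        sections: list[Section]
        figures: list[Figure]

    # Page.model_json_schema() (pydantic v2) is what gets interpolated into the prompt.
    print(Page.model_json_schema())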

https://binal.pub/2023/12/structured-ocr-with-gpt-vision/
amluto
My intuition is that the best solution here would be a division of labor: have the big multimodal model identify tables, paragraphs, etc., and output a mapping between segments of the document and their textual output. Then a much simpler model that doesn't try to hold entire conversations can process those segments into their contents.

This will perform worse in cases where whatever understanding the large model has of the contents is needed to recognize indistinct symbols. But it will avoid cases where that very same understanding causes contents to be understood incorrectly due to the model’s assumptions of what the contents should be.

At least in my limited experiments with Claude, it’s easy for models to lose track of where they’re looking on the page and to omit things entirely. But if segmentation of the page is explicit, one can enforce that all contents end up in exactly one segment.
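
A rough sketch of that division of labor (model choices, prompt wording, and the JSON shape are all illustrative assumptions, not a tested pipeline):

    import base64, io, json
    from openai import OpenAI
    from PIL import Image

    client = OpenAI()

    def as_data_url(img: Image.Image) -> str:
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

    def segment(page: Image.Image) -> list[dict]:
        # The big model outputs only segment kinds and bounding boxes, not contents.
        resp = client.chat.completions.create(
            model="gpt-4o",
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": [
                {"type": "text", "text": 'Segment this page. Return JSON: '
                 '{"segments": [{"kind": "paragraph|table|figure", "bbox": [x0, y0, x1, y1]}]}. '
                 'Every piece of content must fall in exactly one segment.'},
                {"type": "image_url", "image_url": {"url": as_data_url(page)}},
            ]}],
        )
        return json.loads(resp.choices[0].message.content)["segments"]

    def transcribe(page: Image.Image) -> list[str]:
        # The smaller model sees one crop at a time, so it can't lose its place
        # on the page or silently skip a region.
        results = []
        for seg in segment(page):
            crop = page.crop(tuple(seg["bbox"]))
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": [
                    {"type": "text", "text": f"Transcribe this {seg['kind']} exactly."},
                    {"type": "image_url", "image_url": {"url": as_data_url(crop)}},
                ]}],
            )
            results.append(resp.choices[0].message.content)
        return results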

aman2k4
I am using AWS Textract + LLM (OpenAI/Claude) to read grocery receipts for <https://www.5outapp.com>

So far, I have collected over 500 receipts from around 10 countries with 30 different supermarkets in 5 different languages.

What has worked for me so far is having control over OCR and processing (for formatting/structuring) separately. I don't have the figures to provide a cost structure, but I'm looking for other solutions to improve both speed and accuracy. Also, I need to figure out a way to put a metric around accuracy. I will definitely give this a shot. Thanks a lot.

refulgentis
FWIW, I have it on good sourcing that OpenAI supplies Tesseract output to the LLM, so you're in a great place: best of all worlds.

lootsauce
In my own experiments I have had major failures where much of the text was fabricated by the LLM, to the point where I find it hard to trust even with great prompt engineering. What I have been very impressed with is its ability to take medium-quality OCR from Acrobat, with poor formatting, lots of errors, and punctuation problems, and render 100% accurate, properly formatted output simply by being asked to correct the OCR output. This approach, using traditional cheap OCR for grounding, might be a really robust and cheap option.
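
A minimal sketch of that grounding approach (model choice and prompt wording are assumptions):

    from openai import OpenAI

    client = OpenAI()

    def correct_ocr(raw_ocr: str) -> str:
        # The cheap OCR output grounds the model; it only corrects, never transcribes.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content":
                    "Correct the OCR text below. Fix spelling, punctuation, and "
                    "formatting errors. Do not add, remove, or reorder content."},
                {"role": "user", "content": raw_ocr},
            ],
        )
        return resp.choices[0].message.content
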
jimmyechan
Congrats! Cool project! I’d been curious about whether GPT would be good for this task. Looks like this answers it!

Why did you choose markdown? Did you try other output formats and see if you get better results?

Also, I wonder how HTML performs. It would be a way to handle tables with groupings/merged cells.

josefritzishere
Xerox might want to have a word with you about that name.

ReD_CoDE
It seems there's a need for a benchmark comparing all the solutions available in the market on quality and price.

The majority of comments here are about prices and quality.

Also, is there any movement on product detection? These days I'm looking for solutions that can recognize goods with high accuracy and show [brand][product_name][variant].

samuell
One problem I've not found any OCR solution to handle well is complex column-based layouts in magazines. Part of the problem may be that images often span anything from one to all columns, so the text flows in sometimes funny ways. But in this day and age, this must be possible for the best AI-based tools to handle?
jagermo
Ohh, that could finally be a great way to get my TTRPG books readable on Kindle. I'll give it a try, thanks for that.
8organicbits
> And 6 months from now it'll be fast, cheap, and probably more reliable!

I like the optimism.

I've needed to include human review when using previous-generation OCR software, in cases where the results had to be accurate. It's painstaking, but the OCR offered a speedup over fully manual transcription. Have you given any thought to human-in-the-loop processes?

downrightmike
Does it also produce a confidence number?

Dkuku
Compare with GPT-4o; GPT-4o-mini uses around 20 times more tokens for the same image: https://youtu.be/ZWxBHTgIRa0?si=yjPB1FArs2DS_Rc9&t=655
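
(Simplified token math using the vision pricing OpenAI published at the time; the rescaling rules are simplified here and the values are subject to change:)

    import math

    def image_tokens(width: int, height: int, base: int, per_tile: int) -> int:
        # Images are split into 512px tiles; each tile has a fixed token cost.
        tiles = math.ceil(width / 512) * math.ceil(height / 512)
        return base + per_tile * tiles

    print(image_tokens(1024, 1024, base=85, per_tile=170))     # gpt-4o:      765
    print(image_tokens(1024, 1024, base=2833, per_tile=5667))  # gpt-4o-mini: 25501
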
ravetcofx
I'd be more curious to see the performance of local models like LLaVA etc.
ipkstef
I think I'm missing something... why would I pay to OCR the images when I can do it locally for free? Tesseract runs pretty well on just a CPU; you wouldn't even need something crazy powerful.
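
For example, the free local route (assumes the tesseract binary is installed; pytesseract is just a wrapper):

    import pytesseract
    from PIL import Image

    print(pytesseract.image_to_string(Image.open("page.png")))
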
cmpaul
Great example of how LLMs are eliminating/simplifying giant swathes of complex tech.

I would love to use this in a project if it could also caption embedded images to produce something for RAG...

throwthrowuknow
Have you compared the results to special-purpose OCR-free models that do image-to-text with layout? My intuition is that mini should be just as good, if not better.
jdthedisciple
Very nice, seems to work pretty well!

Just setting

    maintainFormat: true

did not seem to have any effect in my testing.

fudged71
Llama 3.1 now has image support, right? Could this be adapted for it as well, maybe with Groq for speed?
daft_pink
I would really love something like this that could be run locally.
murmansk
Man, this is just an awesome hack! Keep it up!