🚨BREAKING: Someone just open-sourced a tool that converts PDFs to markdown at 100 pages per second. 100% FREE.
Runs entirely on CPU. No expensive GPUs needed.
No cloud.
It's called OpenDataLoader PDF.
Give it any PDF - scanned documents, scientific papers, multi-column reports, complex tables - and it converts everything into clean Markdown, JSON with bounding boxes, or HTML. Ready to feed straight into any AI pipeline.
Not a wrapper around someone else's OCR. Not a basic text extractor. A full document intelligence engine that understands layout, reading order, headings, tables, and formulas.
Here's what this thing can do:
→ Extracts text in the correct reading order across multi-column layouts
→ Pulls complex borderless tables with 0.93 accuracy — highest of any open-source parser
→ Detects heading hierarchy, nested lists, and document structure automatically
→ Runs OCR on scanned PDFs in 80 languages including Chinese, Arabic, Korean, and Japanese
→ Extracts math formulas as LaTeX from scientific papers
→ Gives you bounding boxes for every single element on the page
→ Describes charts and images using a built-in vision model
→ Filters prompt injections and hidden text - built-in AI safety that no other parser has
Here's why every existing tool loses:
They benchmarked it against 200 real-world PDFs including scientific papers and multi-column documents. OpenDataLoader scored 0.90 overall. Docling scored 0.86. Marker scored 0.83 but takes 54 seconds per page. MinerU scored 0.82 at 6 seconds per page.
OpenDataLoader local mode? 0.05 seconds per page. That is over 1,000x faster than Marker at nearly the same accuracy.
Here's the wildest part:
It has two modes. Local mode runs pure Java — 20 pages per second on a basic CPU. Hybrid mode adds an AI backend for complex pages and scores #1 in every category. Run it on an 8-core machine with batch processing and you hit 100 pages per second.
Your documents never leave your machine. Zero API calls. Zero data transmission. 100% local.
It even has a built-in AI safety layer that catches hidden text, transparent fonts, and off-page content that other parsers silently pass through to your LLM.
One command to install:
pip install -U opendataloader-pdf
Works with Python, Node.js, and Java. Official LangChain integration included.
3.3K GitHub stars. 478 commits. 51 releases. 13 contributors. Actively maintained.
100% Open Source. Apache 2.0 License.