Converts PDF documents to Markdown format using DeepSeek-OCR with a FastAPI backend.
This guy gave it 10,000 PDFs to convert to Markdown, averaging less than 1 second per page.
Hardware: 1 x A6000 Ada on a Ryzen 1700 w/ 32 GB RAM.
Dockerized model with FastAPI in a WSL environment.
👨‍🔧 Inside the smart design of DeepSeek OCR
DeepSeek-OCR looks like just another OCR model at first glance, something that reads text from images. But it’s not just that.
What they really built is a new way for AI models to store and handle information.
Normally, when AI reads text, it uses text tokens (the units that LLMs process). Each word or part of a word becomes a token. When text gets long, the number of tokens explodes, and this makes everything slower and more expensive because the model’s computation cost grows roughly with the square of the number of tokens. That’s why even the most advanced models struggle with very long documents.
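To make the "square of the number of tokens" point concrete, here is a rough back-of-the-envelope sketch. It only counts pairwise token interactions in self-attention and ignores everything else a real model does; the function name and the document sizes are made up for illustration.

```python
def attention_cost(num_tokens: int) -> int:
    """Relative self-attention cost: every token attends to every token (~n^2)."""
    return num_tokens * num_tokens


# Hypothetical document sizes, chosen only to show the scaling.
short_doc = attention_cost(1_000)
long_doc = attention_cost(10_000)

# 10x more tokens leads to 100x more pairwise interactions.
ratio = long_doc // short_doc
print(f"10x longer input -> {ratio}x attention cost")
```

So a document that is 10x longer costs roughly 100x more in attention, which is why shrinking the token count matters so much.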
💡DeepSeek’s core idea was simple but revolutionary:
Instead of feeding an LLM thousands of text tokens, it turns long text into an image, encodes that image into a small set of vision tokens, then lets a decoder reconstruct the text.
The team asked a simple question: how many vision tokens are minimally needed to decode N text tokens? They measured it end to end. The paper reports about 97% OCR precision when compressing text by 9–10x, and about 60% precision even at 20x.
This shows that dense visual representations can carry nearly the same information far more efficiently than plain text tokens, at least up to roughly 10x compression.
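As a quick sanity check on what those ratios mean in practice, the sketch below turns a text-token count into a vision-token budget. The precision figures are the approximate ones quoted above; the 5,000-token document and the helper name are hypothetical.

```python
import math

# Approximate figures quoted above: compression ratio -> reported OCR precision.
REPORTED_PRECISION = {10: 0.97, 20: 0.60}


def vision_tokens_needed(text_tokens: int, compression_ratio: float) -> int:
    """Vision tokens required to represent `text_tokens` at a given ratio."""
    return math.ceil(text_tokens / compression_ratio)


text_tokens = 5_000  # hypothetical document length
for ratio, precision in REPORTED_PRECISION.items():
    budget = vision_tokens_needed(text_tokens, ratio)
    print(f"{ratio}x: {budget} vision tokens, ~{precision:.0%} precision")
```

At 10x, the 5,000 text tokens fit in 500 vision tokens; at 20x they fit in 250, but with a large drop in precision.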
The engineering that makes this practical is a new encoder called DeepEncoder. It processes high-resolution pages without blowing up memory by running local window attention first, then a 16x convolutional downsampler, then global attention.
That serial design keeps activations small while aggressively cutting token count.
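A small arithmetic sketch shows why the ordering helps. The 16x downsampling factor comes from the description above; the 1024x1024 page size and the 16-pixel patch size are assumptions picked for illustration, not values stated here.

```python
def patch_tokens(height: int, width: int, patch_size: int) -> int:
    """Number of patch tokens a page image is split into."""
    return (height // patch_size) * (width // patch_size)


def after_downsample(tokens: int, factor: int = 16) -> int:
    """Token count after the 16x convolutional downsampler."""
    return tokens // factor


# Assumed input: a 1024x1024 page cut into 16x16-pixel patches.
local_tokens = patch_tokens(1024, 1024, 16)     # seen by local window attention
global_tokens = after_downsample(local_tokens)  # seen by global attention

# Global attention is quadratic, so running it after the downsampler
# cuts the number of token pairs by 16^2.
pair_savings = (local_tokens ** 2) // (global_tokens ** 2)
print(local_tokens, global_tokens, pair_savings)
```

Under these assumptions the expensive global attention runs on 256 tokens instead of 4,096, which means 256x fewer token pairs. The cheap windowed attention handles the large token count first.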
So why is this a big deal?
Context is the currency of LLMs, and it is expensive.
If visual tokens can represent past dialogue, documents, or code at 10x smaller size with high fidelity, you can keep far more context active, cut costs, and speed up inference.
The paper also sketches a practical “forgetting” mechanism: you can progressively downscale older context images so recent information stays sharp while older context becomes cheaper over time, which matches how human memory fades.
This makes long-running assistants, RAG replacements, and whole-codebase-in-context workflows much more realistic.
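The "forgetting" idea can be sketched as a token budget that shrinks with age. This is only an illustration, not the paper's actual schedule: the 256-token starting budget, the rule of halving image width and height per step (4x fewer tokens), and the 16-token floor are all assumptions.

```python
def tokens_for_age(base_tokens: int, age: int, shrink: int = 4, floor: int = 16) -> int:
    """Vision-token budget for a context image that is `age` steps old.

    Halving both width and height each step cuts the token count by 4x
    (the assumed `shrink`); `floor` keeps very old context from vanishing.
    """
    return max(floor, base_tokens // (shrink ** age))


# Four context images, newest (age 0) to oldest (age 3).
budgets = [tokens_for_age(256, age) for age in range(4)]
decayed_total = sum(budgets)
uniform_total = 256 * 4  # cost if every image stayed at full resolution

print(budgets, decayed_total, uniform_total)
```

Here the newest image keeps its full 256 tokens while the older ones shrink, so four images cost 352 tokens instead of 1,024.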