Researchers built a RAG engine that:
- doesn't hallucinate sources
- understands document structure before chunking
- traces every answer to exact page and section
- syncs directly from Confluence, Notion, Google Drive, S3, Discord
- understands tables, scanned documents, and images inside PDFs
And it became the #1 open source project on GitHub in 2025.
Here's the core problem it solves:
Ask a typical RAG system about a liability buried in footnote 34 of a 200-page SEC filing. It returns whatever chunk looks most similar to your question. The actual answer, cross-referenced in an appendix, dependent on context from three sections earlier - never surfaces. Your LLM confidently hallucinates something that sounds right.
The problem isn't the model. It's everything that happened before the model saw anything.
RAGFlow was built around one principle: quality in, quality out.
They built DeepDoc, their own document understanding engine with OCR, table recognition, and layout analysis, because a scanned invoice, a 200-page SEC filing, and a table buried in slide 47 of a PowerPoint are not the same as plain text. Documents are understood before they are chunked. Structure is respected. Every answer traces back to exact page and section. You can see exactly why a specific answer was returned.
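To make "understood before they are chunked" concrete, here is a minimal sketch of structure-aware chunking that keeps page and section provenance. It is not DeepDoc's actual code: the `Block` and `Chunk` types and `chunk_by_section` are hypothetical names, and it assumes a parser has already labeled each layout element with its page and kind.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One layout element, as a hypothetical parser might emit it."""
    page: int
    kind: str  # "heading", "paragraph", or "table"
    text: str


@dataclass
class Chunk:
    section: str            # heading the chunk belongs to
    pages: tuple[int, int]  # first and last page covered
    text: str


def chunk_by_section(blocks: list[Block]) -> list[Chunk]:
    """Group blocks under the most recent heading instead of cutting by length."""
    chunks: list[Chunk] = []
    section = "(front matter)"
    buffer: list[Block] = []

    def flush() -> None:
        # Emit the buffered blocks as one chunk tagged with the current section.
        if buffer:
            text = "\n".join(b.text for b in buffer)
            chunks.append(Chunk(section, (buffer[0].page, buffer[-1].page), text))
            buffer.clear()

    for block in blocks:
        if block.kind == "heading":
            flush()               # close the previous section first
            section = block.text  # then start the new one
        else:
            buffer.append(block)
    flush()
    return chunks
```

Because every chunk carries its section title and page range, an answer built from it can cite where it came from, which is the traceability idea described above.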
What this makes possible:
→ Knowledge bases from Confluence, S3, Notion, Discord, Google Drive
→ Agentic workflows with persistent memory across sessions
→ Multi-modal understanding of images inside PDFs and DOCX files
→ MCP integration for production agent pipelines
→ Template-based chunking with human intervention support
76.9K stars. 527 contributors. 2.5M Docker pulls.
🚀 RAGFlow × PaddleOCR-VL-1.5 — a powerful new integration for document RAG
PaddleOCR-VL-1.5 is now integrated into RAGFlow’s DeepDoc Parser, bringing stronger document understanding to the very first step of the RAG pipeline.
Why it stands out
🔹 Better parsing for scans, photos, distortion, and complex layouts
🔹 Polygon-level localization for more precise element detection
🔹 Cross-page table merging and heading continuity for long documents
🔹 Visual citation grounding for more traceable and trustworthy retrieval
From messy PDFs to structured, citation-ready knowledge — now built directly into RAGFlow.
Learn more
👉PaddleOCR-VL-1.5:
github.com/PaddlePaddle/Padd…
👉RAGFlow:
github.com/infiniflow/ragflo…
👉Quick start:
ragflow.io
#RAGFlow #PaddleOCR #RAG #DocumentAI
deployed deepdoc contract address is 0x1Ea753c628CAf12C24529977BA65aba0564f6bA3
view token: app.doppler.lol/tokens/base/…
note: due to platform restrictions on X, I couldn't transfer the fee beneficiary share to @tom_doerr (0xaea333d4f6f750985e931e1b74865794b3205672). the creator share (57%) is currently assigned to your wallet.
Hi people of @X,
I’m a software engineer diving into backend, ML/DL, and LLMOps.
Currently building DeepDoc-Ai, an AI-powered document chat, analysis, and comparison platform built on advanced RAG.
If you’re on a similar path or want to follow the journey, let’s #connect.
Just wrapped up the retrieval & generation pipeline for the single-doc chat module in DeepDoc.
Also shipped the multi-doc chat module with ingestion, retriever & generation pipelines.
Tomorrow: diving deeper into reranking techniques and implementing them.
#buildinpublic #rag
Had to park DeepDoc development for a while due to some tight deadlines on shipping a new feature at work.
Since yesterday, I’ve been spending some time building the document comparison module and the single-document chat module for DeepDoc.
#buildinpublic
After chunk analysis, DeepDoc:
•Consolidates multiple summaries into a single narrative
•Picks the best title (heuristics: length, capitalization, relevance)
•Chooses valid authors/dates
•Detects language
•Merges page counts intelligently
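A rough sketch of what such a consolidation step could look like. Everything here is an assumption for illustration: the per-chunk dict keys, the `title_score` weights, taking the maximum as the "merged" page count, and joining summaries with a space (a real pipeline might ask an LLM to rewrite them into one narrative). The "relevance" heuristic is left out because the post does not say how it is measured.

```python
from collections import Counter


def title_score(title: str) -> float:
    """Heuristic score: prefer medium-length, capitalized candidates."""
    words = title.split()
    if not words:
        return 0.0
    length_ok = 1.0 if 3 <= len(words) <= 15 else 0.0
    capitalized = sum(w[0].isupper() for w in words) / len(words)
    return length_ok + capitalized


def consolidate(chunk_results: list[dict]) -> dict:
    """Merge per-chunk metadata dicts into one document-level record."""
    titles = [r["title"] for r in chunk_results if r.get("title")]
    languages = Counter(r["language"] for r in chunk_results if r.get("language"))

    # Keep non-empty authors, de-duplicated, in first-seen order.
    authors: list[str] = []
    for r in chunk_results:
        for a in r.get("authors", []):
            if a.strip() and a not in authors:
                authors.append(a)

    return {
        "title": max(titles, key=title_score) if titles else None,
        "authors": authors,
        # Majority vote across chunks.
        "language": languages.most_common(1)[0][0] if languages else None,
        # Assumption: chunks report the whole document's page count, so take the max.
        "page_count": max((r.get("page_count", 0) for r in chunk_results), default=0),
        # Placeholder for "single narrative": plain concatenation.
        "summary": " ".join(r["summary"] for r in chunk_results if r.get("summary")),
    }
```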
New strategy: dynamic document handling
Instead of feeding in the whole doc all at once, DeepDoc now:
- Splits the text into chunks
- Runs metadata extraction per chunk
- Consolidates the results into a clean JSON schema
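The three steps above could be sketched roughly like this. It is a toy version under stated assumptions, not the project's code: `split_text`, `extract_metadata`, and `process` are hypothetical names, the extractor is a regex stand-in for what would likely be an LLM call, and the output fields are made up for the example.

```python
import json
import re


def split_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character windows with a small overlap between neighbours."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def extract_metadata(chunk: str) -> dict:
    """Stand-in extractor: finds ISO dates only; a real one would do much more."""
    dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", chunk)
    return {"dates": dates, "chars": len(chunk)}


def process(text: str, size: int = 500, overlap: int = 50) -> str:
    """Split, extract per chunk, then merge into one JSON document."""
    per_chunk = [extract_metadata(c) for c in split_text(text, size, overlap)]
    merged = {
        "chunk_count": len(per_chunk),
        # A set removes duplicates that the overlap can produce.
        "dates": sorted({d for r in per_chunk for d in r["dates"]}),
    }
    return json.dumps(merged, indent=2)
```

The overlap is there so that a value sitting near a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.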
Day 6 of #100DaysOfCode
- Solved a medium LeetCode problem: Valid Sudoku
- Explored #Structlog: processor chaining, JSON logging, console & file outputs
Built a custom logger for DeepDoc
#Python #DevLogs
Update on DeepDoc:
Decided upon the features:
- document analysis
- document comparison
- multidoc chat
- single doc chat
Implemented data ingestion and document comparator pipelines
Also did a quick refresher on NumPy operations and broadcasting
#llmops
Day 2 and 3 of #100DaysOfCode
Solved a medium @neetcode1 problem on group anagrams
Starting to learn LLMOps with a project-based approach by building DeepDoc: an AI-powered document chat, comparison, and analysis platform
#rebootprotocol #llmops #LearnInPublic
RAGFlow 0.18 is released, highlights:
-Support MCP server.
-DeepDoc supports adopting a VLM as a processing pipeline.
-Support agent version control.
-Agents can be shared with team members.
-Enhanced conversation experiences.
More features here👉
github.com/infiniflow/ragflo…
RAGFlow 0.15 is released! Highlights:
-Upgrades doc layout analysis in DeepDoc
-Supports step run for Agent
-Supports resuming GraphRAG/RAPTOR from a failure, enhancing task management resilience
-Importing/Exporting agents in JSON
More features here👉:
github.com/infiniflow/ragflo…