LLM data prep is where a lot of model work quietly gets stuck.
DataFlow is an open-source data preparation and training system for generating, refining, evaluating, and filtering AI/LLM data from noisy sources like PDFs, plain text, and low-quality QA.
It helps you build reusable data workflows by turning cleaning and synthesis steps into operator-based pipelines you can reproduce, share, and extend.
Key features:
• Ready-to-use pipelines – cover text, reasoning, Text2SQL, knowledge-base cleaning, and Agentic RAG workflows
• Operator-based design – package generation, evaluation, filtering, and refinement steps into reusable pipeline components
• Custom operator support – create plug-and-play operators and distribute them through GitHub or PyPI
• WebUI option – run the dataflow webui command to build and execute pipelines through a visual interface
• Practical setup paths – install from PyPI with uv, use Colab, or run via Docker with GPU support
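To make the operator idea concrete, here is a minimal sketch in plain Python. It is not DataFlow's actual API: the Operator and Pipeline classes and the three step functions are hypothetical names made up for illustration. It only shows the general pattern of chaining refinement, evaluation, and filtering steps as reusable components.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]  # one data sample, e.g. {"text": "..."}


@dataclass
class Operator:
    """Hypothetical wrapper: a named step that maps records to records."""
    name: str
    fn: Callable[[List[Record]], List[Record]]

    def run(self, records: List[Record]) -> List[Record]:
        return self.fn(records)


def refine_whitespace(records: List[Record]) -> List[Record]:
    # Refinement step: collapse runs of whitespace in each text field.
    return [{**r, "text": " ".join(str(r["text"]).split())} for r in records]


def score_length(records: List[Record]) -> List[Record]:
    # Evaluation step: attach a toy quality score (word count).
    return [{**r, "score": len(str(r["text"]).split())} for r in records]


def filter_short(records: List[Record], min_score: int = 3) -> List[Record]:
    # Filtering step: drop records whose score is below the threshold.
    return [r for r in records if r["score"] >= min_score]


class Pipeline:
    """Hypothetical pipeline: runs operators in order on the same records."""

    def __init__(self, operators: List[Operator]) -> None:
        self.operators = list(operators)

    def run(self, records: List[Record]) -> List[Record]:
        for op in self.operators:
            records = op.run(records)
        return records


pipeline = Pipeline([
    Operator("refine", refine_whitespace),
    Operator("evaluate", score_length),
    Operator("filter", filter_short),
])

raw = [{"text": "  What   is  a  data pipeline? "}, {"text": "ok"}]
cleaned = pipeline.run(raw)
print(cleaned)
```

Because each step has the same records-in, records-out shape, steps can be reordered, swapped, or shared without touching the rest of the pipeline.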
It’s open source under the Apache License 2.0.
Link in the reply 👇