🚀 MASSIVE upgrades for the UltraData data stack! The tiered data management (L0-L4) framework has now been fully battle-tested on MiniCPM5-1B and is ready for your models! No gatekeeping, just pure data power.
What’s NEW in our latest release:👇
✅ Ultra-FineWeb-L3: 600B tokens (200B Chinese, 400B English) of high-density synthetic pre-training data, expanded from Ultra-FineWeb via multi-style rewriting & QA generation and used in MiniCPM5-1B's decay stage.
🤗 huggingface.co/datasets/open…
✅ UltraData-SFT-2605: 15M post-training samples across math, code, knowledge & instruction following, in both deep-thinking and non-thinking training styles, used in MiniCPM5-1B's SFT stage.
🤗 huggingface.co/datasets/open…