MASSIVE upgrades for the UltraData data stack! The tiered data management (L0-L4) framework has now been fully battle-tested on MiniCPM5-1B and is ready for your models! No gatekeeping, just pure data power.
What's NEW in our latest release:
Ultra-FineWeb-L3: 600B tokens (200B Chinese, 400B English) of high-density synthetic pre-training data, expanded from Ultra-FineWeb via multi-style rewriting and QA generation, and used in MiniCPM5-1B's decay stage.
🤗 huggingface.co/datasets/open…
UltraData-SFT-2605: 15M post-training samples across math, code, knowledge, and instruction following, with deep-thinking and non-thinking training styles, used in MiniCPM5-1B's SFT stage.
🤗 huggingface.co/datasets/open…