Science relies on shared artifacts collected for the common good.
So we asked: what's missing in open language modeling?
DataDecide charts the cosmos of pretraining, across scales and corpora, at a resolution beyond any public suite of models that has come before.
Ever wonder how LLM developers choose their pretraining data? It's not guesswork: all AI labs create small-scale models as experiments, but the models and their data are rarely shared.
DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵