Joined September 2015
Pinned Tweet
May 7
Today we're bringing new NSF OMAI compute online with NVIDIA Blackwell Ultra-powered systems, turning a $152M national investment from @NSF & @NVIDIA into a foundation for truly open AI research. 🧵
Jun 12
Building an LLM means evaluating it over & over as it changes. Tweak a hyperparameter or scale the model up, & every new checkpoint sends you back through the same benchmarking loop. We're releasing olmo-eval, a workbench built for this kind of iterative model development. 🧵
Jun 12
After training a model with a new intervention, olmo-eval lets you line two model checkpoints up question by question, holding everything else fixed. The comparison view makes it easier to see real gains & regressions.
Jun 12
If you find yourself asking "how does this model checkpoint differ from the last, and where did it improve/regress?", that's what olmo-eval is for. We're releasing it openly so the community can build on it. 💻 Code: github.com/allenai/olmo-eval 📝 Blog: allenai.org/blog/olmo-eval
Ai2 retweeted
So impactful! Excellent work from @sewon__min et al! It also says a lot about the "novelty" of a generation and lets you trace its history: clearly critical for scientific discovery with these models!
Jun 11
LLMs are no longer created w/ human data alone. They rely on other models to generate & filter data, evaluate outputs, & guide dev work. So what is a modern LLM built on? Olmo 3 → 89 model and 183 dataset dependencies; Nemotron 3 → 273 and 560. We made ModSleuth to trace this. 🧵
Ai2 retweeted
One day I tried tracing all of Olmo's dependencies manually. A few hours later, I realized I couldn't do it and gave up. Then @sadhikesaven and @CoderBak built ModSleuth 🔥 Turns out Olmo and Nemotron have hundreds of dependencies that are super deep, recursive, and not easily visible. I'm glad I gave up early 😅 Spoiler: I thought this would be a one-week Claude Code project. It was not. The hard part wasn't information extraction (which Claude Code is good at). The hard part was something much trickier. Check out the paper to learn more! (And yes, if a model release says it used Claude Code, ModSleuth will trace that too... which means the model depends on Claude Code, which has its own dependencies, and ModSleuth itself depends on Claude Code 🤯)
Jun 11
ModSleuth generates a graph that surfaces what's nearly impossible to find manually, including: 📜 Hidden license inheritance 🔗 Train/eval coupling 📝 Documentation inconsistencies 🤖 Models used as judges, filters, OCR systems, & data generators
Jun 11
As LLM pipelines become more complex, we need tools like ModSleuth to identify what artifacts models are built on. ▶️ Demo: modsleuth.cal-data-audit.org 📄 Paper: arxiv.org/abs/2606.12385

Ai2 retweeted
๐—”๐—–๐—˜๐Ÿฎ๐—ฆ-๐—ฆ๐—›๐—ถ๐—˜๐—Ÿ๐—— , our new climate emulator that learns to separate the effects of sea surface temperature & CO2, is now on @huggingfaceโ€”check it out โ†’ huggingface.co/allenai/ACE2Sโ€ฆ
Jun 9
Today we're introducing ACE2S-SHiELD, a climate emulator that learns to separate the effects of sea surface temperature & CO2. It accurately handles scenarios where previous versions of our ACE family of climate emulators produced inaccurate results. 🧵
Ai2 retweeted
I'm hiring Senior Research Engineers. Come build open-source vision-language models from zero to hero: pretraining, mid-training, post-training, RL, the whole pipeline. job-boards.greenhouse.io/the…
Jun 9
Trained on the new & existing data, ACE2S-SHiELD accurately handles the scenarios earlier ACE models were good at as well as the ones they struggled with. It's more flexible than ACE2-SHiELD and ACE2-SOM combined, using ~25% fewer training samples than either alone.