language models, mts @MicrosoftAI, ex co-lead of Olmo @allen_ai, @uwcse, he/him, kylelo.bsky.social🧋

Joined January 2019
Pinned Tweet
17 Dec 2025
olmo 3 paper finally on arxiv 🫡 thx to our teammates, esp folks who chased additional baselines. thx to arxiv-latex-cleaner and overleaf feature for chasing latex bugs. thx for all the helpful discussions after our Nov release, best part of open science is progressing together!
Kyle Lo retweeted
microsoft MAI tech report is a gold mine, one of the most transparent for a model at this scale. this model uses zero synthetic data or distillation from previous models. this means reasoning, agentic behavior, tool use are all learned fully during post-training with no cold start. bold choice that makes it harder and requires more iterations to reach sota, but you get FULL control over your model series and it proves they are serious about being a frontier lab. the tech report is insanely detailed and precise about numbers. to give an example, they give the exact MFU across all the iterations of the model, with the exact changes etc. they also share the full scaling ladder recipe; to my knowledge this is the first time i've seen this in a tech report at this scale. let's look at all of this in this likely very long thread 🧵
Super excited to announce seven new world-class MAI models today. They represent what we consider a new era in AI designed to keep you in control and on the frontier.
First is our text foundation model, MAI-Thinking-1, exceptionally strong on reasoning and SWE tasks.
- It’s a 35B active parameter MoE with a 256K context window. Independent human raters on Surge prefer it for overall quality in blind side-by-sides versus Sonnet 4.6, and it’s achieved 97% on AIME 2025, the key measure of its general-purpose reasoning abilities.
- It's at 53% on SWE Bench Pro, placing it right alongside Opus 4.6 on one of the toughest coding benchmarks.
- And since we co-designed our models with our own silicon, MAI-Thinking-1 is optimized on our MAIA 200 chip. Benchmarking head-to-head against the GB200, we see 30% better performance per dollar as well as a 1.4x performance-per-watt gain when running our MAI models on the MAIA 200 end-to-end.
Next is MAI-Image-2.5 and its Flash variant. Two super strong models now at #2 on the leaderboards, surpassing the score of Nano Banana 2 on image editing.
Last for now is MAI-Code-1-Flash, our new inference efficient coding model, especially tuned for VS Code and GitHub Copilot CLI.
- Code-1-Flash achieves 51% on SWE Bench Pro, despite having just 5B parameters, putting it closer to Haiku in size but cheaper in cost.
All of this is the foundation for Microsoft Frontier Tuning. It lets you customize our models to create custom, company-specific agents that only you control. You can make our model, your model. Your data. Your agents. Your moat.
Early adopters are already seeing a difference. When we tuned our models for McKinsey’s tasks, MAI delivered the highest win rate, outperforming GPT-5.5 on quality, while being 10x lower on cost. Also really excited to be collaborating with the amazing team at Mayo Clinic to jointly train a new frontier AI model for healthcare.
Our announcements today mark another milestone on the road to humanist superintelligence. You can learn more and about our other new models in our latest blog: microsoft.ai/news/building-a…
happy to share another quality tech report w/ the wider research community 🫶 great read for ppl who want to see all the details for methods infra for scaling up pretraining & RL, esp detailed discussion about data which is often kept vague by other labs
Full tech report here: microsoft.ai/wp-content/uplo…

Kyle Lo retweeted
The penalty is a 1-year ban from arXiv followed by the requirement that subsequent arXiv submissions must first be accepted at a reputable peer-reviewed venue. 4/
community too susceptible to ragebait. if some rando said in person “im gonna vibe a neurips paper in 3 days,” normal reaction wouldn’t be to seriously debate this person on research ethics/quality, it’d be to ignore 🤷🏻‍♂️
Kyle Lo retweeted
How2Everything will appear in ICML 2026! See you in Korea 🫡 We mine the web's procedural knowledge to better evaluate & train LLMs to generate valid step-by-step instructions, read more at: 🔗 arxiv.org/pdf/2602.08808

Feb 10
LLMs often generate step-by-step instructions, from real-world tasks (how do I file taxes?) to plans for AI agents. Improving this is hard: outputs can sound fluent for steps that don't work, and current datasets cover few domains. How2Everything evals/trains for this at scale. 🧵
during Olmo 3 we thought long context is just finding good data. nope! model architecture matters & it's hard to recover if you mess it up. led by @abertsch72, we release many pretrain runs w/ small arch changes and show huge long context performance diffs
Apr 30
Recipes for teaching language models to handle long inputs don't work equally well across model families. We wanted to know why—is it the architecture, the training data, or both? 🧵
more deets in @abertsch72 's thread: x.com/abertsch72/status/2049… download all the models from HF: huggingface.co/collections/a…
New paper! allenai.org/papers/olmpool This tackles a puzzle we found during the training of Olmo 3: how could two models with nearly identical short-context performance (and trained on the same data!) behave completely differently after long context extension?
Kyle Lo retweeted
I'm at ICLR presenting Olmix (oral) at the Data-FM workshop this Sunday, April 26 @ 10:30AM! DM me to chat about anything related to data and the model development process / try to find the best açaí pão de queijo with me 😋
Data mixing - determining ratios across your training datasets - matters a lot for model quality. While building Olmo 3, we learned it’s hard to set up a method that finds a strong mix, and hard to maintain that mix as datasets change throughout development. Introducing Olmix👇
my car doesn’t have self driving, but maybe openclaw w vlms and dashcam can do it 🤷🏻‍♂️
Kyle Lo retweeted
The Orange Cat Brain Atlas is here. 🧠🐈 Today, we published the first comprehensive cellular map of the orange cat brain. The new atlas reveals a single, specialized neuron responsible for behaviors like staring at walls, knocking objects off tables, and the 3am "zoomies."
Today I'm saying farewell to @allen_ai. I'm so proud of our team & grateful to have shared fully-open Olmo, Dolma, olmOCR, Molmo, etc with the world. I know the team is more committed than ever to advancing open-source & open-science. Forever rooting for my dear friends 🫶
It’s been an amazing time building Olmo w Hanna’s leadership and 120% positive vibes 🫶🏻
Life update here: Last week marked the end of my time at Ai2. Proud to have built releases like Olmo, Tülu, FlexOlmo, DRTulu, OLMoTrace, OlmoE, and datasets including Dolma and Dolci, and of how strongly we pushed for open models and open science. Our artifacts reached 33M downloads, including ~4M for Olmo 3. I believe Olmo has empowered researchers to push the boundaries of AI. I’ll always be cheering on Ai2 and will continue to strongly support open-source, open-science AI. I’m deeply grateful for this chapter and excited for what comes next.
lol best team vibes makes least depressed model
my contribution to olmo:
our new Olmo Hybrid model combines attention with linear RNN layers 🍣 training efficiency is crazy good. the model reaches same MMLU score as Olmo 3 in 50% of the tokens. also see this in many other tasks. as always: weights, data, ckpts, training code, etc. all fully open
Mar 5
Introducing Olmo Hybrid, a 7B fully open model combining transformer and linear RNN layers. It decisively outperforms Olmo 3 7B across evals, w/ new theory & scaling experiments explaining why. 🧵
someone's openclaw agent is spam emailing our team w generated questions about olmo, pls stop 🙄
DrawEduMath is our benchmark testing VLM understanding of K-12 student math work, which is a prerequisite for their use in educational contexts. one year after, while VLMs are strong math solvers today, they still underperform on our bench, esp for students who need the most help
Models are now expert math solvers, and so AI for math education is receiving increasing attention. Our new preprint evaluates 11 VLMs on our QA benchmark, DrawEduMath. We highlight a startling gap: models perform less well on inputs from K-12 students who need more help. 🧵
our paper on data mixing for LMs is out! while building Olmo 3, we saw gaps between data mixing literature and real practice. 🐠 choosing proxy size, # runs, sampling, regression, constraints... 🐟 data shifts during LM dev: can we reuse past experiments? Olmix tackles them all!
Feb 13
Data mixing – determining how much web text, code, math, etc., you need for LM development – is a first-order lever on model quality. Introducing Olmix: a framework for configuring mixing methods at the start of dev & efficiently updating as data changes throughout. 🧵
one of my favorite topics is dealing with data constraints! what if your proposed mix is 30% code but you don't have enough code? we can repeat our data until we hit target proportions, but too much is risky. we view data mixing as (data) constrained optimization
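A rough sketch of the arithmetic behind this constraint, not the Olmix method itself: given a proposed mix, a total token budget, and the number of unique tokens available per source, it computes how many times each source would have to be repeated to hit its target share. The function name `repetition_report`, the token counts, and the `max_repeats=4.0` cap are illustrative assumptions, not values from the paper.

```python
from typing import Dict


def repetition_report(
    target_mix: Dict[str, float],
    available_tokens: Dict[str, float],
    budget_tokens: float,
    max_repeats: float = 4.0,  # hypothetical cap on repetition, chosen for illustration
) -> Dict[str, Dict[str, object]]:
    """For each source, compute tokens needed and the implied repetition factor."""
    if abs(sum(target_mix.values()) - 1.0) > 1e-6:
        raise ValueError("target_mix proportions must sum to 1")

    report: Dict[str, Dict[str, object]] = {}
    for source, share in target_mix.items():
        needed = share * budget_tokens                # tokens this source must supply
        repeats = needed / available_tokens[source]   # passes over the unique data
        report[source] = {
            "needed_tokens": needed,
            "repeats": repeats,
            "within_cap": repeats <= max_repeats,     # False means the mix is data-constrained
        }
    return report


if __name__ == "__main__":
    # Made-up numbers (in billions of tokens): a 30% code share with little code available.
    mix = {"web": 0.6, "code": 0.3, "math": 0.1}
    available = {"web": 2000.0, "code": 50.0, "math": 100.0}
    for source, row in repetition_report(mix, available, budget_tokens=1000.0).items():
        print(source, row)
```

A source whose `within_cap` is false is where the proposed mix collides with the data you actually have; a constrained optimizer would lower that share instead of repeating past the cap.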
this work was led by our intern @MayeeChen and was one of the new ideas we adopted into Olmo 3! her thread: x.com/MayeeChen/status/20223… arxiv paper: arxiv.org/abs/2602.12237 blog post: allenai.org/blog/olmix
