Kyle Lo

Pinned Tweet

Kyle Lo

@kylelostat

17 Dec 2025

olmo 3 paper finally on arxiv 🫡
thx to our teammates esp folks who chased additional baselines
thx to arxiv-latex-cleaner and overleaf feature for chasing latex bugs
thx for all the helpful discussions after our Nov release, best part of open science is progressing together!

Kyle Lo retweeted

elie

@eliebakouch

Jun 3

microsoft MAI tech report is a gold mine, one of the most transparent for a model at this scale.
this model uses zero synthetic data or distillation from previous models. this means reasoning, agentic behavior, tool use are all learned fully during post-training with no cold start. bold choice that makes it harder and requires more iterations to reach sota, but you get FULL control over your model series and it proves they are serious about being a frontier lab.
the tech report is insanely detailed and precise about numbers. to give an example, they give the exact MFU across all the iterations of the model, with the exact changes etc. they also share the full scaling ladder recipe, to my knowledge this is the first time i've seen this in a tech report at this scale
let's look at all of this in this likely very long thread 🧵

Mustafa Suleyman

@mustafasuleyman

Jun 2

Super excited to announce seven new world-class MAI models today. They represent what we consider a new era in AI designed to keep you in control and on the frontier.
First is our text foundation model, MAI-Thinking-1, exceptionally strong on reasoning and SWE tasks.
- It’s a 35B active parameter MoE with a 256K context window. Independent human raters on Surge prefer it for overall quality in blind side-by-sides versus Sonnet 4.6, and it’s achieved 97% on AIME 2025, the key measure of its general-purpose reasoning abilities.
- It's at 53% on SWE Bench Pro, placing it right alongside Opus 4.6 on one of the toughest coding benchmarks.
- And since we co-designed our models with our own silicon, MAI-Thinking-1 is optimized on our MAIA 200 chip. Benchmarking head-to-head against the GB200, we see 30% better performance per dollar as well as a 1.4x performance-per-watt gain when running our MAI models on the MAIA 200 end-to-end.
Next is MAI-Image-2.5 and its Flash variant. Two super strong models now at #2 on the leaderboards, surpassing the score of Nano Banana 2 on image editing.
Last for now is MAI-Code-1-Flash, our new inference efficient coding model, especially tuned for VS Code and GitHub Copilot CLI.
- Code-1-Flash achieves 51% on SWE Bench Pro, despite having just 5B parameters, putting it closer to Haiku in size but cheaper in cost.
All of this is the foundation for Microsoft Frontier Tuning. It lets you customize our models to create custom, company-specific agents that only you control. You can make our model, your model. Your data. Your agents. Your moat.
Early adopters are already seeing a difference. When we tuned our models for McKinsey’s tasks, MAI delivered the highest win rate, outperforming GPT-5.5 on quality, while being 10x lower on cost.
Also really excited to be collaborating with the amazing team at Mayo Clinic to jointly train a new frontier AI model for healthcare.
Our announcements today mark another milestone on the road to humanist superintelligence. You can learn more and about our other new models in our latest blog: microsoft.ai/news/building-a…

Kyle Lo

@kylelostat

Jun 2

happy to share another quality tech report w/ the wider research community 🫶 great read for ppl who want to see all the details for methods infra for scaling up pretraining & RL, esp detailed discussion about data which is often kept vague by other labs

Mustafa Suleyman

@mustafasuleyman

Jun 2

Kyle Lo

@kylelostat

Jun 2

Full tech report here: microsoft.ai/wp-content/uplo…

Kyle Lo retweeted

Thomas G. Dietterich @tdietterich

May 14

The penalty is a 1-year ban from arXiv followed by the requirement that subsequent arXiv submissions must first be accepted at a reputable peer-reviewed venue. 4/

Kyle Lo

@kylelostat

May 4

community too susceptible to ragebait. if some rando said in person “im gonna vibe a neurips paper in 3 days,” normal reaction wouldn’t be to seriously debate this person on research ethics/quality, it’d be to ignore 🤷🏻‍♂️

Kyle Lo retweeted

Yapei Chang

@YapeiChang

Apr 30

How2Everything will appear in ICML 2026! See you in Korea 🫡 We mine the web's procedural knowledge to better evaluate & train LLMs to generate valid step-by-step instructions, read more at: 🔗 arxiv.org/pdf/2602.08808

Ai2

@allen_ai

Feb 10

LLMs often generate step-by-step instructions, from real-world tasks (how do I file taxes?) to plans for AI agents. Improving this is hard: outputs can sound fluent for steps that don't work, and current datasets cover few domains. How2Everything evals/trains for this at scale. 🧵

Kyle Lo

@kylelostat

Apr 30

during Olmo 3 we thought long context is just finding good data
nope! model architecture matters & it's hard to recover if you mess it up
led by @abertsch72, we release many pretrain runs w/ small arch changes and show huge long context performance diffs

Ai2

@allen_ai

Apr 30

Recipes for teaching language models to handle long inputs don't work equally well across model families. We wanted to know why—is it the architecture, the training data, or both? 🧵

Kyle Lo

@kylelostat

Apr 30

more deets in @abertsch72's thread: x.com/abertsch72/status/2049…
download all the models from HF: huggingface.co/collections/a…

Amanda Bertsch @abertsch72

Apr 30

New paper! allenai.org/papers/olmpool This tackles a puzzle we found during the training of Olmo 3: how could two models with nearly identical short-context performance (and trained on the same data!) behave completely differently after long context extension?

Kyle Lo retweeted

Mayee Chen

@MayeeChen

Apr 22

I'm at ICLR presenting Olmix (oral) at the Data-FM workshop this Sunday, April 26 @ 10:30AM! DM me to chat about anything related to data and the model development process / try to find the best açaí pão de queijo with me 😋

Mayee Chen

@MayeeChen

Feb 13

Data mixing - determining ratios across your training datasets - matters a lot for model quality. While building Olmo 3, we learned it’s hard to set up a method that finds a strong mix, and hard to maintain that mix as datasets change throughout development. Introducing Olmix👇

Kyle Lo

@kylelostat

Apr 8

my car doesn’t have self driving, but maybe openclaw w vlms and dashcam can do it 🤷🏻‍♂️

Kyle Lo retweeted

allen institute @AllenInstitute

Apr 1

The Orange Cat Brain Atlas is here. 🧠🐈 Today, we published the first comprehensive cellular map of the orange cat brain. The new atlas reveals a single, specialized neuron responsible for behaviors like staring at walls, knocking objects off tables, and the 3am "zoomies."

Kyle Lo

@kylelostat

Mar 27

Today I'm saying farewell to @allen_ai. I'm so proud of our team & grateful to have shared fully-open Olmo, Dolma, olmOCR, Molmo, etc with the world
I know the team is more committed than ever to advancing open-source & open-science. Forever rooting for my dear friends 🫶

Kyle Lo

@kylelostat

Mar 24

It’s been an amazing time building Olmo w Hanna’s leadership and 120% positive vibes 🫶🏻

Hanna Hajishirzi

@HannaHajishirzi

Mar 23

Life update here: Last week marked the end of my time at Ai2.
Proud to have built releases like Olmo, Tülu, FlexOlmo, DRTulu, OLMoTrace, OlmoE, and datasets including Dolma and Dolci—and of how strongly we pushed for open models and open science.
Our artifacts reached 33M downloads, including ~4M for Olmo 3. I believe Olmo has empowered researchers to push the boundaries of AI
I’ll always be cheering on Ai2 and will continue to strongly support open-source, open-science AI. I’m deeply grateful for this chapter and excited for what comes next.

Kyle Lo

@kylelostat

Mar 11

lol best team vibes makes least depressed model

Luca Soldaini 🎀

@soldni

Mar 11

my contribution to olmo:

Kyle Lo

@kylelostat

Mar 5

our new Olmo Hybrid model combines attention with linear RNN layers
🍣 training efficiency is crazy good. the model reaches same MMLU score as Olmo 3 in 50% of the tokens. also see this in many other tasks
as always: weights, data, ckpts, training code, etc. all fully open

Ai2

@allen_ai

Mar 5

Introducing Olmo Hybrid, a 7B fully open model combining transformer and linear RNN layers. It decisively outperforms Olmo 3 7B across evals, w/ new theory & scaling experiments explaining why. 🧵

Kyle Lo

@kylelostat

Mar 5

big congrats to @lambdaviking for leading this project & core contributors @YanhongLi2062 @tyleraromero @AnejSvete @CaiaCostello
blog: allenai.org/blog/olmohybrid
paper: allenai.org/papers/olmo-hybr…
hf collection: huggingface.co/collections/a…

Kyle Lo

@kylelostat

Mar 4

someone's openclaw agent is spam emailing our team w generated questions about olmo, pls stop 🙄

Kyle Lo

@kylelostat

Mar 3

DrawEduMath is our benchmark testing VLM understanding of K-12 student math work, which is prerequisite for their use in educational contexts
one year after, while VLMs are strong math solvers today, they still underperform on our bench, esp for students who need the most help

Lucy Li @lucy3_li

Mar 3

Models are now expert math solvers, and so AI for math education is receiving increasing attention. Our new preprint evaluates 11 VLMs on our QA benchmark, DrawEduMath. We highlight a startling gap: models perform less well on inputs from K-12 students who need more help. 🧵

Title, author list, and two figures from the paper.
Title: The Aftermath of DrawEduMath: Vision Language Models
Underperform with Struggling Students and Misdiagnose Errors
Authors: Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo
Figure 1: On the left is a math problem, where students are asked to draw x < 5/2 on a number line. The right side shows two example student responses that differ in correctness. DrawEduMath pairs each math problem with one student response, and prompts VLMs to answer questions about the student response.
Figure 2: VLMs consistently perform worse on answering DrawEduMath benchmark questions pertaining to erroneous student responses. Performance on non-erroneous student responses is labeled with specific VLMs’ names; that same model’s performance on erroneous student responses is directly below.

Kyle Lo

@kylelostat

Feb 13

our paper on data mixing for LMs is out! while building Olmo 3, we saw gaps between data mixing literature and real practice
🐠 choosing proxy size, # runs, sampling, regression, constraints..
🐟 data shifts during LM dev: can we reuse past experiments?
Olmix tackles them all!

Ai2

@allen_ai

Feb 13

Data mixing – determining how much web text, code, math, etc., you need for LM development – is a first-order lever on model quality. Introducing Olmix: a framework for configuring mixing methods at the start of dev & efficiently updating as data changes throughout. 🧵

Kyle Lo

@kylelostat

Feb 13

one of my favorite topics is dealing with data constraints! what if your proposed mix is 30% code but you don't have enough code? we can repeat our data until we hit target proportions, but too much is risky
we view data mixing as (data) constrained optimization
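
The arithmetic behind the "30% code but not enough code" question can be sketched roughly as below. This is only an illustration, not the Olmix method: the token counts, the 4-epoch repetition cap, and the helper names `repetition_factors` and `max_budget_under_cap` are all hypothetical.

```python
def repetition_factors(target_mix, available, total_budget):
    """Epochs over each source needed to hit target_mix at total_budget tokens."""
    return {
        name: share * total_budget / available[name]
        for name, share in target_mix.items()
    }


def max_budget_under_cap(target_mix, available, max_repeats):
    """Largest total budget that keeps every source at or below max_repeats epochs."""
    return min(
        max_repeats * available[name] / share
        for name, share in target_mix.items()
        if share > 0
    )


# Hypothetical numbers, in billions of tokens (not taken from the paper).
target_mix = {"web": 0.6, "code": 0.3, "math": 0.1}
available = {"web": 900, "code": 50, "math": 40}

# Training on 1000B tokens at this mix means repeating code about 6 times.
factors = repetition_factors(target_mix, available, total_budget=1000)

# With an assumed cap of 4 epochs per source, code is the binding constraint.
budget = max_budget_under_cap(target_mix, available, max_repeats=4)

for name, epochs in factors.items():
    print(f"{name}: {epochs:.2f} epochs")
print(f"max budget under cap: {budget:.1f}B tokens")
```

Under these made-up numbers the scarce source (code) limits either the total budget or the achievable share, which is the sense in which the mix becomes a constrained optimization problem rather than a free choice of ratios.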

Kyle Lo

@kylelostat

Feb 13

this work was led by our intern @MayeeChen and was one of the new ideas we adopted into Olmo 3!
her thread: x.com/MayeeChen/status/20223…
arxiv paper: arxiv.org/abs/2602.12237
blog post: allenai.org/blog/olmix

Mayee Chen

@MayeeChen

Feb 13
