Anton Baumann retweeted

Sasha Rush

@srush_nlp

Jun 4

On-Policy Distillation is the most active new research direction being explored in RL for LLMs. Had the chance to discuss how it works with Dwarkesh and why it fits so nicely into large-scale pipelines.

Dwarkesh Patel

@dwarkesh_sp

Jun 4

Recently met @srush_nlp and he started giving me an impromptu lecture on how targeted on-policy self-distillation works. I asked him if I could record it on my iPhone. The basic idea is this: if the model made a mistake at some point in the rollout (for example, calling a tool that doesn't exist), we want to discourage this specific error, but we don't want to learn only from the final reward, because it's a very noisy signal spread out over the whole trajectory. So we have another model read this trajectory and figure out where the error was made. It simply inserts some hint tokens into the trajectory right above where the mistake was made. Now, with these injected hint tokens, have the model run a forward pass. You don't have to regenerate a new rollout, aka no new decode required. The hint causes the model to assign lower probabilities to the error tokens. You then train the original model to match these new probabilities, teaching it to downweight that specific mistake.
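The steps described above can be sketched in a few lines of plain Python. This is only a toy illustration, not the method as implemented by anyone quoted here: `toy_model`, the three-token vocabulary, the hint string, and the choice of KL(teacher || student) as the matching loss are all assumptions made for the example. It shows that the hinted pass only re-scores the existing rollout tokens (no sampling) and that the hint lowers the probability of the error token.

```python
import math

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["call_real_tool", "call_fake_tool", "done"]
HINT = "<hint: fake_tool does not exist>"  # assumed hint text


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(context):
    """Stand-in for one LLM forward step: next-token probabilities.

    Assumption: seeing the hint in context lowers the logit of the bad call.
    """
    logits = [1.0, 1.5, 0.5]
    if HINT in context:
        logits[1] -= 3.0
    return softmax(logits)


def score_rollout(prompt, rollout, hint=None, hint_pos=0):
    """Teacher-forced scoring of an existing rollout; nothing is sampled."""
    dists = []
    context = list(prompt)
    for i, tok in enumerate(rollout):
        if hint is not None and i == hint_pos:
            context.append(hint)  # inject the hint right above the mistake
        dists.append(toy_model(context))
        context.append(tok)
    return dists


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


prompt = ["<task>"]
rollout = ["call_real_tool", "call_fake_tool", "done"]
error_pos = 1  # position flagged by the (hypothetical) reviewing model

student = score_rollout(prompt, rollout)                       # no hint
teacher = score_rollout(prompt, rollout, HINT, error_pos)      # with hint

err = VOCAB.index("call_fake_tool")
# Per-token distillation loss; the KL direction is an assumption.
loss = sum(kl(t, s) for t, s in zip(teacher, student))

print("P(error) without hint:", student[error_pos][err])
print("P(error) with hint:   ", teacher[error_pos][err])
print("distillation loss:    ", loss)
```

In a real setup the student and teacher would be the same network with and without the hint in context, and the loss would be backpropagated through the student pass only. Positions before the hint contribute nothing to the loss here, since both passes see the same context there.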

Anton Baumann retweeted

Ronak Malde

@rronak_

May 27

We have been exploring new algorithmic frontiers and are excited to share our contributions to Self-Distillation Policy Optimization (SDPO) for agentic continual learning. Check out our blog post here: trajectory.ai/field-notes/sc…

Scaling SDPO - Trajectory

Research lab and product company building the platform for continual learning.

trajectory.ai

Anton Baumann retweeted

Sasha Rush

@srush_nlp

May 18

Been working on text feedback / OPSD in Composer. Really interesting space, and much more to be explored.

Cursor

@cursor_ai

May 18

Replying to @cursor_ai

We improved Composer by scaling training, generating more complex RL environments, and introducing new learning methods. For example, we use text feedback during RL to learn faster by assigning credit in rollouts spanning hundreds of thousands of tokens.

Anton Baumann retweeted

Jonas Hübotter

@jonashubotter

May 18

Self-distillation for long-horizon training at scale!

Cursor

@cursor_ai

May 18

Introducing Composer 2.5, our most powerful model yet. It's more intelligent, better at sustained work on long-running tasks, and more reliable at following complex instructions. For the next week, we’re doubling the included usage of the model.

Anton Baumann retweeted

Jonas Hübotter

@jonashubotter

Apr 26

Today and tomorrow we’ll be presenting self-distillation with orals at ICLR in Rio 🇧🇷
1. “Self-Distillation enables Continual Learning” at lifelong agents workshop (Sun 11:30am)
2. “Reinforcement Learning via Self-Distillation” at scaling post-training workshop (Mon 2:40pm)
3. “Test-Time Self-Distillation” at test-time updates workshop (Mon 4:15pm)

Anton Baumann retweeted

Jonas Hübotter

@jonashubotter

Feb 15

Just came across this great discussion of self-distillation on @latentspacepod! Really good rundown by Ted Kyi, and we’re every bit as excited about what’s next as he is! m.youtube.com/watch?v=CrJp0s…

RL via Self-Distillation (SDPO) Paper Club 12 Feb 2026

Latent Space Paper Club with Johan Duramy and swyx - 12 Feb 2026T...

youtube.com

Anton Baumann retweeted

Explainable Machine Learning @ExplainableML

Feb 12

3/ Post-hoc Probabilistic Vision-Language Models
@_antonbaumann, @ruili_pml, Marcus Klasson, Santeri Mentu, @ShyamgopalKart1, @zeynepakata, @arnosolin, Martin Trapp
[Paper]: arxiv.org/pdf/2412.06014
[Project]: aaltoml.github.io/BayesVLM/
[Code]: github.com/AaltoML/BayesVLM

Anton Baumann retweeted

Jonas Hübotter

@jonashubotter

Jan 29

Training LLMs with verifiable rewards uses a 1-bit signal per generated response. This hides why the model failed. Today, we introduce a simple algorithm that lets the model learn from any rich feedback by turning it into dense supervision! (1/n)
