Sarthak Mittal

Tweets

Pinned Tweet

Sarthak Mittal

@sarthmit

Apr 22

Is distribution sharpening actually the future of scaling, or just a massive hype train? 📉 We put it to the test using an RL framework – simulating everything from sharpening to task reward optimization. Result: It’s not the silver bullet everyone thinks it is!

Sarthak Mittal

@sarthmit

Jun 13

Wow

Anthropic

@AnthropicAI

Jun 13

The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Claude models is not affected. We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible. Read our full statement: anthropic.com/news/fable-myt…

Sarthak Mittal

@sarthmit

Jun 5

I always feel more people should know this

Taco Cohen

@TacoCohen

Jun 4

Replying to @yoavgo

As it turns out, the KL regularized return maximization objective is exactly the ELBO from variational inference. One is forced to REINFORCE because you can’t use the reparameterization trick, but other than that it’s a VAE where action / reasoning tokens are the latents.
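
The equivalence in the reply above can be checked numerically on a toy problem. The sketch below is an illustration only: the three-outcome "policy", the reward values, and every function name are hypothetical. It assumes the standard construction in which the reference policy plays the role of the prior and exp(r / beta) plays the role of the likelihood, so that the KL-regularized return equals beta * log Z minus beta * KL(pi || pi_star), which is beta times an ELBO on log Z.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_regularized_return(pi, ref, reward, beta):
    """E_pi[r] - beta * KL(pi || ref), the usual RL fine-tuning objective."""
    expected_reward = sum(p * r for p, r in zip(pi, reward))
    return expected_reward - beta * kl(pi, ref)

def optimal_policy(ref, reward, beta):
    """Posterior pi_star(y) proportional to ref(y) * exp(r(y) / beta), plus its normalizer Z."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, reward)]
    z = sum(weights)
    return [w / z for w in weights], z

# Toy numbers (assumptions, not taken from any paper or tweet).
ref = [0.5, 0.3, 0.2]      # reference policy, acting as the prior
reward = [0.0, 1.0, 2.0]   # task reward per outcome
beta = 0.5                 # KL coefficient
pi = [0.2, 0.3, 0.5]       # an arbitrary policy, acting as the variational posterior

pi_star, z = optimal_policy(ref, reward, beta)
objective = kl_regularized_return(pi, ref, reward, beta)

# ELBO gap: beta * log Z - objective should equal beta * KL(pi || pi_star).
gap = beta * math.log(z) - objective
print(f"objective={objective:.6f}  beta*logZ={beta * math.log(z):.6f}  gap={gap:.6f}")
```

The gap is non-negative and vanishes only when pi matches pi_star, which is the sense in which maximizing the KL-regularized return is variational inference toward the reward-tilted posterior.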

Sarthak Mittal retweeted

Oleksii Kuchaiev

@kuchaev

Jun 4

Our post-training pipeline is a substantial redesign from Super. The core idea: don't rely on stacked RL stages alone. We do SFT, multi-environment RLVR across a huge mix of agentic/reasoning/code/safety environments, then Multi-teacher On-Policy Distillation (MOPD). 10 domain-specialized teachers, merged into the student via dense token-level guidance on its own rollouts. See Figures below for overview and tech report for all the details. 2/4

Sarthak Mittal retweeted

Dane Malenfant

@dvnxmvl_hdf5

May 28

🚨Excited to announce our workshop Context Beyond the Window hosted at COLM in SF! 🚨 LLMs have finite context windows, yet real-world tasks demand absorbing, retaining, and acting on information that far exceeds any single prompt. 1/3

We're looking for submissions across: context-beyond-window.github…

• Context compression 🧃: token compaction, recursive subagent calls, and external memory for storing and retrieving information
• Efficient architectures 🚀: sub-quadratic attention variants that make extremely long context computationally feasible
• Continual training 🌱: test-time training on streaming data, context distillation, and knowledge accumulation through continued pre-training
• Agentic memory systems 🐘: scaffolds and test-time scaling techniques that improve knowledge retention and acquisition in LLMs
• Evaluation 🎯: benchmarking models on increasingly long-horizon tasks

Modern language models operate within finite context windows, yet many real-world tasks require models to absorb, retain, and act on information that far exceeds any single prompt.

This workshop addresses the full spectrum of context management: fitting more into the window, maintaining state across interactions, and transferring knowledge into parameters. We frame this around the trade-off between context-time memory (information supplied at inference) and weight-time memory (information absorbed into parameters).

Our goal is to build a shared vocabulary across subcommunities that rarely meet in one venue: long-context modeling, retrieval-augmented systems, continual learning, knowledge distillation, and LLM agents.

Sarthak Mittal

@sarthmit

May 26

RT @AnjaSurina: AlphaProof Nexus advancing research math, solving 9 Erdős problems & more! Amazing experience to be part of this team & pro…

Sarthak Mittal retweeted

Moksh Jain @JainMoksh

May 12

The scientific process involves collecting informative measurements while effectively allocating limited resources. We developed MaD-Physics, a new benchmark to measure this capability of agents.

Sarthak Mittal retweeted

John Schulman

@johnschulman2

May 11

Sharing our work on full-duplex multimodal models -- real-time interaction that's natural and intuitive without compromising on intelligence. We started Thinky in part to differentially advance capabilities for human-AI collaboration, which are underemphasized relative to intelligence/autonomy because they're harder to eval. In the future, we think every AI system will have something like an interaction model as the outer user-facing layer, continually keeping the user informed and learning what they actually want.

Thinking Machines

@thinkymachines

May 11

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/int…

Sarthak Mittal

@sarthmit

May 4

Many congratulations to @FrankRHutter, @noahholl, Sauraj, and the entire Prior team!!

Prior Labs

@prior_labs

May 4

Today we announced a major milestone: @prior_labs has entered into a definitive agreement to be acquired by @SAP, scaling Prior Labs to become the next frontier AI lab for structured data. 🧵

Sarthak Mittal

@sarthmit

May 2

Kind of weird that the Gemma 4 2B model is actually 5B including embeddings (I guess that is why they call it E2B). And here I was thinking I had found something comparable to Qwen3 1.7B.

Sarthak Mittal

@sarthmit

Apr 28

The real moat feels like data and compute; what do others think?

This tweet is unavailable

Sarthak Mittal

@sarthmit

Apr 22

Could this all have to do with RL-training instabilities and not distribution sharpening? Our training health checks highlight consistently improving reward, showing that the training methodology works fine, but the optimum is to blame.
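
To make the "optimum is to blame" point concrete, here is a toy sketch of how an objective can be optimized perfectly and still land in the wrong place. It is not the paper's setup: it assumes sharpening is formalized as raising the base distribution to a power alpha and renormalizing, compares that with exponentially tilting the base distribution by a task reward, and uses made-up numbers and hypothetical function names throughout.

```python
import math

def sharpen(base, alpha):
    """Power sharpening (assumed formalization): p(y) proportional to base(y) ** alpha."""
    weights = [p ** alpha for p in base]
    total = sum(weights)
    return [w / total for w in weights]

def reward_tilt(base, reward, beta):
    """Task-reward optimum under a KL penalty: p(y) proportional to base(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(base, reward)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_reward(dist, reward):
    """Expected task reward (here, accuracy) under a distribution."""
    return sum(p * r for p, r in zip(dist, reward))

# Hypothetical base model whose most likely answer (index 0) is wrong;
# only index 1 earns task reward.
base = [0.6, 0.3, 0.1]
task_reward = [0.0, 1.0, 0.0]

# Exact optimum of the sharpening target for increasing sharpness.
accuracy_by_alpha = {
    alpha: expected_reward(sharpen(base, alpha), task_reward)
    for alpha in (1, 2, 4, 8)
}

# Exact optimum of the KL-regularized task-reward objective.
tilted_accuracy = expected_reward(reward_tilt(base, task_reward, beta=0.5), task_reward)

for alpha, acc in accuracy_by_alpha.items():
    print(f"alpha={alpha}: accuracy={acc:.4f}")
print(f"reward-tilted: accuracy={tilted_accuracy:.4f}")
```

In this toy case the sharpened distribution concentrates on the base model's mode, so accuracy falls as alpha grows even though the sharpening target is reached exactly, while the reward-tilted distribution moves mass toward the correct answer.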

Sarthak Mittal

@sarthmit

Apr 22

We used the NeMo RL codebase to implement the RL training. Paper: arxiv.org/pdf/2604.16259. Joint work with Leo and Guillaume. The setup is heavily inspired by arxiv.org/abs/2310.04363

Sarthak Mittal

@sarthmit

Apr 22

We used the NeMo RL codebase to implement the RL training. Paper: arxiv.org/abs/2604.16259. Joint work with Leo Gagnon and @g_lajoie_. The setup is heavily inspired by arxiv.org/abs/2310.04363

Beyond Distribution Sharpening: The Importance of Task Rewards

Frontier models have demonstrated exceptional capabilities following the integration of task-reward-based reinforcement learning (RL) into their training pipelines, enabling systems to evolve from...

arxiv.org