Graduate Student at @Mila_Quebec and Student Researcher at @GoogleResearch. Previously interned at @Meta @Apple @MorganStanley @NVIDIAAI and @YorkUniversity

Joined February 2019
Pinned Tweet
Is distribution sharpening actually the future of scaling, or just a massive hype train? 📉 We put it to the test using an RL framework – simulating everything from sharpening to task reward optimization. Result: It’s not the silver bullet everyone thinks it is!
Wow
The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Claude models is not affected. We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible. Read our full statement: anthropic.com/news/fable-myt…
I always feel more people should know this
Replying to @yoavgo
As it turns out, the KL-regularized return maximization objective is exactly the ELBO from variational inference. One is forced to use REINFORCE because the reparameterization trick does not apply to discrete tokens, but other than that it’s a VAE where the action / reasoning tokens are the latents.
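A minimal sketch of the identity mentioned above, under assumed notation that does not appear in the original post: q is the prompt, x the sampled action / reasoning tokens, r(x, q) a reward, β > 0 the KL coefficient, π_ref the reference policy, and O a hypothetical binary "optimality" variable whose likelihood is taken to be proportional to exp(r / β).

```latex
% Assumed notation (not from the original post): q, x, r, \beta, \pi_{\mathrm{ref}}, O.
\begin{align*}
% KL-regularized return maximization objective
J(\pi_\theta)
  &= \mathbb{E}_{x \sim \pi_\theta(\cdot \mid q)}\big[ r(x, q) \big]
     - \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid q) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid q) \big) \\[4pt]
% Latent-variable model: prior \pi_{\mathrm{ref}}, likelihood p(O = 1 \mid x, q) \propto \exp(r / \beta)
\log p(O = 1 \mid q)
  &= \log \sum_{x} \pi_{\mathrm{ref}}(x \mid q) \, \exp\big( r(x, q) / \beta \big) + \text{const} \\
% Jensen's inequality with variational posterior \pi_\theta
  &\ge \mathbb{E}_{x \sim \pi_\theta(\cdot \mid q)}\big[ r(x, q) / \beta \big]
     - \mathrm{KL}\big( \pi_\theta(\cdot \mid q) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid q) \big) + \text{const}
   = \tfrac{1}{\beta} J(\pi_\theta) + \text{const} \\[4pt]
% Score-function (REINFORCE) gradient, since x is discrete and cannot be reparameterized
\nabla_\theta J(\pi_\theta)
  &= \mathbb{E}_{x \sim \pi_\theta(\cdot \mid q)}\Big[
       \Big( r(x, q) - \beta \log \tfrac{\pi_\theta(x \mid q)}{\pi_{\mathrm{ref}}(x \mid q)} \Big)
       \nabla_\theta \log \pi_\theta(x \mid q) \Big]
\end{align*}
```

In this reading the reference policy plays the role of the prior, the exponentiated reward plays the role of the decoder likelihood, and the trained policy is the variational posterior. The bound is tight when π_θ(x | q) ∝ π_ref(x | q) exp(r(x, q) / β).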
Sarthak Mittal retweeted
Our post-training pipeline is a substantial redesign from Super. The core idea: don't rely on stacked RL stages alone. We do SFT, multi-environment RLVR across a huge mix of agentic/reasoning/code/safety environments, then Multi-teacher On-Policy Distillation (MOPD). 10 domain-specialized teachers, merged into the student via dense token-level guidance on its own rollouts. See Figures below for overview and tech report for all the details. 2/4
Sarthak Mittal retweeted
🚨Excited to announce our workshop Context Beyond the Window hosted at COLM in SF! 🚨 LLMs have finite context windows, yet real-world tasks demand absorbing, retaining, and acting on information that far exceeds any single prompt. 1/3
We're looking for submissions across: context-beyond-window.github…
• Context compression 🧃 — token compaction, recursive subagent calls, and external memory for storing and retrieving information
• Efficient architectures 🚀 — sub-quadratic attention variants that make extremely long context computationally feasible
• Continual training 🌱 — test-time training on streaming data, context distillation, and knowledge accumulation through continued pre-training
• Agentic memory systems 🐘 — scaffolds and test-time scaling techniques that improve knowledge retention and acquisition in LLMs
• Evaluation 🎯 — benchmarking models on increasingly long-horizon tasks
RT @AnjaSurina: AlphaProof Nexus advancing research math, solving 9 Erdős problems & more! Amazing experience to be part of this team & pro…
Sarthak Mittal retweeted
The scientific process involves collecting informative measurements while effectively allocating limited resources. We developed MaD-Physics, a new benchmark to measure this capability of agents.
Sarthak Mittal retweeted
Sharing our work on full-duplex multimodal models -- real-time interaction that's natural and intuitive without compromising on intelligence. We started Thinky in part to differentially advance capabilities for human-AI collaboration, which are underemphasized relative to intelligence/autonomy because they're harder to eval. In the future, we think every AI system will have something like an interaction model as the outer user-facing layer, continually keeping the user informed and learning what they actually want.
People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/int…
Many congratulations to @FrankRHutter @noahholl, sauraj and the entire prior team!!
Today we announced a major milestone: @prior_labs has entered into a definitive agreement to be acquired by @SAP, scaling Prior Labs to become the next frontier AI lab for structured data. 🧵
Kind of weird that the Gemma 4 2B model is actually 5B (including embeddings; I guess that is why they say E2B). And here I was thinking I had found something comparable to Qwen3 1.7B.
The real moat feels like data and compute; what do others think?
Could this all have to do with RL-training instabilities and not distribution sharpening? Our training health checks show consistently improving reward, which indicates that the training methodology works fine and that the optimum itself is to blame.
We used the NeMo RL codebase to implement the RL training. Paper: arxiv.org/pdf/2604.16259 Joint work with Leo and Guillaume. The setup is heavily inspired by arxiv.org/abs/2310.04363