PhD student @uwcse @uwnlp

Joined February 2023
27 Photos and videos
Pinned Tweet
13 Nov 2025
OpenAI's blog (openai.com/index/why-languag…) points out that today’s language models hallucinate because training and evaluation reward guessing instead of admitting uncertainty. This raises a natural question: can we reduce hallucination without hurting utility?🤔 On-policy RL with our Binary Retrieval-Augmented Reward (RAR) can improve factuality (40% reduction in hallucination) while preserving model utility (win rate and accuracy) of fully trained, capable LMs like Qwen3-8B. [1/n]
27
122
671
113,335
Tong Chen retweeted
Jun 11
Excited to share these preliminary results on our internal autoresearch system @Recursive_SI, where we achieve SOTA on nanochat / nanogpt speedrun / kernel benchmarks using the same underlying system without task-specific adaptations. blog: recursive.com/articles/first…
4
24
110
16,890
Tong Chen retweeted
Automatic research from mathematics to AI research: We transfer the ScaleAutoResearch pipeline, which improves a 32-year-old Ramsey number bound, to the NanoGPT Speedrun optimizer track, using Claude Code and Codex with only 1–2 A40 nodes. We run ~300 experiments in ~5k A40 hours, and then: ⭕ Results: improve (non-interpolation) SOTA from 2875 to 2755 steps. Changes: +: non-gain aux β₂ = 0.997; SOAP for all hidden with freq=1; LR-horizon momentum tuning -: remove Circuit-/Contra-/Soft-Muon, Aurora, NorMuon 2nd-moment, V-SOAP-blend, attn denom-floor... Clearly, the experiments are compute-bounded, and it is possible that more results could come with more resources! [1/n]
10
24
130
34,567
Tong Chen retweeted
MAI-Thinking-1 is out! Excited to share what we are building and how climbing from scratch (no distillation) actually works: simple recipes, rigorous science, self-distillation, patience, and great infra. Our tech report has the full story of our RL climbs: microsoft.ai/wp-content/uplo…
Super excited to announce seven new world-class MAI models today. They represent what we consider a new era in AI designed to keep you in control and on the frontier. First is our text foundation model, MAI-Thinking-1, exceptionally strong on reasoning and SWE tasks. - It’s a 35B active parameter MoE with a 256K context window. Independent human raters on Surge prefer it for overall quality in blind side-by-sides versus Sonnet 4.6, and it’s achieved 97% on AIME 2025, the key measure of its general-purpose reasoning abilities. - It's at 53% on SWE Bench Pro, placing it right alongside Opus 4.6 on one of the toughest coding benchmarks. - And since we co-designed our models with our own silicon, MAI-Thinking-1 is optimized on our MAIA 200 chip. Benchmarking head-to-head against the GB200, we see 30% better performance per dollar as well as a 1.4x performance-per-watt gain when running our MAI models on the MAIA 200 end-to-end. Next is MAI-Image-2.5 and its Flash variant. Two super strong models now at #2 on the leaderboards, surpassing the score of Nano Banana 2 on image editing. Last for now is MAI-Code-1-Flash, our new inference efficient coding model, especially tuned for VS Code and GitHub Copilot CLI. - Code-1-Flash achieves 51% on SWE Bench Pro, despite having just 5B parameters, putting it closer to Haiku in size but cheaper in cost. All of this is the foundation for Microsoft Frontier Tuning. It lets you customize our models to create custom, company-specific agents that only you control. You can make our model, your model. Your data. Your agents. Your moat. Early adopters are already seeing a difference. When we tuned our models for McKinsey’s tasks, MAI delivered the highest win rate, outperforming GPT-5.5 on quality, while being 10x lower on cost. Also really excited to be collaborating with the amazing team at Mayo Clinic to jointly train a new frontier AI model for healthcare.
Our announcements today mark another milestone on the road to humanist superintelligence. You can learn more about these and our other new models in our latest blog: microsoft.ai/news/building-a…
24
127
872
124,062
Tong Chen retweeted
🧵(1/8) An @OpenAI internal reasoning LLM achieved an AI Math milestone: solving an open problem central to its mathematical subfield, in this case the unit distance problem of discrete geometry. We came across it in a side quest to truly push our model on the hardest problems.
26
134
955
140,792
Tong Chen retweeted
LMs can learn from human labels, training data, and stronger teachers. But what happens when all of these run out🫪 when the model is already at the frontier and there is no stronger external source to learn from❓ In EvoLM, we extract the model's own evaluative knowledge into rubrics, and use them to improve its own generation🔁 This enables self-improvement with no external signals‼️
6
45
230
35,073
Tong Chen retweeted
2 papers accepted to ICML as Spotlights (top 2.2%)🥳 - DR Tulu: RL w/ evolving rubrics for SOTA long-form deep research arxiv.org/abs/2511.19399 - Binary RAR: RL w/ binary rewards for the hallucination–capability trade-off arxiv.org/abs/2510.17733 Congrats to all collaborators!
7
18
234
11,733
Tong Chen retweeted
New work @AIatMeta: We enable test-time scaling for long-horizon coding agents by using better representations, selection and reuse of agentic trajectories, with Claude 4.5 Opus improving by 6.7% on SWE-Bench Verified and 12.1% on Terminal-Bench 2.0. 📄: arxiv.org/abs/2604.16529
9
43
360
278,988
Tong Chen retweeted
🚀 New work: Meta-Reinforcement Learning with Self-Reflection. LLM agents shouldn't just solve problems. They should learn from their own attempts. Most current RL methods optimize single independent trajectories. Each attempt starts from scratch, with no mechanism to improve across attempts. But intelligent systems should get better after trying once. This raises a fundamental question: How do we train models to learn from their own attempts? We believe Meta-Reinforcement Learning may be a key paradigm for training future LLM agents, enabling models to adapt and improve across attempts and environments. In this work we introduce MR-Search, a training paradigm built around: 🧠 In-Context Meta-Reinforcement Learning 🪞 Self-Reflection 🔁 Learning to learn at test time 📄 Paper: arxiv.org/abs/2603.11327 💻 Code: github.com/tengxiao1/MR-Sear…
11
49
298
51,482
Tong Chen retweeted
Small language models are not very helpful as judges. How about 🔄 backward inference: inferring the instruction given only the response, and using the similarity between the inferred and the original instructions as the reward signal? Introducing ⚙️FLIP, a reference-free and rubric-free reward modeling approach that boosts the RewardBench2 performance of 13 small language models by an average of 79.6%, and substantially outperforms LLM-as-a-Judge under test-time scaling via parallel sampling and GRPO training. 📄paper: arxiv.org/abs/2602.13551 🔗code: github.com/yikee/FLIP
12
53
250
28,318
Tong Chen retweeted
For decades, we’ve trained AI to chase rewards. But humans don’t just optimize outcomes. We experience, reflect, then learn. Can AI do the same? Introducing Experiential Reinforcement Learning, a step toward AI that truly learns from experience.
39
218
1,319
223,801
Tong Chen retweeted
Thrilled to share: OpenScholar, our work on scientific deep research agents for reliable literature synthesis, has been accepted to Nature! 🎉 Huge thanks to collaborators across institutions who made this possible!
33
224
1,263
126,921
Tong Chen retweeted
Calling on behalf of infini-gram: does anyone know where I can get / apply for AWS credits? 💸💸 Keeping infini-gram alive costs quite some money, mostly SSD rental. If you're a fan of keeping open LLM training data readily inspectable, please reply / DM me some pointers! 🧵1/4
3
16
28
3,962
Tong Chen retweeted
Jan 22
Can LLMs automate frontier LLM research, like pre-training and post-training? In our new paper, LLMs found post-training methods that beat GRPO (69.4% vs 48.0%), and pre-training recipes faster than nanoGPT (19.7 minutes vs 35.9 minutes). 1/
10
140
585
110,525
Tong Chen retweeted
AI used to be a distant promise; now it permeates our lives. AI is getting better, but is it making us better? We are promised that AI will augment our minds, but how? We--@EchoShao8899, @shannonzshen, and @michaelryan207--are excited to launch the Augmented Mind Podcast (The AM Podcast), a podcast about technical human-centered AI work. We'll share compelling research, infrastructure, and systems through monthly episodes, featuring interviews with the pioneering minds behind them. We release EP0 today to share who we are, why we started this podcast, and what we're looking forward to. 0:00 - Prelude: the problems we care about 1:48 - Host introduction 2:03 - Why we started the AM Podcast 2:31 - Hot takes on human-centered AI 10:45 - Format of our podcast 11:28 - Unique technical challenges in human-centered AI 16:45 - Let the journey begin!
10
35
83
67,746
Tong Chen retweeted
Super happy to receive the Best Paper Award at #NeurIPS2025 for our Artificial Hivemind paper!! (Really enjoyed giving an oral talk at NeurIPS as well!)
โš ๏ธDifferent models. Same thoughts.โš ๏ธ Todayโ€™s AI models converge into an ๐€๐ซ๐ญ๐ข๐Ÿ๐ข๐œ๐ข๐š๐ฅ ๐‡๐ข๐ฏ๐ž๐ฆ๐ข๐ง๐ ๐Ÿ, a striking case of mode collapse that persists even across heterogeneous ensembles. Our #neurips2025 ๐ƒ&๐ ๐Ž๐ซ๐š๐ฅ ๐ฉ๐š๐ฉ๐ž๐ซ (โœจ๐ญ๐จ๐ฉ ๐ŸŽ.๐Ÿ‘๐Ÿ“%โœจ) dives deep into this phenomenon, introducing ๐ˆ๐ง๐Ÿ๐ข๐ง๐ข๐ญ๐ฒ-๐‚๐ก๐š๐ญ, a real-world dataset of 26K real-world open-ended user queries spanning 17 open-ended categories 31K dense human annotations (๐Ÿ๐Ÿ“ ๐ข๐ง๐๐ž๐ฉ๐ž๐ง๐๐ž๐ง๐ญ ๐š๐ง๐ง๐จ๐ญ๐š๐ญ๐จ๐ซ๐ฌ ๐ฉ๐ž๐ซ ๐ž๐ฑ๐š๐ฆ๐ฉ๐ฅ๐ž) to push AIโ€™s creative and discovery potential forward. Now you can build your favorite models to be truly original, diverse, and impactful in the open-ended real world. ๐Ÿ“Paper: arxiv.org/abs/2510.22954 ๐Ÿ“Data: huggingface.co/collections/lโ€ฆ We also systematically reveal Artificial Hivemind across: ๐Ÿ’ฅ ๐†๐ž๐ง๐ž๐ซ๐š๐ญ๐ข๐ฏ๐ž ๐š๐›๐ข๐ฅ๐ข๐ญ๐ข๐ž๐ฌ: not only do individual LLMs repeat themselves, but different models produce strikingly similar content, even when asked fully open-ended questions. ๐Ÿ’ฅ ๐ƒ๐ข๐ฌ๐œ๐ซ๐ข๐ฆ๐ข๐ง๐š๐ญ๐ข๐ฏ๐ž ๐š๐›๐ข๐ฅ๐ข๐ญ๐ข๐ž๐ฌ: LLMs, LM judges, and reward models are systematically miscalibrated when rating alternative responses to open-ended queries. (1/N)
37
68
777
80,548
Tong Chen retweeted
3 Dec 2025
I'll be at #NeurIPS2025 until 12/7! I work on post-training and reward signals (Spurious Rewards), currently curious about bridging the gap between how humans and LLMs learn. Looking forward to connecting with new and old friends, and also exploring summer 2025 internships. DMs open!
2
7
56
15,761
3 Dec 2025
I will be at #NeurIPS2025 12.3–12.7. Looking forward to meeting old and new friends! ☕️🌮 Recently working on hallucination (Binary RAR) and verbatim memorization (ParaPO), issues that scaling up pretraining cannot simply fix. Also interested in making models learn more like humans: strong generalization, non-scalar rewards, episodic memory, and long-horizon abilities.
1
5
37
4,095
Tong Chen retweeted
1 Dec 2025
8B model can outperform AlphaEvolve on open optimization problems by scaling compute for inference or test-time RL🚀! ⭕Circle packing: AlphaEvolve (Gemini-2.0-Flash/Pro): 2.63586276 Ours (DeepSeek-R1-0528-Qwen3-8B): 2.63598308 🔗in🧵 [1/n]
8
51
201
45,443
28 Nov 2025
PhD applicants: join Akari’s first cohort of students! Akari's research ranges from careful benchmarking to solid methodology. She always gives sharp feedback while being thoughtful and supportive. She stayed driven throughout her PhD and now brings that same energy to her new lab. I am grateful to learn from her and to work with her. Please apply!
25 Nov 2025
1/ Hiring PhD students at CMU SCS (LTI/MLD) for Fall 2026 (Deadline 12/10) 🎓 I work on open, reliable LMs: augmented LMs & agents (RAG, tool use, deep research), safety (hallucinations, copyright), and AI for science, code & multilinguality & open to bold new ideas! FAQ in 🧵
2
3
83
17,464
Tong Chen retweeted
25 Nov 2025
Exciting DR Tulu updates! 📈 DR Tulu-8B (new RL ckpt) sits on the performance–cost frontier, beating Tongyi DR-30B and matching OpenAI DR/Gemini 3 Pro Search at a fraction of the cost. Now on arXiv. 🖥️ You can run an interactive CLI demo with open code, almost for free. 1/🧵
18 Nov 2025
Today we’re releasing Deep Research Tulu (DR Tulu): the first fully open, end-to-end recipe for long-form deep research, plus an 8B agent you can use right away. Train agents that plan, search, synthesize, & cite across sources, making expert research more accessible. 🧭📚
4
28
152
50,577