Tong Chen

Tweets

Pinned Tweet

Tong Chen

@tomchen0

13 Nov 2025

OpenAI's blog (openai.com/index/why-languag…) points out that today’s language models hallucinate because training and evaluation reward guessing instead of admitting uncertainty. This raises a natural question: can we reduce hallucination without hurting utility?🤔 On-policy RL with our Binary Retrieval-Augmented Reward (RAR) can improve factuality (40% reduction in hallucination) while preserving model utility (win rate and accuracy) of fully trained, capable LMs like Qwen3-8B. [1/n]

122

671

113,335

Tong Chen retweeted

CLS

@ChengleiSi

Jun 11

Excited to share these preliminary results on our internal autoresearch system @Recursive_SI, where we achieve SOTA on nanochat / nanogpt speedrun / kernel benchmarks using the same underlying system without task-specific adaptations. blog: recursive.com/articles/first…

First Steps Toward Automated AI Research - Recursive

Early results from Recursive’s automated AI research system on model training and GPU kernel benchmarks

recursive.com

Recursive

@Recursive_SI

Jun 11

x.com/i/article/206456979931…

110

16,890

Tong Chen retweeted

Yiping Wang

@ypwang61

Jun 9

Automatic research from mathematics to AI research: We transfer the ScaleAutoResearch pipeline, which improves a 32-year-old Ramsey number bound, to the NanoGPT Speedrun optimizer track, using Claude Code and Codex with only 1–2 A40 nodes. We run ~300 experiments in ~5k A40 hours, and then: ⭕ Results: improve (non-interpolation) SOTA from 2875 to 2755 steps. Changes: +: non-gain aux β₂ = 0.997; SOAP for all hidden with freq=1; LR-horizon momentum tuning. -: remove Circuit-/Contra-/Soft-Muon, Aurora, NorMuon 2nd-moment, V-SOAP-blend, attn denom-floor... Clearly, the experiments are compute-bound, and it is possible that more results could come with more resources! [1/n]

130

34,567

Tong Chen retweeted

Hanna Hajishirzi

@HannaHajishirzi

Jun 2

MAI-Thinking-1 is out! Excited to share what we are building and how climbing from scratch (no distillation) actually works: simple recipes, rigorous science, self-distillation, patience, and great infra. Check out our tech report, which has the full story of our RL climbs. microsoft.ai/wp-content/uplo…

Mustafa Suleyman

@mustafasuleyman

Jun 2

Super excited to announce seven new world-class MAI models today. They represent what we consider a new era in AI designed to keep you in control and on the frontier. First is our text foundation model, MAI-Thinking-1, exceptionally strong on reasoning and SWE tasks. - It’s a 35B active parameter MoE with a 256K context window. Independent human raters on Surge prefer it for overall quality in blind side-by-sides versus Sonnet 4.6, and it’s achieved 97% on AIME 2025, the key measure of its general-purpose reasoning abilities. - It's at 53% on SWE Bench Pro, placing it right alongside Opus 4.6 on one of the toughest coding benchmarks. - And since we co-designed our models with our own silicon, MAI-Thinking-1 is optimized on our MAIA 200 chip. Benchmarking head-to-head against the GB200, we see 30% better performance per dollar as well as a 1.4x performance-per-watt gain when running our MAI models on the MAIA 200 end-to-end. Next is MAI-Image-2.5 and its Flash variant. Two super strong models now at #2 on the leaderboards, surpassing the score of Nano Banana 2 on image editing. Last for now is MAI-Code-1-Flash, our new inference efficient coding model, especially tuned for VS Code and GitHub Copilot CLI. - Code-1-Flash achieves 51% on SWE Bench Pro, despite having just 5B parameters, putting it closer to Haiku in size but cheaper in cost. All of this is the foundation for Microsoft Frontier Tuning. It lets you customize our models to create custom, company-specific agents that only you control. You can make our model, your model. Your data. Your agents. Your moat. Early adopters are already seeing a difference. When we tuned our models for McKinsey’s tasks, MAI delivered the highest win rate, outperforming GPT-5.5 on quality, while being 10x lower on cost. Also really excited to be collaborating with the amazing team at Mayo Clinic to jointly train a new frontier AI model for healthcare. 
Our announcements today mark another milestone on the road to humanist superintelligence. You can learn more about these and our other new models in our latest blog: microsoft.ai/news/building-a…

127

872

124,062

Tong Chen retweeted

Hongxun Wu @HongxunWu

May 20

🧵(1/8) An @OpenAI internal reasoning LLM achieved an AI Math milestone: solving an open problem central to its mathematical subfield— in this case, the unit distance problem of discrete geometry. We came across it in a side quest to truly push our model on the hardest problems.

134

955

140,792

Tong Chen retweeted

Stella Li

@StellaLisy

May 7

LMs can learn from human labels, training data, and stronger teachers. But what happens when all of these run out🫪 when the model is already at the frontier and there is no stronger external source to learn from❓ In EvoLM, we extract the model's own evaluative knowledge into rubrics, and use them to improve its own generation🔁 This enables self-improvement with no external signals‼️

230

35,073

Tong Chen retweeted

Akari Asai

@AkariAsai

Apr 30

2 papers accepted to ICML as Spotlights (top 2.2%)🥳 - DR Tulu: RL w/ evolving rubrics for SOTA long-form deep research arxiv.org/abs/2511.19399 - Binary RAR: RL w/ binary rewards for the hallucination–capability trade-off arxiv.org/abs/2510.17733 Congrats to all collaborators!

234

11,733

Tong Chen retweeted

Joongwon Kim

@danieljwkim

Apr 22

New work @AIatMeta: We enable test-time scaling for long-horizon coding agents by using better representations, selection and reuse of agentic trajectories, with Claude 4.5 Opus improving by 6.7% on SWE-Bench Verified and 12.1% on Terminal-Bench 2.0. 📄: arxiv.org/abs/2604.16529

360

278,988

Tong Chen retweeted

Teng Xiao

@TengX6

Mar 16

🚀 New work: Meta-Reinforcement Learning with Self-Reflection LLM agents shouldn't just solve problems. They should learn from their own attempts. Most current RL methods optimize single independent trajectories. Each attempt starts from scratch, with no mechanism to improve across attempts. But intelligent systems should get better after trying once. This raises a fundamental question: How do we train models to learn from their own attempts? We believe Meta-Reinforcement Learning may be a key paradigm for training future LLM agents, enabling models to adapt and improve across attempts and environments. In this work we introduce MR-Search, a training paradigm built around: 🧠 In-Context Meta-Reinforcement Learning 🪞 Self-Reflection 🔁 Learning to learn at test time 📄 Paper: arxiv.org/abs/2603.11327 💻 Code: github.com/tengxiao1/MR-Sear…

Meta-Reinforcement Learning with Self-Reflection for Agentic Search

This paper introduces MR-Search, an in-context meta reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent...

arxiv.org

298

51,482

Tong Chen retweeted

Yike Wang

@yikewang_

Feb 18

Small language models are not very helpful as judges, how about 🔄 backward inference—inferring the instruction given only the response, and using the similarity between the inferred and the original instructions as the reward signal? Introducing ⚙️FLIP, a reference-free and rubric-free reward modeling approach that boosts the RewardBench2 performance of 13 small language models by an average of 79.6%, and substantially outperforms LLM-as-a-Judge under test-time scaling via parallel sampling and GRPO training. 📄paper: arxiv.org/abs/2602.13551 🔗code: github.com/yikee/FLIP

250

28,318
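The backward-inference idea in the FLIP post above (infer the instruction from the response alone, then reward similarity to the original instruction) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `infer_instruction` is a hypothetical callable standing in for a small language model, `toy_inferrer` is a hard-coded stand-in for it, and token-overlap F1 is an assumed similarity measure chosen for simplicity; the similarity FLIP actually uses is not stated in the post.

```python
from collections import Counter
from typing import Callable


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two strings (an assumed, simple similarity)."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    # Multiset intersection counts each shared token at most min(count_a, count_b) times.
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_a)
    recall = overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def backward_inference_reward(
    instruction: str,
    response: str,
    infer_instruction: Callable[[str], str],
) -> float:
    """Reward a response by how well the instruction can be recovered from it."""
    inferred = infer_instruction(response)  # hypothetical LM call: response -> guessed instruction
    return token_f1(inferred, instruction)


def toy_inferrer(response: str) -> str:
    """Hard-coded stand-in for a small LM that guesses the instruction."""
    if "Paris" in response:
        return "what is the capital of france"
    return "tell me something"


instruction = "What is the capital of France"
on_topic = backward_inference_reward(instruction, "The capital is Paris.", toy_inferrer)
off_topic = backward_inference_reward(instruction, "I like turtles.", toy_inferrer)
```

In this toy setup the on-topic response earns a higher reward than the off-topic one because the instruction can be recovered from it. A real setup would replace `toy_inferrer` with a model call and would likely use a stronger similarity measure.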

Tong Chen retweeted

Taiwei Shi @taiwei_shi

Feb 17

For decades, we’ve trained AI to chase rewards. But humans don’t just optimize outcomes. We experience, reflect, then learn. Can AI do the same? Introducing 𝐄𝐱𝐩𝐞𝐫𝐢𝐞𝐧𝐭𝐢𝐚𝐥 𝐑𝐞𝐢𝐧𝐟𝐨𝐫𝐜𝐞𝐦𝐞𝐧𝐭 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠, a step toward AI that truly learns from experience.

218

1,319

223,801

Tong Chen retweeted

Akari Asai

@AkariAsai

Feb 4

Thrilled to share: OpenScholar - our work on scientific deep research agents for reliable literature synthesis - has been accepted to Nature! 🎉 Huge thanks to collaborators across institutions who made this possible!

224

1,263

126,921

Tong Chen retweeted

Jiacheng Liu @liujc1998

Jan 26

Calling on behalf of infini-gram: does anyone know where I can get / apply for AWS credits? 💸💸 Keeping infini-gram alive costs quite some money, mostly SSD rental. If you're a fan of keeping open LLM training data readily inspectable, please reply / DM me some pointers! 🧵1/4

3,962

Tong Chen retweeted

CLS

@ChengleiSi

Jan 22

Can LLMs automate frontier LLM research, like pre-training and post-training? In our new paper, LLMs found post-training methods that beat GRPO (69.4% vs 48.0%), and pre-training recipes faster than nanoGPT (19.7 minutes vs 35.9 minutes). 1/

140

585

110,525

Tong Chen retweeted

Augmented Mind Podcast

@augmind_fm

Jan 21

AI used to be a distant promise; now it permeates our lives. AI is getting better, but is it making us better? We are promised that AI will augment our minds, but how? We--@EchoShao8899, @shannonzshen, and @michaelryan207--are excited to launch the Augmented Mind Podcast (The AM Podcast), a podcast about technical human-centered AI work. We'll share compelling research, infrastructure, and systems through monthly episodes, featuring interviews with the pioneering minds behind them. We release EP0 today to share who we are, why we started this podcast, and what we're looking forward to. 0:00 - Prelude: the problems we care about 1:48 - Host introduction 2:03 - Why we started the AM Podcast 2:31 - Hot takes on human-centered AI 10:45 - Format of our podcast 11:28 - Unique technical challenges in human-centered AI 16:45 - Let the journey begin!

17:23

67,746

Tong Chen retweeted

Liwei Jiang @liweijianglw

4 Dec 2025

Super happy to receive the Best Paper Award at #NeurIPS2025 for our Artificial Hivemind paper!! (Really enjoyed giving an oral talk at NeurIPS as well!)

Liwei Jiang @liweijianglw

29 Oct 2025

⚠️Different models. Same thoughts.⚠️ Today’s AI models converge into an 𝐀𝐫𝐭𝐢𝐟𝐢𝐜𝐢𝐚𝐥 𝐇𝐢𝐯𝐞𝐦𝐢𝐧𝐝 🐝, a striking case of mode collapse that persists even across heterogeneous ensembles. Our #neurips2025 𝐃&𝐁 𝐎𝐫𝐚𝐥 𝐩𝐚𝐩𝐞𝐫 (✨𝐭𝐨𝐩 𝟎.𝟑𝟓%✨) dives deep into this phenomenon, introducing 𝐈𝐧𝐟𝐢𝐧𝐢𝐭𝐲-𝐂𝐡𝐚𝐭, a dataset of 26K real-world open-ended user queries spanning 17 open-ended categories, with 31K dense human annotations (𝟐𝟓 𝐢𝐧𝐝𝐞𝐩𝐞𝐧𝐝𝐞𝐧𝐭 𝐚𝐧𝐧𝐨𝐭𝐚𝐭𝐨𝐫𝐬 𝐩𝐞𝐫 𝐞𝐱𝐚𝐦𝐩𝐥𝐞) to push AI’s creative and discovery potential forward. Now you can build your favorite models to be truly original, diverse, and impactful in the open-ended real world. 📍Paper: arxiv.org/abs/2510.22954 📍Data: huggingface.co/collections/l… We also systematically reveal Artificial Hivemind across: 💥 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐚𝐛𝐢𝐥𝐢𝐭𝐢𝐞𝐬: not only do individual LLMs repeat themselves, but different models produce strikingly similar content, even when asked fully open-ended questions. 💥 𝐃𝐢𝐬𝐜𝐫𝐢𝐦𝐢𝐧𝐚𝐭𝐢𝐯𝐞 𝐚𝐛𝐢𝐥𝐢𝐭𝐢𝐞𝐬: LLMs, LM judges, and reward models are systematically miscalibrated when rating alternative responses to open-ended queries. (1/N)

777

80,548

Tong Chen retweeted

Rui Xin @rui_xin31

3 Dec 2025

I'll be at #NeurIPS2025 until 12/7! I work on post-training and reward signals (Spurious Rewards), currently curious about bridging the gap between how humans and LLMs learn. Looking forward to connecting with new and old friends—also exploring summer 2025 internships. DMs open!

15,761

Tong Chen

@tomchen0

3 Dec 2025

I will be at #NeurIPS2025 12.3–12.7. Looking forward to meeting old and new friends! ☕️🌮 Recently working on hallucination (Binary RAR) and verbatim memorization (ParaPO), issues that scaling up pretraining cannot simply fix. Also interested in making models learn more like humans: strong generalization, non-scalar rewards, episodic memory, and long-horizon abilities.

4,095

Tong Chen retweeted

Yiping Wang

@ypwang61

1 Dec 2025

8B model can outperform AlphaEvolve on open optimization problems by scaling compute for inference or test-time RL🚀! ⭕Circle packing: AlphaEvolve (Gemini-2.0-Flash/Pro) : 2.63586276 Ours (DeepSeek-R1-0528-Qwen3-8B) : 2.63598308 🔗in🧵 [1/n]

201

45,443

Tong Chen

@tomchen0

28 Nov 2025

PhD applicants — Join Akari’s first cohort of students! Akari's research ranges from careful benchmarking to solid methodology. She always gives sharp feedback while being thoughtful and supportive. She stayed driven throughout her PhD and now brings that same energy to her new lab. I am grateful to learn from her and to work with her — please apply!

Akari Asai

@AkariAsai

25 Nov 2025

1/ Hiring PhD students at CMU SCS (LTI/MLD) for Fall 2026 (Deadline 12/10) 🎓 I work on open, reliable LMs: augmented LMs & agents (RAG, tool use, deep research), safety (hallucinations, copyright), and AI for science, code & multilinguality & open to bold new ideas! FAQ in 🧵

17,464

Tong Chen retweeted

Akari Asai

@AkariAsai

25 Nov 2025

Exciting DR Tulu updates! 📈 DR Tulu-8B (new RL ckpt) sits on the performance–cost frontier, beating Tongyi DR-30B and matching OpenAI DR/Gemini 3 Pro Search at a fraction of the cost. Now on arXiv. 🖥️ You can run an interactive CLI demo with open code, almost for free. 1/🧵

ALT: “Scatter plot titled ‘Average Performance Across Deep Research Benchmarks.’ X-axis shows cost in USD on a log scale (0.001–10), y-axis shows score (%) from ~0–75. A pink star labeled ‘DR Tulu-8B’ sits in the top-left ‘Best performance/cost’ region (high score, very low cost). Green circles (open models: ASearcher-Web-7B, WebThinker-32B-DPO, WebExplorer-8B, Tongyi DR-30B-A3B) and blue squares (closed models: Gemini 3 Pro Search, GPT-5 Search, OpenAI DR, Ai2 ScholarQA/Claude Sonnet) appear to the right at higher costs and similar or lower scores. Legend: pink star = ours (open), green circle = open model, blue square = closed model. DR Tulu sits on the performance–cost frontier, beating Tongyi DR-30B-A3B and matching OpenAI DR / Gemini 3 Pro Search at a fraction of the cost.”

Ai2

@allen_ai

18 Nov 2025

Today we’re releasing Deep Research Tulu (DR Tulu)—the first fully open, end-to-end recipe for long-form deep research, plus an 8B agent you can use right away. Train agents that plan, search, synthesize, & cite across sources, making expert research more accessible. 🧭📚

1:02

152

50,577