Bo Liu (Benjamin Liu)

Tweets

Pinned Tweet

Bo Liu (Benjamin Liu)

@Benjamin_eecs

1 Jul 2025

We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces reasoning strategies. We introduce SPIRAL, where models learn reasoning by competing against themselves in games, creating an infinite curriculum without human supervision. Training LLMs with self-play RL on Kuhn Poker improves math reasoning by 8.7% average. Just playing Kuhn Poker improves Minerva Math scores by 18.1 points! 🃏 🔗 Paper: huggingface.co/papers/2506.2… 🧑‍💻 Code: github.com/spiral-rl/spiral

279

71,774

Bo Liu (Benjamin Liu) retweeted

Noam Brown

@polynoamial

Jun 11

I'm happy GPT-5.5 tops this eval. I'm even happier it's still doing the best when measured vs tokens, cost, or wall-clock time!

Dawn Song

@dawnsongtweets

Jun 11

Everyone says the latest AI agents will be "job-ready" soon, especially after the release of Fable 5 this week. But is that really the case? Over the past many months, my group and collaborators have been building Agents' Last Exam (ALE), a benchmark designed to test exactly that claim on real digital labor-market work. My group and collaborators previously have created many of the benchmarks the field runs on, including MMLU, MATH, CyberGym, and ExploitGym. Today, I'm excited to share Agents' Last Exam (ALE): a rolling benchmark that measures whether AI agents can actually perform economically valuable work across a broad range of real-world domains. With ALE, we evaluated Fable 5, GPT-5.5, Composer 2.5, and other frontier agent systems across more than 1,500 expert-sourced tasks spanning 55 occupations. The result is both impressive and sobering. Today's agents can solve a meaningful fraction of professional tasks. But when we look at the hardest tasks, the ones requiring sustained reasoning, deep domain expertise, and reliable execution over long horizons, they are still far from human-level performance. On ALE's hardest tier, every frontier agent we tested, including Fable 5, achieved a 0% success rate. The age of useful agents is here. The age of truly job-ready agents is not. We hope Agents' Last Exam (ALE) will serve as a new guidepost and north star for developing agents capable of reliably performing economically valuable work across a broad range of domains. 🧵

728

93,254

Bo Liu (Benjamin Liu) retweeted

Dawn Song

@dawnsongtweets

Jun 11

171

822

216,093

Bo Liu (Benjamin Liu) retweeted

Ian Stewart-Binks

@binks_stewart

May 22

Project Genie is magical... but we've also been working on some new ways to interact with another player (or agent). It was super fun to demo this new capability at Google I/O this week, where we enabled attendees to explore worlds with Gemini as a companion. Going forward, we are incredibly excited to see how this can enable Gemini to learn how to interact with humans in embodied environments. Some examples of interacting with Gemini in real-time within these generated worlds:

[Video, 0:31]

32,319

Bo Liu (Benjamin Liu) retweeted

OpenAI

@OpenAI

May 20

Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in 1946. For nearly 80 years, mathematicians believed the best possible solutions looked roughly like square grids. An OpenAI model has now disproved that belief, discovering an entirely new family of constructions that performs better. This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics.

[Video, 2:38]

1,200

3,918

26,776

13,579,624

Bo Liu (Benjamin Liu) retweeted

Yu Su

@ysu_nlp

May 19

nice work by @DimitrisPapail and @VaishShrivas! this work is reinforcing a recent trend that tries to make foundation models jointly predict future states (aka 'world models') and actions instead of actions alone. we're seeing it in different forms, like World Action Models in embodied agents, or implicit world modeling in Early Experience (arxiv.org/abs/2510.08558). also some interesting link to on-policy self-distillation. shared learning here is, there's still rich supervision signals that are underexplored. such signals were hard to exploit in classic ML, but foundation models have made it possible, potentially creating a recursive self-improvement loop.

Dimitris Papailiopoulos

@DimitrisPapail

May 18

x.com/i/article/205634415123…

203

26,473

Bo Liu (Benjamin Liu) retweeted

Richard Sutton

@RichardSSutton

May 18

The bitter lesson in 26 words: Don’t be distracted by human knowledge, as AI has been historically. Instead focus on methods for creating knowledge that scale with computation, like search and learning.

137

978

7,444

585,727

Bo Liu (Benjamin Liu) retweeted

Tim Rocktäschel

@_rockt

May 13

Excited to co-found Recursive (@recursive_si) with an exceptional team in London and SF to create AI that experiments on how to safely improve itself, turning compute into knowledge that accumulates in an open-ended process of endless, automated scientific discoveries.

111

907

252,961

Bo Liu (Benjamin Liu) retweeted

John Schulman

@johnschulman2

May 11

Sharing our work on full-duplex multimodal models -- real-time interaction that's natural and intuitive without compromising on intelligence. We started Thinky in part to differentially advance capabilities for human-AI collaboration, which are underemphasized relative to intelligence/autonomy because they're harder to eval. In the future, we think every AI system will have something like an interaction model as the outer user-facing layer, continually keeping the user informed and learning what they actually want.

Thinking Machines

@thinkymachines

May 11

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/int…

[Video, 2:15]

929

123,980

Bo Liu (Benjamin Liu) retweeted

Thinking Machines

@thinkymachines

May 11

[Video, 2:15]

464

1,961

15,787

7,754,508

Bo Liu (Benjamin Liu)

@Benjamin_eecs

May 1

RT @jaseweston: 💎Autodata: an agentic data scientist to create high quality data✨ We introduce a method for building agents that create hi…

105

Bo Liu (Benjamin Liu) retweeted

Ineffable Intelligence

@IneffableLabs

Apr 27

Introducing Ineffable Intelligence. Led by David Silver, we're assembling the best engineers and researchers in the world to make first contact with superintelligence. We’ll be solving the hardest problems in AI on the way. Come join us. ineffable.ai

160

1,423

351,049

Bo Liu (Benjamin Liu) retweeted

Jason Weston

@jaseweston

Apr 24

DeepSeek-V4 uses our Hash routing approach developed back in 2021 -- see screenshot of their tech report! (Looks like a great model, congrats!) Bonus note: our same blogpost (& paper) back in 2021 also introduced 'looped transformers', but we called that staircase & ladder (see screenshot): parl.ai/projects/params_vs_c… huggingface.co/deepseek-ai/D…

457

31,700

Bo Liu (Benjamin Liu) retweeted

Deli Chen

@victor207755822

Apr 24

DeepSeek-V3: Dec 26, 2024. DeepSeek-V4: Apr 24, 2026. 484 days later, we humbly share our labor of love. As always, we stay true to long-termism and open source for all. AGI belongs to everyone. ❤️🌍 #DeepSeekV4 #AGIforEveryone #OpenSource

DeepSeek

@deepseek_ai

Apr 24

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. 🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. 🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice. Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today! 📄 Tech Report: huggingface.co/deepseek-ai/D… 🤗 Open Weights: huggingface.co/collections/d… 1/n

353

1,236

13,022

1,072,379

Bo Liu (Benjamin Liu) retweeted

Yu Su

@ysu_nlp

Apr 21

Introducing @NeoCognition, the agent lab for specialized intelligence. Everyone needs experts, but human expertise does not scale. Backed by $40M seed funding, we build self-learning agents that specialize across domains to make expertise abundant.

[Video, 1:34]

132

877

187,577

Bo Liu (Benjamin Liu) retweeted

Jason Weston

@jaseweston

Apr 3

🧮 Reasoning over Mathematical Objects 🧮 Our 70-page(!) paper is out on arXiv, as covered by several of our recent blog posts. We study how to improve reasoning on hard tasks (e.g., math expressions) via: • better training data (& new evals) • better reward models (on-policy trained) • better inference methods (on-policy trained) 📝: arxiv.org/pdf/2603.18886

205

14,633

Bo Liu (Benjamin Liu) retweeted

Jason Weston

@jaseweston

Mar 30

🔗Learning to Aggregate through Online RL🎯 ParaGator🔀🐊: strong parallel reasoning aggregation Core claim: aggregation works best when training both stages together: - LLM generator should produce diverse candidates - LLM aggregator should synthesize into final answer ParaGator trains candidate generation with pass@k, and aggregation with pass@1 on-policy, end-to-end. Stops mode collapse/off-policy mismatch. Improves math & scientific reasoning. 🚀🏆 Read more in the blog post: facebookresearch.github.io/R…

123

10,988

Bo Liu (Benjamin Liu) retweeted

Jason Weston

@jaseweston

Mar 23

🌐Unified Post-Training via On-Policy-Trained LM-as-RM🔧 RLLM = RL LM-as-RM: - post-training framework that unifies RL across easy-, hard-to-verify, and non-verifiable tasks. - trains the LM-as-RM reward model on-policy from the policy’s own outputs, then uses those generative rewards to optimize the policy. 🔗📈 - uses the LLM’s reasoning instruction-following for higher-quality rewards — boosting performance on all task types. 🚀🤖🏆 Read more in the blog post: facebookresearch.github.io/R…

310

25,939

Bo Liu (Benjamin Liu) retweeted

Seungone Kim

@seungonekim

Mar 20

🧮New work from @AIatMeta & @LTIatCMU! LM reasoning benchmarks mostly use simple answers like numbers (AIME) or multiple-choice options (GPQA). But for complex mathematical objects, performance drops sharply. We propose a set of solutions to solve this: arxiv.org/abs/2603.18886

9,605

Bo Liu (Benjamin Liu) retweeted

Jason Weston

@jaseweston

Mar 20

🧮 Principia: Training LLMs to Reason over Mathematical Objects 📐 We release: - PrincipiaBench, a new eval for *mathematical objects* (not just numerical values or MCQ) - Principia Collection: training data that improves reasoning across the board. For models to help with scientific and mathematical work, you need to train on such data & test whether they can derive things like equations, sets, matrices, intervals, and piecewise functions. We show that this ends up improving the overall reasoning ability of your model for all tasks. Read more in the blog post: facebookresearch.github.io/R…

127

13,019

Bo Liu (Benjamin Liu) retweeted

Xidong Feng

@Xidong_Feng

Mar 15

We've witnessed a crazy concurrent line of work on on-policy self-distillation in LLMs, and I truly believe this is the next paradigm of RL. Back in 2024, we proposed this exact conceptual shift in our paper, Natural Language Reinforcement Learning (NLRL). The real breakthrough here isn't just the specific distillation mechanics. It’s that RL is fundamentally shifting away from the traditional "sample -> then filter or amplify" approach. Instead of passively waiting to stumble upon a good action to upweight, the field is moving toward true synthetic language data generation from experience, which enables true continual learning. You can see this exact recipe playing out across all the recent hit papers: • RLTF (2602.02482): Text critiques as privileged info • OPSD (2601.18734): Ground-truth solutions • SDPO (2601.20802): Runtime errors & execution feedback • ERL(2602.13949): Self-reflections & demonstrations Instead of just using a scalar reward to filter bad rollouts, they all use language feedback to explicitly generate a corrected, high-quality trajectory in hindsight, and then distill that competence back into the base policy. While the specific ways we adapt RL to LLMs are still rapidly evolving, the core vision we outlined in NLRL holds true today: a single scalar is simply too poor of a carrier for credit assignment. When people talk about "experiential memory" for agents today, they are essentially describing what we framed as a Language Value Function (LVF)—not just RAG over past episodes, but storing the structured, strategy-level "why" behind what worked. And what we called "Language Policy Improvement" is exactly this feedback-aware self-distillation loop we see everywhere now. Language, not scalars, is the future of RL. 📄 Check out our early exploration of this framework here: arxiv.org/abs/2411.14251

205

32,401