CS Ph.D. at @SCSatCMU. Funded by @NDSEG Fellowship. Intern @arena. Editor at blog.ml.cmu.edu.

Joined July 2013
39 Photos and videos
Pinned Tweet
New preprint alert 🚨 Can LLM agents develop video games? We release GameDevBench, the first benchmark evaluating agentic game development in a game engine, Godot. We also present two simple multimodal feedback mechanisms that lead to immediate performance gains. /🧵
19
27
257
26,273
I miss when frontier labs tackled fun benchmarks. Fable 5’s system card has 50 benchmarks, yet almost all of them feel... sterile. Where are the creative writing evals? Don’t we want to know if models can actually make something delightful? Creative industries are huge, profitable, and technically demanding, perfect for frontier evals. Game dev, for example, tests long-context reasoning, multimodal understanding, tool-use, and taste. Obviously, I’d love to see GameDevBench in the mix. But really, even one fun or creative benchmark would be refreshing. The world could use a little more whimsy. 🪅
2
24
1,457
This is 🥜 btw
Jun 10
Claude Fable 5 ranks #1 in Code Arena: Frontend, leading by a wide margin over Opus-4.8. Highlights: - #1 in every sub leaderboard: HTML, React - #1 in every sub category: Brand & Marketing, Reference-Based Design, Data & Analytics, Consumer Product, Gaming, Simulations, and Content Creation Tools. Huge congrats to @AnthropicAI for this milestone! The thread breaks down how Claude Fable 5 ranks across single-modality arenas.
1
1
423
Things you can do this summer while Claude Code works for you!
Things you should do this summer in San Francisco NorCal instead of sitting inside with Claude Code: - Grab fresh oysters from Tomales Bay - Pick strawberries, cherries, & blackberries in Brentwood - Walk the SF Crosstown Trail - Picnic in Dolores, GGP, Lafayette, Crissy Field, or Alamo Square parks - Go on a wine tour in Sonoma & Napa - Take the ferry from SF to Sausalito - Drive down to Santa Cruz and grab burritos on the beach - Visit the Ferry Building or Fort Mason farmers markets - Take a trip to Muir Woods - Drive down Highway 1 for insane views - Explore or camp in Carmel - Hike Mission Peak in Fremont - Visit Yosemite/Half Dome - Golf in Half Moon Bay - Polar Plunge at Aquatic Park or Ocean Beach - Dine at the Taco Bell Cantina in Pacifica
1
239
Agents are taking over!
Introducing Agent Arena: real-world agentic evals at scale. How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks. On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents. Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more. Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination. This leaderboard snapshot is built from 300K tasks, 2M tool calls, and 40M lines of code by agents. Top labs in Agent Arena: - #1 @OpenAI: GPT-5.5 (High) - #2 @AnthropicAI: Claude-Opus-4.7 (Thinking) - #3 @Zai_org: GLM-5.1 - #4 @GoogleDeepMind: Gemini-3.1-Pro - #5 @Kimi_Moonshot: Kimi-K2.6 More analysis in the thread, with the full technical blog below.
1
14
1,441
I've joined @arena for the summer where I'll be working on ... something new and secret. Super excited to work with @ml_angelopoulos, @infwinston, and @istoica05 again!
6
2
49
4,434
Alright I guess it's time to test Opus 4.8 on GameDevBench
Opus 4.8 is amazing, isn't it…………
1
3
1,094
From my experience doing both, this is the most accurate differentiator. Most of the differences in methods and skills stem from this.
IMO a researcher studies a problem that may not be solvable, while an engineer solves a problem that is considered solvable.
5
840
Wayne Chi retweeted
Attention @arxiv authors: Our Code of Conduct states that by signing your name as an author of a paper, each author takes full responsibility for all its contents, irrespective of how the contents were generated. 1/
139
921
6,555
1,088,708
We observed this a month or two ago on GameDevBench! Ever since GPT 5.4, @OpenAI took over as the best agent for game development. However @AnthropicAI was never in the lead; the best was actually @Google with Gemini (good at multimodal understanding). Good to see further confirmation on what's SOTA for game development.
Fun fact, GPT 5.5 is very good at Game Dev. Game Dev is the notable category where @OpenAI consistently beats out @AnthropicAI's Claude models. Upon code inspection, our @Designarena team found that GPT 5.5's frontend verbosity plays in its favor for game dev: it consistently created games with the most functional features. Congrats to @OpenAI for establishing the new Game Dev frontier!
3
1
12
2,464
A big downside with the new focus on ArXiv is you have to read (and eventually cite) some absolutely awful papers that would clearly never pass peer review...
6
699
I love how southern Jensen sounds when he says America. 'Murica!🇺🇸🇺🇸🇺🇸🦅🦅🦅
1
170
GameDevBench has been accepted into ICML 2026! See everyone in Seoul soon!
New preprint alert 🚨 Can LLM agents develop video games? We release GameDevBench, the first benchmark evaluating agentic game development in a game engine, Godot. We also present two simple multimodal feedback mechanisms that lead to immediate performance gains. /🧵
4
28
1,472
Exciting work and really cool to see Moonlake reference GameDevBench as a precursor to their work! The future of agentic game development is bright ☀️
Introducing Moonlake's 3D Agent. Our agent acts like a technical artist that can build and reconstruct articulated assets and large-scale editable scenes with hundreds of objects from a single image and can improve its generations continuously. Learn more in the thread below.
16
1,032
The presenters in front of me took 15 minutes instead of 10 minutes each. And then the conference organizer CUT MY QUESTIONS??? wtf @iclr_conf
1
2
30
6,174
And they're cutting the next presenter's questions too???
4
684
I will be presenting EDIT-Bench as an Oral at ICLR on Friday 4/23! Session 4D starts at 3:15 and the talk is at 3:39. We will also be at poster session 3 in the morning. See you all there!
19 Nov 2025
Tired of evaluating LLMs on made-up problems that look nothing like real tasks? Introducing EDIT-Bench, a code editing benchmark built from in-the-wild user interactions in VSCode. Real-world edits are challenging: only 1/40 models score > 60% pass@1.
8
31
4,443
I think I might be addicted to making benchmarks... evaluating LLMs is, for some strange reason, incredibly fun... Anyways new benchmark coming soon!
1
12
437
Wayne Chi retweeted

18
48
376
77,849
No more benchmarks. Only tier lists going forward
Benchmarks? Where we’re going, we don’t need benchmarks.
1
11
1,329
Slay the Spire 2 is having one of the most successful launches in indie gaming history... And it's made entirely in Godot. I think Godot will have a meteoric rise in the coming years, and it's a big reason why I focused GameDevBench (arxiv.org/abs/2602.11103) on Godot.
Indie game Slay the Spire 2 has surpassed 500,000 concurrent players on Steam. The roguelike is now in the top 20 games with the highest all-time peaks on Valve's platform.
1
1
8
591