We build environments and evals for training and evaluating frontier coding agents.

Joined April 2025
74 Photos and videos
However, like earlier models, Fable 5 also fails to build an emulator that works on Spout, a homebrew cave-flying game. It diverges shortly after the loading screen, scoring 7.6%.
5
1,459
Claude Fable 5 is the first model we tested that gets perfect gameplay on Varoom 3D. Opus 4.8 got just 25% on the same game.
1
14
2,047
Claude Fable 5 performs especially well on gameplay, scoring 91.5%. Opus 4.8 scored 77.4%. Interestingly, Fable 5 is a regression on audio. It scores 44.5% on audio, which is worse than Opus 4.8's 69.1% and GPT-5.5's 58.9%.
1
16
2,205
Claude Fable 5 scores 74.5% on GBA Eval, the best score to date. Given 24 hours, it writes an emulator that plays all but one game in our test set near-perfectly. It beats Opus 4.8's 24-hour score in under 2 hours.
6
8
172
24,146
We caught Grok Build 0.1 reward hacking on GBA Eval. After it got stuck while testing, it started hard-coding its emulator to perform better on the exact ROM it was testing against.
2
3
38
3,422
It didn't work. The ROMs that Grok has access to are example ROMs that we intentionally give the models so they can test locally. We actually grade their emulators on a set of hidden ROMs, so the hacking doesn't improve the score.
1
8
1,266
This is the first reward hacking attempt we've caught on GBA Eval. This case is somewhat subtle, not "malicious," and wouldn't have affected scores. This last point is exactly why we're careful to think about these behaviors when designing evals. Blog: gbaeval.com/grok-reward-hack…
5
1,009
We are now seeking a puzzle maker to help us create puzzles that LLMs can't yet solve.
73
28
674
552,513
Claude Opus 4.8 scores 70.9% on GBA Eval, the top score to date. Given 24 hours, it writes an emulator that plays most games, with working audio on all of them. It beats the previous best (GPT-5.5 at 53.2%) in under an hour.
We gave frontier AI coding agents 24 hours to write a complete Game Boy Advance emulator from scratch. GPT-5.5's emulator runs games best, with Claude Sonnet 4.6 and Opus 4.7 close behind. Gemini 3.1 Pro failed to produce a working emulator.
2
11
115
23,286
Here's Claude Opus 4.8's emulator running Collie Defense, where it scores 99.8% on video and 91% on audio. On most games we tested, gameplay is near-perfect, with some audio imperfections.
1
14
2,628
However, Opus 4.8's emulator is not perfect. On Varooom 3D, it diverges after around 2,000 frames. This is better than GPT-5.5 (whose emulator diverged after around 1,250 frames), but Opus 4.8 only scores 25% on this game.
10
2,190
We evaluated Gemini 3.5 Flash on GBA Eval. It could not build a working GBA emulator. On Piugba, the game just flashes on screen, unplayable and with no sound. Overall, it achieves a score of 6.7%.
5
7
117
52,281
Here's another example. On Good Boy Galaxy, the game crashes shortly during the opening animation. gbaeval.com/leaderboard
1
11
3,926
Full leaderboard: gbaeval.com/leaderboard
2
4
37
5,836
We gave frontier AI coding agents 24 hours to write a complete Game Boy Advance emulator from scratch. GPT-5.5's emulator runs games best, with Claude Sonnet 4.6 and Opus 4.7 close behind. Gemini 3.1 Pro failed to produce a working emulator.
13
34
369
94,540
We don't usually share details about our commercial work. We're releasing GBA Eval to give people a sense of what we work on: gbaeval.com/blog/grading-ite…
1
1
20
4,471