Mechanize

Mechanize

76 Photos and videos

Tweets

Mechanize

@MechanizeWork

12h

Youtube: youtube.com/watch?v=UpO70AJG… Substack: mechanizework.substack.com/p… Spotify: open.spotify.com/show/033krx…

How bad data teaches models to write terrible code

Max Niederman, Stephen Yang, and Ege Erdil talk about evals: what t...

youtube.com

1,740

Mechanize

Mechanize

@MechanizeWork

12h

Our new podcast on evals, with Max Niederman, Ege Erdil, and Stephen Yang. 0:00:00 – What's an eval, and how's it different from an RL environment? 0:19:33 – Why are models bad at building an emulator when the task is fully verifiable? 0:42:00 – How does training on bad data teach models to write terrible code? 1:04:00 – Why is continual learning still so bad? 2:25:24 – Why haven't software engineers been replaced when coding is basically solved? Listen to the Mechanize Podcast on YouTube, Spotify, etc. Enjoy!

2:42:04

171

18,080

Mechanize

Mechanize

@MechanizeWork

Jun 10

However, like earlier models, Fable 5 also fails to build an emulator that works on Spout, a homebrew cave-flying game. It diverges shortly after the loading screen, scoring 7.6%.

1:00

2,065

Mechanize

Mechanize

@MechanizeWork

Jun 10

Claude Fable 5 is the first model we tested that gets perfect gameplay on Varoom 3D. Opus 4.8 got just 25% on the same game.

1:00

2,467

Mechanize

Mechanize

@MechanizeWork

Jun 10

Claude Fable 5 performs especially well on gameplay, scoring 91.5%. Opus 4.8 scored 77.4%. Interestingly, Fable 5 is a regression on audio. It scores 44.5% on audio, which is worse than Opus 4.8's 69.1% and GPT-5.5's 58.9%.

2,713

Mechanize

Mechanize

@MechanizeWork

Jun 10

Claude Fable 5 scores 74.5% on GBA Eval, the best score to date. Given 24 hours, it writes an emulator that plays all but one game in our test set near-perfectly. It beats Opus 4.8's 24-hour score in under 2 hours.

173

24,496

Mechanize

Mechanize

@MechanizeWork

Jun 9

We caught Grok Build 0.1 reward hacking on GBA Eval. After it got stuck while testing, it started hard-coding its emulator to perform better on the exact ROM it was testing against.

3,609

Mechanize

Mechanize

@MechanizeWork

Jun 9

It didn't work. The ROMs that Grok has access to are example ROMs that we intentionally give the models so they can test locally. We actually grade their emulators on a set of hidden ROMs, so the hacking doesn't improve the score.

1,409

Mechanize

Mechanize

@MechanizeWork

Jun 9

This is the first reward hacking attempt we've caught on GBA Eval. This case is somewhat subtle, not "malicious," and wouldn't have affected scores. This last point is exactly why we're careful to think about these behaviors when designing evals. Blog: gbaeval.com/grok-reward-hack…

GBA Eval - Build a Game Boy Advance emulator in WebAssembly from scratch

Frontier AI coding agents try to write a Game Boy Advance emulator from scratch. Their emulators are graded against Mesen2.

gbaeval.com

1,136

Mechanize

Mechanize

@MechanizeWork

Jun 3

We are now seeking a puzzle maker to help us create puzzles that LLMs can't yet solve.

675

552,960

Mechanize

Mechanize

@MechanizeWork

Jun 3

Apply here: mechanize.work/apply/puzzle-…

Puzzle Maker

Design interesting and original puzzles that LLMs can't yet solve

mechanize.work

14,328

Mechanize

Mechanize

@MechanizeWork

Jun 1

Claude Opus 4.8 scores 70.9% on GBA Eval, the top score to date. Given 24 hours, it writes an emulator that plays most games, with working audio on all of them. It beats the previous best (GPT-5.5 at 53.2%) in under an hour.

Mechanize

@MechanizeWork

May 14

We gave frontier AI coding agents 24 hours to write a complete Game Boy Advance emulator from scratch. GPT-5.5's emulator runs games best, with Claude Sonnet 4.6 and Opus 4.7 close behind. Gemini 3.1 Pro failed to produce a working emulator.

0:30

115

23,481

Mechanize

Mechanize

@MechanizeWork

Jun 1

Here's Claude Opus 4.8's emulator running Collie Defense, where it scores 99.8% on video and 91% on audio. On most games we tested, gameplay is near-perfect, with some audio imperfections.

1:00

2,681

Mechanize

Mechanize

@MechanizeWork

Jun 1

However, Opus 4.8's emulator is not perfect. On Varooom 3D, it diverges after around 2,000 frames. This is better than GPT-5.5 (whose emulator diverged after around 1,250 frames), but Opus 4.8 only scores 25% on this game.

1:00

2,239

Mechanize

Mechanize

@MechanizeWork

May 28

We are seeking research engineers who will build evals that test for misaligned model behavior.

169

69,954

Mechanize

Mechanize

@MechanizeWork

May 28

Apply here: mechanize.work/apply/researc…

Research Engineer, Alignment

Build evals that test for misaligned model behaviors. $500K plus equity and bonuses. In person, San Francisco.

mechanize.work

3,199

Mechanize

Mechanize

@MechanizeWork

May 21

We evaluated Gemini 3.5 Flash on GBA Eval. It could not build a working GBA emulator. On Piugba, the game just flashes on screen, unplayable and with no sound. Overall, it achieves a score of 6.7%.

0:30

116

52,340

Mechanize

Mechanize

@MechanizeWork

May 21

Here's another example. On Good Boy Galaxy, the game crashes shortly during the opening animation. gbaeval.com/leaderboard

0:30

3,957

Mechanize

Mechanize

@MechanizeWork

May 21

Full leaderboard: gbaeval.com/leaderboard

5,866

Mechanize

Mechanize

@MechanizeWork

May 14

0:30

369

94,804

more replies

Mechanize

Mechanize

@MechanizeWork

May 14

We don't usually share details about our commercial work. We're releasing GBA Eval to give people a sense of what we work on: gbaeval.com/blog/grading-ite…

4,500

Mechanize

Mechanize

@MechanizeWork

May 14

If you'd want to work on problems like this, we're hiring: mechanize.work/apply/

Apply

We're hiring software engineers, a research engineer focused on alignment, a growth manager, a graphic designer, an office manager, a counsel, a recruiter, and an operations generalist. In person,...

mechanize.work

4,599