ARC just published new #1 and #2 reproducible SOTA scores on our public leaderboard from @jeremyberman and @_eric_pang_. And their code is now open source! My analysis is below, including suggestions for application-layer AI and future research directions.
New SOTA:
- v1: 79.6%, $8.42/task
- v2: 29.44%, $30.40/task
Jeremy and Eric’s approaches share a lot in common:
1. Both use Grok 4 as a base, chosen as the best off-the-shelf AI reasoning system
2. Both implement program synthesis systems on top of the LLM
3. Both use outer refinement loops and test-time adaptation
4. Both use abstraction library learning
5. Both meaningfully improve accuracy over previous reproducible SOTA
6. Both are relatively efficient (5-20X single-shot cost), practical for deployment
7. Both are reproducible and open source, so others can build on them
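To make points 2 and 3 concrete, here is a minimal sketch of an outer refinement loop wrapped around a program proposer. It is not Jeremy's or Eric's code: `propose` is a hypothetical stand-in for an LLM call, and the toy proposer, the feedback string, and `max_rounds` are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]          # (input grid, expected output grid)
Program = Callable[[Grid], Grid]
Proposer = Callable[[List[Example], str], Program]


def score(program: Program, examples: List[Example]) -> float:
    """Fraction of training examples the program reproduces exactly."""
    hits = 0
    for inp, out in examples:
        try:
            if program(inp) == out:
                hits += 1
        except Exception:
            pass  # a crashing candidate just fails that example
    return hits / len(examples)


def refine(propose: Proposer, examples: List[Example],
           max_rounds: int = 5) -> Tuple[Optional[Program], float]:
    """Ask for candidates repeatedly, feeding back how the last one did."""
    best, best_score, feedback = None, -1.0, ""
    for _ in range(max_rounds):
        candidate = propose(examples, feedback)
        s = score(candidate, examples)
        if s > best_score:
            best, best_score = candidate, s
        if best_score == 1.0:
            break  # every training example is solved
        feedback = f"previous attempt solved {s:.0%} of examples"
    return best, best_score


def make_toy_proposer() -> Proposer:
    """Hypothetical proposer: walks a fixed list instead of calling an LLM."""
    candidates = iter([
        lambda g: g,                            # identity
        lambda g: [row[::-1] for row in g],     # mirror left-right
        lambda g: [list(r) for r in zip(*g)],   # transpose
    ])
    return lambda examples, feedback: next(candidates)


examples = [([[1, 2], [3, 4]], [[1, 3], [2, 4]])]
best, best_score = refine(make_toy_proposer(), examples)
```

In this toy run the loop stops on the third candidate (transpose), the first one that reproduces the training output. A real system would put the feedback into the next LLM prompt rather than ignore it.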
Their approaches vary in details.
Jeremy replaces his previous SOTA approach [2024], which had an LLM write Python code, with one that writes solutions in English (“natural language programs”).
He notes that ARC v2 depends on more complex perception, and that his English-based solution benefits from using a less precise substrate for reasoning than code.
He also moved beyond brute-force full program search. Now, English instructions are scored on partial tasks (e.g., explaining single examples), and high-scoring explanations are pooled together.
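A rough sketch of the "score on single examples, then pool" idea, under heavy assumptions: `follow` is a hypothetical stand-in for an LLM that executes an English instruction on a grid (here it is just a lookup table), and the top-k pooling rule is my own simplification, not Jeremy's published method.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]

# Hypothetical "interpreter" for English instructions. In a real system an LLM
# would read the instruction and produce the output grid.
TOY_BEHAVIOR: Dict[str, Callable[[Grid], Grid]] = {
    "flip each row": lambda g: [row[::-1] for row in g],
    "transpose the grid": lambda g: [list(r) for r in zip(*g)],
    "leave unchanged": lambda g: g,
}


def follow(instruction: str, grid: Grid) -> Grid:
    return TOY_BEHAVIOR[instruction](grid)


def score_per_example(instruction: str, examples: List[Example]) -> List[float]:
    """Score an instruction on each example separately (1.0 = exact match)."""
    return [1.0 if follow(instruction, inp) == out else 0.0
            for inp, out in examples]


def pool_top(instructions: List[str], examples: List[Example],
             k: int = 2) -> List[str]:
    """Keep the k instructions with the highest mean per-example score."""
    scored = []
    for ins in instructions:
        per_ex = score_per_example(ins, examples)
        scored.append((sum(per_ex) / len(per_ex), ins))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [ins for s, ins in scored[:k] if s > 0]


examples = [([[1, 2]], [[2, 1]]), ([[3, 3]], [[3, 3]])]
pool = pool_top(list(TOY_BEHAVIOR), examples)
```

Because scoring is per example, an instruction that explains only one example ("leave unchanged" here) still earns partial credit and can stay in the pool for later refinement.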
Eric’s approach combines ideas from evolutionary program synthesis and DreamCoder [2020]. But his approach diverges from DreamCoder in several meaningful ways.
First, instead of the symbolic AST substrate used in DreamCoder, his LLM writes full code programs and stores them as text in a library, using an accuracy-based heuristic.
Second, instead of starting from a hand-crafted initial library, his program library starts empty. This is promising as it removes a key bottleneck to applying DreamCoder to new domains.
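The two differences above can be sketched as a library that starts empty and stores whole programs as text, keyed by an accuracy score. This is an illustration only, not Eric's implementation: the `ProgramLibrary` class, the `transform` entry-point name, and the `min_accuracy` admission threshold are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]


def accuracy(source: str, examples: List[Example]) -> float:
    """Run program text and return the fraction of examples it solves.

    Assumes the text defines `transform(grid)`. exec() on untrusted code is
    unsafe; a real system would sandbox this step.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        fn = namespace["transform"]
        hits = sum(1 for inp, out in examples if fn(inp) == out)
    except Exception:
        return 0.0
    return hits / len(examples)


@dataclass
class ProgramLibrary:
    # Maps program source text to the best accuracy seen. Starts empty.
    entries: Dict[str, float] = field(default_factory=dict)

    def add(self, source: str, acc: float, min_accuracy: float = 0.5) -> bool:
        """Admit a program only if it clears the (assumed) accuracy threshold."""
        if acc >= min_accuracy and acc > self.entries.get(source, -1.0):
            self.entries[source] = acc
            return True
        return False

    def top(self, k: int) -> List[str]:
        """Most accurate programs first, e.g. for reuse in later prompts."""
        return sorted(self.entries, key=self.entries.get, reverse=True)[:k]


FLIP = "def transform(grid):\n    return [row[::-1] for row in grid]\n"
IDENT = "def transform(grid):\n    return grid\n"
examples = [([[1, 2]], [[2, 1]]), ([[0, 5]], [[5, 0]])]

library = ProgramLibrary()
added = {src: library.add(src, accuracy(src, examples)) for src in (FLIP, IDENT)}
```

Here only the flip program is admitted. The identity program scores 0.0 and is rejected, so the library grows purely from what the task rewards rather than from hand-written primitives.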
Based on the last 12 months of public progress on ARC, we are building a good picture of the “right core ideas” for AGI. This is very exciting!
For ARC Prize 2025, I’d love to see a team swap Grok 4 for OSS LLMs and work to fit performance into the Kaggle constraints (targeting human efficiency).
For application areas where accuracy matters most (and latency and some cost can be traded), Jeremy and Eric’s open source outer loops should be considered.
And for further research, I encourage folks to take Jeremy and Eric’s ideas as inspiration and combine them with @lateinteraction DSPy and @LakshyAAAgrawal GEPA.