ARC-AGI-3 scoring improvements
Since previewing ARC-AGI-3, nearly one million scorecards have been submitted on public environments. That real-world data helps us stress-test and harden our scoring approach.
Based on what we’ve observed, we’re announcing two updates to ARC-AGI-3 scoring:
1. The per-level baseline is now less sensitive to outlier performances, reducing the impact of luck on individual levels
A single unusually efficient human run no longer defines the baseline for ARC-AGI-3 scoring. Instead, the baseline now reflects more typical human play. Technical change: the human baseline, which normalizes scores, moves from the 2nd-best player to the median player on each level.
2. A single subpar level no longer disproportionately drags down an overall score
A test taker who generalizes well across an entire environment is no longer penalized by a single subpar level. Technical change: the per-level score cap increases from 100% to 115%.
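To make the two changes concrete, here is a minimal sketch rather than the official scoring code. It assumes a level's score is the plain ratio of baseline actions to the test taker's actions, clipped at the cap, and that an environment score is the unweighted mean of its level scores. Neither formula is stated in this post. The function names and all action counts are hypothetical.

```python
from statistics import median


def baseline_second_best(human_action_counts):
    """Old baseline: the 2nd-fewest actions any human needed on a level."""
    ordered = sorted(human_action_counts)
    return ordered[1] if len(ordered) > 1 else ordered[0]


def baseline_median(human_action_counts):
    """New baseline: the median human action count on a level."""
    return median(human_action_counts)


def level_score(agent_actions, baseline_actions, cap):
    """ASSUMPTION: score is baseline / agent actions, clipped at `cap`."""
    return min(cap, baseline_actions / agent_actions)


def environment_score(agent_counts, human_counts_per_level, baseline_fn, cap):
    """ASSUMPTION: overall score is the unweighted mean of level scores."""
    scores = [
        level_score(agent, baseline_fn(humans), cap)
        for agent, humans in zip(agent_counts, human_counts_per_level)
    ]
    return sum(scores) / len(scores)


# Hypothetical action counts: three levels, five human players each.
humans = [
    [10, 30, 50, 60, 100],  # one unusually efficient run (10 actions)
    [40, 45, 50, 70, 90],
    [20, 25, 40, 50, 60],
]
agent = [50, 40, 80]  # the test taker is weak on the third level

old = environment_score(agent, humans, baseline_second_best, cap=1.00)
new = environment_score(agent, humans, baseline_median, cap=1.15)
print(f"old scoring: {old:.1%}, new scoring: {new:.1%}")
```

In this made-up example the median baseline is less demanding than the 2nd-best run on every level, and the 115% cap lets the strong second level offset part of the weak third one.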
For a view of how action efficiency translates into scores, see how the 11 human players who played re86 during testing scored.