The recipe for “classic” reasoning benchmarks is simple: text-only, several-hour time horizons, easy to grade, with expert human baselines.
What next? In this week’s Gradient Update, @GregHBurnham argues it’s as easy as dropping one of these four ingredients.