2 quick updates, and a look-ahead, exactly a year on from first testing models on Simple-Bench:
1) Claude 4 busted our rate limits, and my entreaties to @AnthropicAI (to allow us to spend more money!) have yet to bear fruit. A shame, as I am fairly confident Opus 4 would be SOTA.
2) Gemini 2.5 Pro 05-06 and Flash 05-20 (the latest versions) are actually a slight downgrade in both performance and instruction-following. The one full run we got out of 2.5 Pro scored 46% (below the previous version's 51%). We would prefer to get an AVG@5, for fairness, before posting on the leaderboard.
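For anyone unfamiliar with AVG@5, here is a minimal sketch, assuming it simply means the mean score across five full runs of the benchmark. The function name `avg_at_k` and the strict "exactly k runs" check are illustrative choices, not our actual harness.

```python
from statistics import mean


def avg_at_k(run_scores, k=5):
    """Mean score over exactly k full benchmark runs.

    Assumption: AVG@k is the plain arithmetic mean of k full-run scores.
    Refuses to report a number when fewer (or more) than k runs are available.
    """
    if len(run_scores) != k:
        raise ValueError(f"need exactly {k} full runs, got {len(run_scores)}")
    return mean(run_scores)
```

With only one full run in hand, a call like `avg_at_k([0.46])` raises rather than returning a number, which is the point: a single run is not yet a leaderboard score.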
Thoughts: RL becoming 20% of the compute spend for frontier models may have more strange side effects than labs were anticipating. 'Over-eagerness' over simply following commands seems barely under control.
On Simple-Bench, I had been fairly confident it would be saturated (>80-85%) by the end of the year. Now I think it is more like 50-50, and progress could instead slow for a while, as models become relentlessly optimised for dollar-maximising tasks, like software engineering, over general nous. Spatial intelligence, like spotting that the glove would fall onto the road, in the question pasted at the bottom of this tweet, is simply not yet as lucrative.
As ever, grateful to @weights_biases and @Ag_Mlynarczyk in particular for keeping the show on the road.
Q. A luxury sports-car is traveling north at 30km/h over a roadbridge, 250m long, which runs over a river that is flowing at 5km/h eastward. The wind is blowing at 1km/h westward, slow enough not to bother the pedestrians snapping photos of the car from both sides of the roadbridge as the car passes. A glove was stored in the trunk of the car, but slips out of a hole and drops out when the car is half-way over the bridge. Assume the car continues in the same direction at the same speed, and the wind and river continue to move as stated. 1 hour later, the water-proof glove is (relative to the center of the bridge) approximately?
Models (super-trained on HS Math): 4km East
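Where the 4km comes from, as a rough sketch: the model treats the glove as floating in the river and nets the river against the wind. The function names and the `lands_on_bridge` flag are made up for illustration, and the sketch assumes the 1km/h wind does not noticeably move a glove lying on the road.

```python
def naive_drift_km(river_kmh=5.0, wind_kmh=1.0, hours=1.0):
    """The 'HS Math' reading: glove rides the river east, wind pushes it back west."""
    return (river_kmh - wind_kmh) * hours  # eastward displacement in km


def glove_offset_km(lands_on_bridge=True, **kwargs):
    """If the glove lands on the road surface, the river never acts on it.

    Assumption: the light wind does not shift a glove resting on the road,
    so the offset from the bridge center is treated as zero.
    """
    return 0.0 if lands_on_bridge else naive_drift_km(**kwargs)
```

`naive_drift_km()` gives 4.0, the models' answer. `glove_offset_km()` gives 0.0, since the glove dropped at the half-way point and stays on the road, well under 1km from the center of the bridge.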