J Tyson retweeted
FirstProof results are out. My main takeaway: GPT 5.5 Pro is a very strong model. 3/4 teams used it. Our Princeton team used Gemini 3.1 with our fall '25-style harness (the original version performed very well on IMO problems). But it is clear that vanilla prompting of GPT 5.5 Pro gives very strong -- and token-efficient -- results on research-level math problems. 1stproof.org/assets/docs/rep…

Replying to @littmath
Interesting! There are improvements in certain directions: the best out-of-the-box model (GPT 5.5 Pro) got essentially 4/10 correct versus 2/10 last time, and Submission A should have gotten 7/10 except for some API error (see comments on P6) versus 5ish/10 for the best harness last time. For individual performances this is roughly in line with what I expected. Collectively, the performance was no better than last time -- this falls well under the threshold that I said would be "disappointing for AI". There were far fewer teams than I thought there would be, though. In the planning stage I personally heard about more parties planning to participate than showed up in the end. It would be great if FirstProof could find a way to facilitate more participation without compromising the standards of transparency. Important takeaway for the masses: math is far from "done"!