Interesting! There are improvements in certain directions: the best out-of-the-box model (GPT 5.5 Pro) got essentially 4/10 correct versus 2/10 last time, and Submission A should have gotten 7/10 except for some API error (see comments on P6) versus 5ish/10 for the best harness last time. For individual performances this is roughly in line with what I expected.
Collectively, the performance was no better than last time -- this falls well under the threshold that I said would be "disappointing for AI". There were far fewer teams than I thought there would be, though. In the planning stage I personally heard about more parties planning to participate than showed up in the end. Would be great if FirstProof could find a way to facilitate more participation without compromising the standards of transparency.
Important takeaway for the masses: math is far from "done"!