GPT-o3's Reservoir Sampling Score Plummets from 100 to 0: The Truth About Code Execution Is in the Details
In the v6 evaluation, GPT-o3's main score rose from 75.86 to 82.82, but its score on the strict "Reservoir Sampling" question collapsed from 100 to 0, significantly undermining the credibility of its...
winzheng.com