Humanity’s Last Exam, or HLE, is a 2,500-question multimodal test designed to measure expert-level reasoning.
- xAI states that Grok 4 Heavy is the first model to cross the 50% mark on HLE.
- OpenAI’s o3 (high configuration) records 20.32% on the same leaderboard, while an o3 variant optimized for data retrieval reaches 26.6% in OpenAI’s own release notes.
- OpenAI’s o3 at 20-27% answers roughly 1 question in 4 or 5, so ScienceOne currently doubles o3’s demonstrated accuracy.
- Claude’s 10-11% means it answers only about 1 question in 9 or 10, giving ScienceOne almost a 4× edge.
- Scores reflect strict exact-match grading, so even small percentage differences matter.
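
The "1 question in N" figures above are simple conversions from percentages, and they can be checked with a short sketch. The helper names `one_in_n` and `questions_correct` are illustrative, not part of any official HLE tooling, and the only inputs are the 2,500-question total and the two o3 scores quoted above.

```python
TOTAL_QUESTIONS = 2500  # HLE size as stated in this section


def one_in_n(accuracy_pct: float) -> float:
    """Return N such that a model at this accuracy answers about 1 question in N."""
    return 100.0 / accuracy_pct


def questions_correct(accuracy_pct: float, total: int = TOTAL_QUESTIONS) -> int:
    """Approximate number of questions answered correctly at this accuracy."""
    return round(total * accuracy_pct / 100.0)


# Scores quoted in this section; labels are descriptive only.
scores = {
    "o3 (high)": 20.32,
    "o3 (retrieval-optimized variant)": 26.6,
}

for name, pct in scores.items():
    print(
        f"{name}: {pct}% is about 1 in {one_in_n(pct):.1f}, "
        f"or roughly {questions_correct(pct)} of {TOTAL_QUESTIONS} questions"
    )
```

On a 2,500-question test, one percentage point corresponds to 25 questions, which is why small percentage gaps still represent a meaningful number of answers.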