In November 2023, Yann LeCun, Thomas Wolf and others from Meta and Huggingface created a benchmark called GAIA, which described itself as: "A benchmark for General AI Assistants that, if solved, would represent a milestone in AI research." Most of the problem solutions were kept private, not released online.
It proposed 466 "real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency."
On the hardest level, the average human score was 87%, while the leading systems scored less than 3%. 10 months later OpenAI released O1-preview, reaching ~30% on that level.
Now in 2026 the human baseline for the hardest level has officially been surpassed, the best agent systems are now scoring 88.9% on GAIAs hardest level (level 3).