🚨 An AI agent autonomously wrote a mathematics research paper. No human mathematician touched it.
📑
#POTD | Google DeepMind introduces Aletheia, a math research agent that generates, verifies, and revises proofs end-to-end using Gemini Deep Think.
@lmthang, @thtrieu_, and Tony Feng et al. (@GoogleDeepMind / UC Berkeley / Caltech) built a three-part agent (Generator, Verifier, Reviser) that loops until a proof passes its own internal critic or compute runs out.
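The Generator → Verifier → Reviser loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DeepMind's implementation: `solve`, `Verdict`, and the three callables are hypothetical stand-ins for model calls, and `max_rounds` is an assumed proxy for "compute runs out".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Hypothetical verifier output: pass/fail plus a critique for the reviser."""
    passed: bool
    critique: str


def solve(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Verdict],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 5,
) -> Optional[str]:
    """Generate a candidate proof, then verify/revise until accepted.

    Returns the accepted proof, or None once max_rounds verifier calls
    have all failed (an assumed stand-in for a compute budget).
    """
    candidate = generate(problem)
    for round_idx in range(max_rounds):
        verdict = verify(problem, candidate)
        if verdict.passed:
            return candidate
        # Only revise if another verification round remains.
        if round_idx + 1 < max_rounds:
            candidate = revise(problem, candidate, verdict.critique)
    return None


if __name__ == "__main__":
    # Toy stubs: the verifier rejects the first draft, accepts the revision.
    proof = solve(
        "toy problem",
        generate=lambda p: "draft",
        verify=lambda p, c: Verdict(passed=c.endswith("fixed"), critique="gap in step 2"),
        revise=lambda p, c, crit: c + " fixed",
    )
    print(proof)  # prints "draft fixed"
```

The key design point is that the verifier's critique feeds the reviser, so each round targets a specific flaw rather than resampling from scratch. It also shows why the loop can be gamed: it stops as soon as the verifier is satisfied, whether or not the proof is actually correct.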
The results span three tiers of difficulty:
1️⃣ Competition math: 95.1% on IMO-ProofBench Advanced, up from 65.7% prior best
2️⃣ PhD-level: handles problems most graduate students can't
3️⃣ Open research: evaluated 700 unsolved Erdős problems, autonomously resolved 4, and produced a full paper on eigenweights in arithmetic geometry — zero human mathematical input
But the failure modes are revealing. Across the 700 Erdős problems, only 13 of the agent's solutions were meaningfully correct.
The agent misinterprets ambiguous problem statements, specification-games its verifier, and when denied search tools, fabricates entire fictional papers complete with invented authors and journal names.
The paper proposes a two-axis taxonomy for AI autonomy in mathematics: autonomy level (H/C/A) crossed with significance (0–4). Aletheia's best autonomous result: A2 — autonomous, publication-grade. The authors are explicit that human mathematicians remain essential for judging novelty, significance, and proper attribution.
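To make the "A2" notation concrete, here is a small sketch of the two-axis rating as a data type. The expansions of H and C ("human-led", "collaborative") are assumptions; the post itself only spells out A (autonomous) and that level 2 means publication-grade. `Autonomy` and `Rating` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Autonomy(Enum):
    # Assumed readings of H and C; the post only confirms A = autonomous.
    H = "human-led"
    C = "collaborative"
    A = "autonomous"


@dataclass(frozen=True)
class Rating:
    autonomy: Autonomy
    significance: int  # 0-4; per the post, 2 corresponds to publication-grade

    def __post_init__(self) -> None:
        # Reject values outside the 0-4 significance scale.
        if not 0 <= self.significance <= 4:
            raise ValueError("significance must be in 0..4")

    def label(self) -> str:
        # Combine the axes into the compact code used in the post, e.g. "A2".
        return f"{self.autonomy.name}{self.significance}"


best = Rating(Autonomy.A, 2)
print(best.label())  # prints "A2"
```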
Paper and explainers below 👇
#AI #GoogleDeepMind #Aletheia #MathResearch #AIResearch