I caught every single hallucination across 3 frontier AI models. Every one. Zero false accusations.
I've been working on stopping hallucinations in AI reasoning for some time now. Experiments, research, breakthroughs and roadblocks.
On April 7th it all changed. I ran the full GPQA Diamond benchmark. 198 PhD-level science questions that even domain experts only score 65-70% on. Three frontier models. One frozen detection architecture.
3 days into a benchmark marathon, a full run hit a 100% catch rate. 0 false accusations across 498 correct answers. I had to do a double take.
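For anyone unsure what "catch rate" and "false accusations" mean here, this is a minimal sketch of how the two numbers could be scored. It is not the detection architecture itself. It assumes a hallucination is simply a wrong answer and that the detector emits one flag per answer; `Verdict`, `score`, and the toy data are hypothetical names and values for illustration, not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    answer_correct: bool  # ground truth: did the model answer the question correctly?
    flagged: bool         # detector output: was the answer flagged as a hallucination?


def score(verdicts):
    """Return catch rate over wrong answers and false accusations over correct ones."""
    wrong = [v for v in verdicts if not v.answer_correct]
    right = [v for v in verdicts if v.answer_correct]
    caught = sum(v.flagged for v in wrong)             # hallucinations the detector flagged
    false_accusations = sum(v.flagged for v in right)  # correct answers wrongly flagged
    return {
        "n_wrong": len(wrong),
        "n_correct": len(right),
        "catch_rate": caught / len(wrong) if wrong else None,
        "false_accusations": false_accusations,
    }


# Toy data only (assumption): two correct answers left alone, two wrong answers flagged.
toy = [
    Verdict(answer_correct=True, flagged=False),
    Verdict(answer_correct=False, flagged=True),
    Verdict(answer_correct=True, flagged=False),
    Verdict(answer_correct=False, flagged=True),
]
result = score(toy)
print(result)
```

A perfect run in these terms means `catch_rate` equals 1.0 and `false_accusations` equals 0 at the same time. Either number alone is easy to hit by flagging everything or flagging nothing.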
24 hours later I had the same result across all three SOTA models. Byte-identical outputs. 3 runs. Fully deterministic. A frozen architecture, cryptographically hashed and patent pending before I discussed it publicly.
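"Byte-identical" is a claim anyone can check mechanically. A minimal sketch, assuming each run's raw output is available as bytes: hash every run with SHA-256 and confirm there is exactly one distinct digest. The helper names and sample bytes are hypothetical, and this is not necessarily how the frozen architecture itself was hashed.

```python
import hashlib


def digest(output: bytes) -> str:
    """SHA-256 hex digest of one run's raw output."""
    return hashlib.sha256(output).hexdigest()


def runs_identical(outputs) -> bool:
    """True only if there is at least one run and every run hashes to the same digest."""
    digests = {digest(o) for o in outputs}
    return len(digests) == 1


# Hypothetical stand-in for three runs of the same frozen configuration.
runs = [b'{"q1": "flag", "q2": "pass"}\n'] * 3
print(runs_identical(runs))
```

Comparing digests rather than parsed results is deliberate: a single differing byte, even whitespace, changes the hash, so the check cannot pass on "close enough".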
The models tested: Gemini 3.1 Pro, GPT 5.4, and Claude Opus 4.6. Gemini and GPT on standard API calls, Claude via the standard Claude Code terminal. No special prompting, no per-model tuning. The same config catches everything across all three.
No tool use, no extended reasoning, no best-of-N sampling. Gemini's rate held close to its published number. Opus and GPT hallucinated more than their benchmark claims suggest, because those claims are typically made with tools and inference-time tricks turned on. The harness caught every hallucination regardless of which model produced it.
Single frozen configuration. ~400ms deterministic latency. Runs on consumer hardware.
Full companion paper with empirical evidence dropping this week.
Will be raising to scale deployment across domains and make available for enterprise use cases in high-stakes industries.
Just the beginning for Symplectic Dynamics and yes that is a real terminal output.