DoubleAI's AI system just beat a decade of expert GPU engineering
WarpSpeed just beat a decade of expert-engineered GPU kernels, every single one of them.
cuGraph is one of the most widely used GPU-accelerated libraries in the world. It spans dozens of graph algorithms, each written and continuously refined by some of the world's top performance engineers.
@_doubleAI_'s WarpSpeed autonomously rewrote and re-optimized these kernels across three GPU architectures (A100, L4, A10G). Today, we released the hyper-optimized version on GitHub: install it with no change to your code.
The numbers:
- 3.6x average speedup over human experts
- 100% of kernels benefit from speedup
- 55% see more than 2x improvement
But hasn't AI already achieved expert-level status, winning gold medals at IMO and outperforming top programmers on Codeforces? Not quite. Those wins share three hidden crutches: abundant training data, trivial validation, and short reasoning chains. Where all three hold, today's AI shines. Remove any one of them and it falls apart (as Shai Shalev-Shwartz wrote in his post).
GPU performance engineering breaks all three. Data is scarce. Correctness is hard to validate. And performance comes from a long chain of interacting choices: memory layout, warp behavior, caching, scheduling, graph structure. Even state-of-the-art agents like Claude Code, Codex, and Gemini CLI fail dramatically here, often producing incorrect implementations even when handed cuGraph's own test suite.
Scaling alone can't break this barrier. It took new algorithmic ideas: our Diligent framework for learning from extremely small datasets, our PAC-reasoning methodology for verification when ground truth isn't available, and novel agentic search structures for navigating deep decision chains.
This is the beginning of Artificial Expert Intelligence (AEI): not AGI, but something the world needs more, namely systems that reliably surpass human experts in the domains where expertise is rarest, slowest, and most valuable.
If AI can surpass the world's best GPU engineers, which domain falls next?
For the full blog:
doubleai.com/research/double…
cuGraph:
docs.rapids.ai/api/cugraph/s…
Winning Gold at IMO 2025:
arxiv.org/abs/2507.15855
Codeforces benchmarks:
rdworldonline.com/openai-rel…
@shai_s_shwartz post:
x.com/shai_s_shwartz/status/…
From Reasoning to Super-Intelligence: A Search-Theoretic Perspective
arxiv.org/abs/2507.15865
Artificial Expert Intelligence through PAC-reasoning
arxiv.org/abs/2412.02441
Are frontier AI models really capable of "PhD-level" reasoning? To answer this question, we introduce FormulaOne, a new reasoning benchmark of expert-level Dynamic Programming problems. The benchmark has three tiers of increasing complexity, which we call "shallow", "deeper", and "deepest".
The results are remarkable:
- On the "shallow" tier, top models reach a success rate of 50%-70%, indicating that the models are familiar with the subject matter.
- On "deeper", Grok 4, Gemini-Pro, o3-Pro, and Opus-4 each solve at most 1/100 problems. GPT-5 Pro is significantly better, but still solves only 4/100 problems.
- On "deepest", all models collapse to a 0% success rate.
🧵