I implemented zk-autoresearch, an adaptation of Karpathy's autoresearch loop, and ran it on Plonky3, a production ZK prover.
Soundness review by a Plonky3 engineer is pending before I treat these results as final. The methodology finding is already clear; preliminary results are below.
Target: Plonky3's NTT implementation — the inner loop of proof generation, already heavily optimized by expert ZK engineers. If the approach doesn't work here, it doesn't work anywhere.
Hardware: Hetzner CCX33, AMD EPYC, AVX512, 8 cores.
Model: I used Claude Sonnet 4.6 deliberately. Opus would have offered marginal gains at a significantly higher cost per iteration. For a loop that may run hundreds of times in future experiments, that tradeoff matters.
74 iterations. Fully autonomous by design, but in this first experiment 2 adjustments were made to the setup (at iterations 5 & 10) to nudge the agents to be more decisive.
- Raised MAX_TOKENS from 8192 to 20000 and added the instruction "you must always make a change", because the agent kept hitting the token limit. This unlocked improvements at iterations 6 and 9.
- Added near-miss display in the history prompt, showing reverted experiments within 1.5% as combination candidates. This set up iteration 21, where the agent revisited a failed idea that now worked because the surrounding code had changed.
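The near-miss rule can be sketched as a three-way classification of each benchmarked experiment. This is a minimal illustration, not the actual zk-autoresearch code: the `Outcome` type, the `classify` function, and the choice to measure the 1.5% against the current best time are all assumptions made for the example.

```rust
/// Outcome of one benchmarked experiment relative to the current best time.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Faster than the current best: keep the change.
    Improvement,
    /// Not faster, but within the near-miss threshold: revert the change,
    /// and list it in the history prompt as a combination candidate.
    NearMiss,
    /// Slower by more than the threshold: revert the change.
    Regression,
}

/// Near-miss threshold of 1.5%, assumed here to be relative to the best time.
const NEAR_MISS_FRACTION: f64 = 0.015;

/// Classifies a candidate benchmark time against the current best time.
/// Both times are in the same unit (for example nanoseconds).
fn classify(best_time: f64, candidate_time: f64) -> Outcome {
    if candidate_time < best_time {
        Outcome::Improvement
    } else if candidate_time <= best_time * (1.0 + NEAR_MISS_FRACTION) {
        Outcome::NearMiss
    } else {
        Outcome::Regression
    }
}
```

A real loop would probably also need a noise margin before accepting an improvement, since benchmark times vary from run to run; that is left out here to keep the rule visible.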
Iteration constraints:
- Each iteration ran correctness tests to prevent faulty proofs. Note: during the run these were compile-level checks; post-run correctness was confirmed via full end-to-end ZK proof generation and verification with Radix2DitParallel on BabyBear (10 tests, all passing).
- Agents were structurally prevented from touching FRI or other soundness-critical components — only dft/src/ and baby-bear/src/ were writable.
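One way to enforce a write restriction like this is a path allowlist checked before any agent edit is applied. The sketch below is an assumption about how such a guard could look, not the guard used in the experiment; `is_writable` and `WRITABLE_PREFIXES` are hypothetical names, and only the two directory names come from the setup described above.

```rust
use std::path::{Component, Path};

/// The only writable directories, relative to the repository root.
const WRITABLE_PREFIXES: [&str; 2] = ["dft/src", "baby-bear/src"];

/// Returns true if `rel_path` (relative to the repo root) lies inside one of
/// the writable directories.
fn is_writable(rel_path: &str) -> bool {
    let path = Path::new(rel_path);
    // Reject absolute paths and any `..` component, so a path such as
    // `dft/src/../../fri/src/x.rs` cannot escape the allowlist.
    let escapes = path.components().any(|c| {
        matches!(
            c,
            Component::ParentDir | Component::RootDir | Component::Prefix(_)
        )
    });
    if escapes {
        return false;
    }
    // `Path::starts_with` compares whole components, so `dft/src2/...`
    // does not match the `dft/src` prefix.
    WRITABLE_PREFIXES
        .iter()
        .any(|prefix| path.starts_with(prefix))
}
```

The check is purely lexical. A stricter version would also resolve symlinks before comparing, which this sketch does not attempt.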
Result: 3% faster at the target size (2^20) during the experiment. Post-experiment benchmarks across 2^14 to 2^22 showed the optimizations generalized better than expected, particularly at the extremes (see image). The agent only optimized for 2^20.
The known issues (a short history window causing agent amnesia, wasted tokens on repo exploration, a correctness test targeting the wrong package) meant the last improvement was found at iteration 21. Round 2 with these fixed should yield a more consistent staircase pattern over 100 iterations.
All gains came from the agent finding redundant work in the hot butterfly loop: precomputing products, hoisting broadcasts, skipping multiplications by 1. Pure implementation-level work, no algorithmic changes.
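To make "skipping multiplications by 1" concrete, here is a toy version of one butterfly layer over the BabyBear field. It is a simplified sketch, not Plonky3's code and not one of the agent's diffs: it uses scalar `u64` arithmetic with plain modular reduction instead of Montgomery form and AVX512, and the function names are made up for the example. It relies on the first twiddle of each block being 1.

```rust
/// BabyBear prime, p = 2^31 - 2^27 + 1. All inputs are assumed to be < P.
const P: u64 = 2_013_265_921;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P } // a * b < 2^62, no overflow

/// Baseline: one butterfly layer over blocks of size 2 * twiddles.len().
/// Every pair pays for a multiplication, including the one with twiddle 1.
fn layer_naive(values: &mut [u64], twiddles: &[u64]) {
    assert!(!twiddles.is_empty());
    let half = twiddles.len();
    for block in values.chunks_exact_mut(2 * half) {
        let (lo, hi) = block.split_at_mut(half);
        for i in 0..half {
            let t = mul(hi[i], twiddles[i]);
            let a = lo[i];
            lo[i] = add(a, t);
            hi[i] = sub(a, t);
        }
    }
}

/// Same layer, but index 0 is peeled out of the loop: its twiddle is 1,
/// so the product equals hi[0] and the multiplication is skipped.
fn layer_skip_one(values: &mut [u64], twiddles: &[u64]) {
    assert!(!twiddles.is_empty());
    debug_assert_eq!(twiddles[0], 1);
    let half = twiddles.len();
    for block in values.chunks_exact_mut(2 * half) {
        let (lo, hi) = block.split_at_mut(half);
        // i = 0: multiplication by 1 is the identity.
        let (a, t) = (lo[0], hi[0]);
        lo[0] = add(a, t);
        hi[0] = sub(a, t);
        // Remaining pairs use the general butterfly.
        for i in 1..half {
            let t = mul(hi[i], twiddles[i]);
            let a = lo[i];
            lo[i] = add(a, t);
            hi[i] = sub(a, t);
        }
    }
}
```

Both functions produce identical outputs for any twiddle table that starts with 1, which is the sense in which such a change is a mathematically equivalent transformation. Whether it is actually faster depends on the real field representation and vectorization, so it has to be benchmarked.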
6 improvements in 74 iterations. 57 regressions. The full experiment log with every diff, benchmark result, and agent reasoning is auditable.
The agent that found these improvements is not a ZK expert. It reasoned about Rust and Montgomery arithmetic from first principles and found real optimizations in code already written by expert engineers.
ZK has been underexplored for agentic optimization because people worry about agents breaking proof soundness. The concern is real but misapplied here: all 6 changes are mathematically equivalent transformations, verified by end-to-end proof generation and verification. (Soundness review by a Plonky3 engineer is pending.)
Round 2 is being prepared with the known issues from Round 1 fixed. Full findings and code will be open sourced after it completes.
If you're on a ZK team and want to run this, feel free to DM me.
Inspired by @karpathy's autoresearch pattern. First known application to a production ZK prover.