Sunday fun, thanks to OpenEvidence and Nature Medicine!
via: Gilles Frydman
@gfry
Nature Medicine published the full peer review file for the paper everyone is fighting about. Both sides are quoting from it selectively. Almost no one is reading the whole thing, which is a shame, because read in full it refuses to hand either side a clean win.
Start with what it shows about how the paper improved. The submitted version had no human evaluation at all. Its conclusion rested entirely on two public benchmarks the reviewers immediately flagged as contaminated and circular: HealthBench was built by OpenAI, graded by an OpenAI model, ranking an OpenAI model. One reviewer wrote that the absence of any blinded physician evaluation severely weakened confidence. So the authors built one. The real-query benchmark with twelve blinded clinicians, the strongest part of the published study and the part even its critics respect, exists because peer review demanded it.
The same pattern runs down the list. Reviewers flagged solo-model grading; the authors moved to a three-model panel. They flagged the missing safety analysis; it was added. The browser-versus-API asymmetry, the contamination risk, the marketing-sourced adoption numbers, the OpenAI overlap: most of it landed in the final paper as plainly stated limitations. A reviewer even raised the point I keep making, that the result may capture one steep moment on the frontier curve rather than a settled order, and it sits in the discussion now. This is revision working the way the textbook promises.
Here is the part the defenders skip. Peer review made the paper honest about its limits. It could not make the headline match them. The title still says "outperform." The abstract still generalizes past the tested conditions. The verdict still rests on the one arm no outsider can inspect, the arm added late, the arm a reviewer said should be primary precisely because the benchmark arms were too compromised to carry the claim. That same reviewer warned at the outset that the deepest problems could not be fixed by incremental revision. The paper was then fixed by incremental revision.
Call that gap what it is. Not a scandal, not a failure of review, but the part of the system review cannot reach. Reviewers edit the manuscript. The world reads the title. A careful interior and a quotable exterior live in the same paper, and the interior is where the truth gets qualified while the exterior is what travels.
So read the file before you pick a team. It shows a paper that got meaningfully better and a claim that outran what the process could certify, both at once.
And it shows one more thing, by absence. Across every reviewer comment and every round of revision, no one asked the question that should end any argument about clinical AI: did a patient get better? The reviewers improved how the answer was measured. The patient was never in the room to begin with.
Specialist clinical AI tools are being outperformed by general-purpose models on medical benchmarks. That's the finding worth sitting with.
A 1,000-item benchmark mixing medical knowledge and clinician-alignment tasks put GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5 against OpenEvidence and UpToDate's Expert AI. Generalist models won consistently. GPT-5 came out on top.
This isn't a straightforward win for generalist AI. It raises an uncomfortable question about whether clinical tools are being held to a rigorous enough standard before deployment.
What does it mean for the market if purpose-built clinical AI can't keep pace with models never designed for medicine?