Competition here is healthy. Credit first: the study is independent, the code is public, and the main benchmark uses real physician questions rated by blinded clinicians. That part is strong.
Three issues.
1️⃣ (Others have flagged this.) The two sides were tested differently. Frontier models: API, temperature zero, search on. Clinical tools: queried by hand in their browsers, with hidden prompts and retrieval the authors could not control. The authors say so. And search gave the frontier models retrieval, the thing clinical tools are built on. So this is not specialized vs general. It is two systems with retrieval, reached through different doors.
2️⃣ The weight rests on one benchmark, not three. MedQA and HealthBench carry contamination risk. HealthBench was built by OpenAI, its model won it, and answers were graded by the same three models being scored. The authors call the clinician-rated benchmark primary and HealthBench supplementary. On MedQA, Claude tied both clinical tools.
3️⃣ "Outperform" means quality, not safety. No model was safer than another. The highest harmful response rate was a frontier model. The floor the clinical tools missed was a free search feature, not only purpose-built tools.