This is one of the more important AI-in-medicine papers I've read in a while (and I read the whole paper), and the result is uncomfortable for anyone betting on specialized clinical tools. A group at NYU Langone put OpenEvidence and UpToDate Expert AI head to head against GPT-5.2, Gemini 3.1, and Claude Opus 4.6. The frontier general-purpose models won. Every benchmark, every dimension.
(Disclosure: I wrote this post, then used AI to clean it up and make it read better. The ideas are mine.)
The part that should get your attention: they included the free Google AI Overview as a control, and it scored as well as or better than both clinical tools on real physician queries. The specialized answer engines performed about the same as the summary that pops up when you Google a question.
Now, credit where due. Building a HIPAA-compliant evaluation off 100 real clinical queries with 12 blinded physician raters is a serious undertaking, and they were honest about their own limits, which I appreciate. They flag that HealthBench is an OpenAI benchmark, and that GPT both won it and sat on the grading panel, so they tell you to discount it and treat it as supplementary. That's the right call.
But here's where I'd put the asterisk. The frontier models were queried through APIs (computer-to-computer interfaces). The clinical tools were queried manually through their browser interfaces (the way you'd use them), with all the hidden prompts and formatting that come with that. OpenEvidence's weakest score was clarity, not correctness. So is the underlying model actually worse, or is the wrapper just presenting answers in a way clinicians liked less? The study can't separate those, and that distinction matters a lot for what we conclude.
One thing worth noting: OpenEvidence isn't something clinicians pay for. It's free for verified docs, same as Doximity's tools, though some health systems pay for BAAs (business associate agreements). So this isn't really a "you're wasting subscription dollars" story. It's a "the specialized layer doesn't appear to add accuracy over a frontier model" story, which is the more interesting claim anyway.
What I'm comfortable saying: the specialized clinical wrapper doesn't appear to produce better answers than a frontier model, and might cost you on clarity. What I'm not ready to say: that domain-specific tuning is inferior as an approach. The authors don't claim that either.
This is a snapshot of a field that's moving fast, and the frontier models have already been updated since the study was done. But if you're a health system making decisions here, it's worth a close read.