Is OpenEvidence already obsolete? A study in Nature found that ChatGPT, Claude, and Gemini all outperformed OpenEvidence (OE) and UpToDate AI on medical benchmarks and real-world physician queries. My 6 thoughts…
This week researchers from NYU Langone published a head-to-head evaluation of OE and UpToDate AI against GPT, Gemini and Claude. Here's what we know:
→ Frontier LLMs outperformed clinical AI tools across ALL THREE evaluations - medical knowledge (MedQA), expert clinician alignment (HealthBench) and real-world physician queries (RCQ)
→ The RCQ benchmark is the most clinically meaningful part: 100 actual queries submitted by physicians during routine care, scored by 12 blinded clinicians on clinical correctness, completeness, safety, and clarity. Yes, OE and UpToDate (UTD) scored worse than the frontier LLMs on these too!
→ Physician reviewers could annotate errors (e.g. factual errors, hallucinations) on any low-scoring response. Gemini had 8, GPT had 21, and Claude had 19.
→ OpenEvidence had the most errors at 52 - mostly incomplete clinical content, safety-critical omissions, and disorganization.
→ UpToDate Expert AI refused to answer 19% of real queries entirely - by far the highest refusal rate
Alright, my 6 thoughts:
1/ Each of the 100 real-world question/model answers was scored by only 3 of the 12 physicians, so it's not a substantial sample size, but it does make one think. I'm not surprised: I often run clinical questions against OE, Doximity, and Gemini, and I often find Gemini just as good, if not better.
2/ I wasn't surprised that OE and UTD were rated inferior in some ways. I too find that OE often presents info in a disorganized way, which is why I've found Doximity more user-friendly in general (so I'm disappointed it wasn't included).
3/ I've run into the same problems with UTD: because it uses only curated clinician content as its knowledge base, it's more likely to give incomplete responses or refuse to answer. While I respect that this is for safety reasons, I find it frustrating when I can't count on getting an answer every time. Also, curated-only knowledge bases are consistently out of date.
4/ We need a larger variety of physicians and sample size of real-world queries to truly compare. Were these all primary care questions? How would the head-to-head go for specialty-based questions?
5/ The study did NOT assess citation quality or retrieval of the latest evidence/publications. Frontier LLMs would likely fare much worse here because they tend to hallucinate references and, unlike OE, lack access to NEJM, JAMA, etc. Consumers may not notice this, but clinicians do care.
6/ I predict these findings will NOT affect physician or health system adoption of OE, UTD, etc. For safety/liability reasons and access to the latest evidence, purpose-built clinical decision support (CDS) AI tools will remain the most used. Who is more likely to lose in a lawsuit - a doctor who used OE, which licenses content from NEJM and provides a BAA, or a doctor who used Gemini and cites this Nature study as justification?