A study published today in Science may be the most important AI paper in clinical medicine this year. And it happened to land on the same day I submitted a letter to JAMA arguing that AI can already deliver clinically adequate care for defined tasks.
Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center ran six experiments pitting OpenAI's o1 reasoning model against hundreds of physicians across the full spectrum of clinical reasoning: differential diagnosis, management planning, probabilistic reasoning, and clinical documentation. Then they did something most AI studies don't. They tested it on 76 real, unstructured emergency department cases pulled directly from the medical record at a major academic medical center.
The results across all six experiments: the AI outperformed physicians.
On the real ER cases, the messiest and most clinically relevant test, the AI identified the correct or a very close diagnosis in 67.1% of cases at initial triage, 72.4% at ER physician evaluation, and 81.6% at hospital admission. The two attending physicians scored 55.3% and 50.0% at triage, 61.8% and 52.6% at ER evaluation, and 78.9% and 69.7% at admission. The AI led both attendings by double digits at initial triage, and its lead had narrowed by admission.
On management reasoning using expert-scored clinical vignettes, the AI scored a median of 89%. Physicians with conventional resources scored 34%. That is not a typo.
The physician evaluators were blinded and could not distinguish AI-generated differentials from human ones. One evaluator guessed correctly 15% of the time. The other guessed correctly 3% of the time.
I'm an emergency physician. I work in a rural Texas ED. These are my cases. These are my decision points. And I can tell you that the triage finding is the one that matters most. Triage is where the least information meets the highest stakes — where the wrong call means a patient sits in the waiting room while their sepsis progresses or their STEMI evolves. The AI was 12 to 17 percentage points better than experienced attendings at exactly that moment.
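For anyone who wants to check the 12 to 17 point figure, here is a minimal sketch of the arithmetic in Python. It uses only the percentages quoted in this post. The variable and function names are my own, and the stage labels are just shorthand, not anything from the paper.

```python
# Diagnostic accuracy (%) at each stage, copied from the figures quoted in this post.
# Names below (ai, physicians, gaps) are illustrative, not from the study.
ai = {"triage": 67.1, "er_evaluation": 72.4, "admission": 81.6}
physicians = {
    "triage": (55.3, 50.0),
    "er_evaluation": (61.8, 52.6),
    "admission": (78.9, 69.7),
}


def gaps(stage):
    """Return (smallest, largest) AI lead over the two attendings, in percentage points."""
    leads = [round(ai[stage] - p, 1) for p in physicians[stage]]
    return min(leads), max(leads)


for stage in ai:
    low, high = gaps(stage)
    print(f"{stage}: AI ahead by {low} to {high} percentage points")
```

At triage this works out to about 11.8 and 17.1 points, which is where the rounded "12 to 17" comes from. These are differences in percentage points on 76 cases, not relative improvements.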
The authors are careful to note that this is text-based reasoning only: the AI doesn't see the patient's distress, doesn't hear breath sounds, doesn't read the room. Those are real limitations today. But the cognitive reasoning component of emergency medicine, pattern recognition under uncertainty with incomplete data, is precisely what this model has now demonstrated it can do.
This was published in Science. Not a preprint. Not a company blog post. Peer-reviewed, in one of the two most prestigious scientific journals in the world.
The profession needs to stop debating whether AI will be good enough. It needs to start planning for the fact that, for an expanding set of clinical reasoning tasks, it already is.
And yes, this was written with AI. Sorry!!