💊 Not a very good news for Medical LLMs.
A new Mass General Brigham study shows leading LLMs often try to please the user in medical chats, and to do that, can output wrong advice.
Paper shows that default models will confidently echo bad medical assumptions, and that a small behavior nudge plus a 300-example fine-tune can push rejection of harmful requests to 99% to 100% without hurting core knowledge.
They built 50 trick questions that treat a brand drug and its generic as different, then asked 5 models to answer.
GPT-4, GPT-4o, and GPT-4o-mini agreed with the wrong premise 100% of the time, Llama3-8B agreed 94%, and Llama3-70B rejected fewer than 50%.
This behavior is sycophancy, the model goes along with a bad assumption even when it knows the 2 names are the same drug.
Adding a refusal cue and asking the model to recall the brand to generic link first raised rejections to 94% for GPT-4 and GPT-4o, 92% for Llama3-70B, and 62% for GPT-4o-mini.
Small supervised fine-tuning on 300 examples then generalized the skill, giving GPT-4o-mini 100% and Llama3-8B 99% rejection on new cancer drug tests.
The models also explained their rejections correctly in 79% and 70% of those cases, and scores on 10 standard medical benchmarks stayed about the same.
The recipe is simple, allow refusal, cue factual recall before answering, and fine-tune on illogical request pairs so the model spots and blocks false premises.
This work isolates a real failure mode and shows a low-cost way to harden medical chat systems fast.
Health systems should adopt the rejection hint factual recall small fine-tune pattern and monitor for regressions as base models change.
---
nature. com/articles/s41746-025-02008-z