Just updated our paper on on-device LLMs for clinical decision support.
Paper: arxiv.org/abs/2601.03266
Here's why I think this matters:
We've been asking the wrong question. The debate around LLMs in medicine has been "how accurate are they?", but the harder problem is deployment. Patient data can't leave the hospital. Most clinics don't have the bandwidth or budget for cloud inference at scale. The real question is: can a model that runs locally, on modest hardware, actually be trusted for clinical decisions?
After benchmarking 188 models across general disease diagnosis, ophthalmology, and clinical judgment simulation — the answer is yes.
Gemma 4 31B (@googlegemma) hits 86.5% on general diagnosis, beats GPT-5-mini, scores 100% on uroradiology and breast imaging, and runs at 18 GB. Qwen3.5-27B (@Alibaba_Qwen) at 16 GB matches DeepSeek-R1 at 671B: that is one-twenty-third the memory with the same clinical accuracy. Fine-tune Qwen3.5-35B with domain-specific reasoning traces and it reaches 87.9%, approaching GPT-5.1 (89.4%). No extra memory. No cloud call. No PHI leaving the building.
One thing that surprised me: 87.2% of errors across all models were clinically plausible differentials. The model picked a reasonable diagnosis, just not the right one. Above ~31B parameters, hallucination rate drops to zero. Errors start looking like the kind a careful clinician makes on a hard case, not the kind that would make you distrust the system.
There's also a pass@3 upper bound of 93.2% for fine-tuned Qwen3.5-35B. The model already "knows" the right answer in most cases; it just doesn't always surface it first. Picking the right candidate is a verifier problem, not a model-size problem.
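For anyone unfamiliar with pass@k, here is a minimal sketch of the idea, not the paper's actual evaluation code. It assumes pass@k means "at least one of k sampled answers matches the reference diagnosis" and uses exact string matching; the function name `pass_at_k` and the toy cases are made up for illustration.

```python
from typing import Sequence


def pass_at_k(samples: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of cases where any of the first k sampled answers equals the reference.

    Assumption: exact string match stands in for whatever answer-matching
    the real benchmark uses.
    """
    if len(samples) != len(gold):
        raise ValueError("samples and gold must have the same length")
    if not gold:
        raise ValueError("need at least one case")
    hits = sum(1 for answers, truth in zip(samples, gold) if truth in answers[:k])
    return hits / len(gold)


# Hypothetical toy data: three sampled diagnoses per case.
samples = [
    ["pneumonia", "bronchitis", "pneumonia"],
    ["migraine", "tension headache", "cluster headache"],
    ["gout", "septic arthritis", "pseudogout"],
    ["appendicitis", "appendicitis", "appendicitis"],
]
gold = ["pneumonia", "cluster headache", "cellulitis", "appendicitis"]

pass_1 = pass_at_k(samples, gold, k=1)  # only the first answer counts
pass_3 = pass_at_k(samples, gold, k=3)  # any of the three answers counts
print(f"pass@1 = {pass_1:.2f}, pass@3 = {pass_3:.2f}")
```

The gap between the two numbers is what a verifier could recover in principle: the right answer is among the samples, and something has to select it.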
Gemma 4 and Qwen3.5 are the first generation where the local deployment story actually holds up under rigorous clinical benchmarking. That's a real milestone.
Huge shoutout to the team who made this happen: Alif Munim (@alifmunim), Omar Ibrahim, Alhusain Abdalla, Jun Ma (@JunMa_AI4Health) (all equal contributors), Meng Wei, Shuolin Yin, and Leo Chen from @UHN AI hub. Proud of what this group built 🔥🔥