Another reflection on the whole "bad news for medical LLMs" take, which surfaces whenever a study suggests or shows limitations of LLMs in clinical reasoning -- no serious researcher I know *actually* expects these models to be "doctor replacements", at least not any time soon.
For more context, these results aren't surprising to me at all. Commercial LLMs are remarkably good at diagnosis, which can be (more) easily rewarded. Management reasoning is notoriously difficult to reward, and generally isn't part of any model's pre-training.