Despite boasting impressive performance across a range of categories, the latest frontier LLMs (Gemini 3 Pro, Claude Opus 4.5, and GPT-5.2) still struggle to balance accuracy and safety on CARDBiomedBench, our biomedical QA benchmark 👀
Frontier models are moving fast, but are they getting better at biomedical research?
We just ran a fresh benchmark update using CARDBiomedBench, our evaluation suite for genetics, disease associations, and drug discovery QA. Instead of looking only at “did it answer?”