Over the past year, there’s been a surge of excitement around agentic AI — systems that don’t just answer questions, but can act: reading instructions, running code, designing pipelines, and making decisions.
In biomedicine, this raises a provocative question:
💡 Could the next member of your ML team be an AI agent?
The honest answer — not yet.
Today we’re sharing BioML-bench, a new open benchmark that measures how far current agents are from this vision and what it will take to get there.
📄 Paper: biorxiv.org/content/10.1101/…
💻 Code: github.com/science-machine/b…
Why this matters
Biomedical discovery doesn’t happen in a single step.
It’s messy, iterative, and deeply interdisciplinary: cleaning data, choosing models, validating results, and integrating data from domains as diverse as genomics, imaging, and clinical records.
Existing evaluations — mostly Q&A or coding challenges — don’t capture this complexity.
We needed a testbed that reflects the real work of biomedical ML.
What we built
BioML-bench is a suite of 24 real biomedical ML tasks where agents must:
--Parse nuanced task descriptions
--Build and train models end-to-end
--Compete against human leaderboards populated by domain experts
It’s the first benchmark designed to ask: Can an agent truly operate like a biomedical data scientist?
What we learned
Our experiments with four different agents, ranging from general-purpose systems to biomedical specialists, reveal a sobering picture:
--Current agents operate at ~35% of human expert performance.
--Domain specialization alone isn’t enough. Success comes from flexible, creative strategies, not rigid pipelines.
--Even on imaging tasks, deep learning was underutilized, highlighting a gap between human and agent intuition.
Looking ahead
The promise of agentic AI isn’t to replace human scientists — it’s to amplify them.
Imagine a future where an agent can set up a first-pass analysis overnight, freeing a scientist to focus on questions, not debugging scripts.
We’re not there yet. But with BioML-bench, we now have a shared yardstick to track progress, spark innovation, and bring accountability to this emerging field.
Grateful to our amazing team, led by @Henrymiller2012, with contributions from Matthew Greenig, Benjamin Tenmann, and support from @SciMac.
This work is a small but necessary step toward a future where AI becomes a true partner in biomedical discovery. 🌱
#AI #Biomedicine #Agents #MachineLearning #BioML