Can AI agents do real science, i.e., open-ended reasoning, not just pipeline execution?
Interesting bioRxiv piece from Eli Van Allen's group at Dana-Farber/Broad. They ran a Claude Opus 4.6 agent on cancer multiomic data to find out. TLDR seems to be: neither thinking longer nor scaling up buys you the long tail. In more detail:
- Turns out the agent was strong at calling abundant cell types (82.2% correct) and much weaker on rarer ones (43.8% pass). Cell-type calls tracked the density of training evidence, not biological importance.
- Same shape on hypothesis ranking. It beat chance (30% top-1 vs 11%) but over-ranked fashionable themes that flood the literature (EMT/stromal, immune) and under-ranked metabolism and neuronal programs. It was least reliable on the under-documented biology most likely to be novel.
- The authors have a clear view on the fix: not bigger models. Targeted training on underrepresented biology. Scaling does not buy you the long tail.
- Surprisingly, more reasoning steps did not mean better answers. Scrutiny depth tracked ambiguity, not correctness. Effort was a symptom of difficulty, not a driver of accuracy.
- The copilot arc also surprised me. Fully autonomous runs ranked highest in blinded expert review and read as more novel. Constant human intervention introduced a conservative bias that recapitulated known biology!! But autonomous quality fell as tasks got harder, and experts won on the hardest reasoning.
Hence their hybrid model: autonomous exploration first, human judgment for interpretation.
In our latest, we tested elements of cancer biology research using @AnthropicAI AI agents, with varying amounts of human involvement across multi-step, multi-omic analyses
Interesting times for agentic AI & biological discovery, and for the future of (cancer) biology research...