Can AI agents perform biomedical data analysis tasks behind papers in Nature, Cell, and Science?
To find out, we built BiomniBench, a benchmark co-developed with the original paper authors and domain experts with 5 years of experience. It grades AI agents the way a peer reviewer reads a paper: scrutinizing methods, reasoning, and every analytical choice, not just the final answer.
As the first track of this benchmark, BiomniBench-DataAnalysis contains 100 data-analysis tasks drawn directly from 21 published studies in Nature, Cell, Science, Nature Medicine, and other leading journals. Each task hands the agent a real dataset and a research question, then scores its full analytical trajectory against an expert-authored rubric.
What's inside:
- 100 tasks across 5 disease areas (oncology, immunology, neurology, metabolic & endocrine, cardiovascular) plus general biology
- 17 analytical task types (e.g., GWAS/eQTL colocalization, T-cell receptor repertoire analysis, cell-cell communication)
- An expert-curated rubric for every task, scoring 6 dimensions of analytical quality
- Process-level evaluation of 9 frontier LLMs (GPT-5.5, Claude Opus 4.7, among others) across 4 agent harnesses (Claude Code, Codex CLI, Terminus-2, Gemini CLI)
Headline results:
- Frontier models lead at 73.3/100, with substantial headroom to improve.
- The agent harness matters as much as the base model.
- Agents fall short on biological interpretation, method selection, and scientific reasoning.
We hope to make BiomniBench the most helpful benchmark for biologists to understand how AI agents handle real-world biomedical tasks: where they can be trusted, and where they fall short. We're actively expanding our evaluation effort, and would love to engage the broader scientific community on what comes next.
Paper: biorxiv.org/content/10.64898…
🤗 Dataset: huggingface.co/datasets/phyl…
Thanks to our amazing @phylo_bio team (Minta Lu, @TuXinming, @serena2z, @TianweiShe, @lecong, @jure, @KexinHuang5) and our collaborators at @LaudeInstitute, @Stanford, @Harvard, @PKU1898, @virginia_tech, Humanlaya Data Lab, Xbench: @alexgshaw, JOU-HO SHIH, Bingqing Zhao, Minjie Shen, Haochen Yang, Jielin Yan, Rongchuan Zhang, Xinze Wu, Tingting Li, Xiaobo Hu, Yuan Jiang, Jiayun Dong, Tao Peng.