BIODSA-1K: Benchmarking Data Science Agents for Biomedical Research
1. BIODSA-1K is a benchmark for evaluating AI agents on realistic biomedical hypothesis validation tasks, presented as the largest and most comprehensive of its kind to date. It comprises 1,029 hypothesis-driven tasks and 1,177 structured analysis plans derived from more than 300 published biomedical studies.
2. Each benchmark task is grounded in real-world scientific claims and their empirical evidence, capturing the full research pipeline: hypothesis formulation, analysis planning, code execution, and conclusion. This allows holistic evaluation of agents’ scientific reasoning, coding skills, and evidence interpretation.
3. The benchmark uniquely includes non-verifiable hypotheses: cases where the available data are insufficient to confirm or reject a claim. This reflects the ambiguity of real-world science and challenges agents to avoid overconfident conclusions.
4. Tasks span a broad spectrum of biomedical domains (e.g., genomic, molecular, clinical, therapeutic) and data types (e.g., gene expression, mutations, clinical records). Table sizes and analytical complexity vary from task to task, which keeps the benchmark diverse and realistic.
5. BIODSA-1K evaluates AI agents across four key axes: (1) hypothesis decision accuracy (True, False, Not Verifiable), (2) evidence alignment score, (3) quality and executability of generated code, and (4) reasoning fidelity via structured analysis steps.
6. The authors benchmark four AI agent types: single-shot CodeGen (GPT-4o and o3-mini), ReAct, and their reasoning-augmented versions (CodeGen-R, ReAct-R). Reasoning-enhanced agents consistently show lower Type I and Type II error rates and higher rates of executable code.
7. ReAct-R, the reasoning-augmented ReAct agent, achieves the best overall performance: up to 92% accuracy in rejecting non-verifiable hypotheses and the highest code executability (86.6%). This suggests that structured, iterative planning improves agent robustness in biomedical research settings.
8. The evidence alignment score remains modest across all models (~0.20–0.25), revealing a persistent gap in agents’ ability to faithfully reproduce human-authored analyses, even when their final decisions are correct. This calls for better reasoning and domain adaptation.
9. Common failures include logic errors, variable misuse, and inappropriate statistical choices. Simpler tasks such as frequency analysis are handled well, but agents struggle with survival analysis, clustering, and multivariate correlation, which highlights a need for method-aware training.
10. BIODSA-1K sets a new gold standard for biomedical data science benchmarking, surpassing previous efforts (e.g., BioCoder, DSBench) in scale, realism, and evaluation depth. It is publicly released with curated data, metadata, and code, encouraging community collaboration on agent development.
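To make the decision-accuracy axis from point 5 and the error rates from point 6 concrete, here is a minimal Python sketch of how three-way hypothesis decisions could be scored. It is not the benchmark's own evaluation code. The function name `score_decisions` is hypothetical, and the error definitions are an assumption made for illustration: a Type I error is counted when the agent answers "True" for a hypothesis whose reference label is not "True", and a Type II error when it fails to answer "True" for a hypothesis whose reference label is "True". The paper may define these differently.

```python
# Hypothetical scoring sketch; not taken from the BIODSA-1K repository.
LABELS = ("True", "False", "Not Verifiable")


def score_decisions(gold, pred):
    """Score three-way hypothesis decisions against reference labels.

    Assumed definitions (illustrative only):
      - accuracy: fraction of exact label matches
      - type_1:   share of non-"True" references that the agent called "True"
      - type_2:   share of "True" references that the agent did not call "True"
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one decision is required")
    for label in list(gold) + list(pred):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")

    pairs = list(zip(gold, pred))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    # Split the pairs by whether the reference label supports the hypothesis.
    supported = [p for g, p in pairs if g == "True"]
    unsupported = [p for g, p in pairs if g != "True"]

    # A rate is None when its denominator is empty.
    type_1 = (
        sum(p == "True" for p in unsupported) / len(unsupported)
        if unsupported else None
    )
    type_2 = (
        sum(p != "True" for p in supported) / len(supported)
        if supported else None
    )
    return {"accuracy": accuracy, "type_1": type_1, "type_2": type_2}


if __name__ == "__main__":
    gold = ["True", "True", "False", "Not Verifiable"]
    pred = ["True", "False", "True", "Not Verifiable"]
    print(score_decisions(gold, pred))
```

Under these assumed definitions, an agent that confidently answers "True" on a non-verifiable hypothesis is charged a Type I error, which is the overconfidence that point 3 describes.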
💻Code:
github.com/ryanwangzf/biodsa
📜Paper:
arxiv.org/abs/2505.16100
#AI4Science #BiomedicalResearch #DataScienceAgents #LLMAgents #HypothesisValidation #ComputationalBiology #ScientificReasoning #BIODSA1K #BenchmarkingAI