BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedicine
1. BioHopR is a new benchmark designed specifically to evaluate the multi-hop, multi-answer reasoning capabilities of LLMs in the biomedical domain. It captures real-world complexity by focusing on one-to-many and many-to-many relationships, such as drug–disease–protein connections.
2. Unlike previous biomedical QA datasets, which mainly target single-answer or template-based reasoning, BioHopR supports both 1-hop and 2-hop reasoning tasks built from structured knowledge in the PrimeKG graph. This allows assessment of stepwise reasoning with multiple valid answers.
3. The benchmark contains 2,494 1-hop and 7,633 2-hop questions, totaling over 279,000 answers. Each query averages over 36 answers, emphasizing the challenge of handling exhaustive, many-to-many mappings in biomedical data.
4. A core innovation is the one-to-many-to-many design: for example, a single drug (query) can be linked to multiple proteins (bridge), each of which can in turn be linked to multiple diseases (target), reflecting real clinical reasoning scenarios.
5. BioHopR measures precision by comparing predictions to ground-truth answers via cosine similarity over BioLORD-2023-C embeddings, with a strict threshold (0.9) so that only high-confidence matches count as correct.
6. State-of-the-art proprietary models, especially O3-mini and GPT4O, outperform open-source biomedical models in both 1-hop and 2-hop tasks. O3-mini achieves the highest 1-hop precision (37.93%) and ties with GPT4O in 2-hop precision (14.57%).
7. Open-source biomedical models, such as HuatuoGPT-o1 and UltraMedical-8B, exhibit severe difficulties in multi-hop tasks. HuatuoGPT-o1-70B fails almost entirely (Prec_HOP2: 0.00%), revealing a gap between intended capability and actual performance.
8. All models suffer a sharp performance drop when moving from 1-hop to 2-hop tasks, demonstrating that inferring intermediate bridge entities remains a major bottleneck in biomedical reasoning.
9. Qualitative case studies (e.g., on diabetes and schizophrenia) show that proprietary models like GPT4O can produce medically justifiable answers even when they don't align perfectly with ground truth, while open-source models often fail to follow task constraints.
10. BioHopR also reveals prompt sensitivity: single-answer prompting significantly outperforms multi-answer prompting (e.g., GPT4O's 1-hop precision drops from 32.88% to 8.09% when prompted for multiple answers).
11. Some relation types (e.g., Phenotype:Disease:Drug) show slightly better performance under multi-answer prompting, suggesting structured queries with clear entity roles might be more model-friendly.
12. By benchmarking models on complex, real-world multi-step biomedical questions, BioHopR establishes a much-needed evaluation standard and exposes critical gaps in both reasoning depth and answer completeness.
13. Despite limitations like reliance on PrimeKG and coverage of only four entity types (Drug, Disease, Protein, Phenotype), BioHopR sets the groundwork for building more robust, interpretable LLMs for biomedical applications.
14. The authors emphasize that BioHopR is for research use only and not suited for clinical deployment without rigorous validation, mitigating risks of misuse in high-stakes settings.
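To make points 4 and 5 concrete, here is a minimal Python sketch of a one-to-many-to-many lookup followed by thresholded cosine-similarity scoring. Everything in it is an illustrative assumption rather than the authors' code: the entity names are placeholders (not PrimeKG entries), the 2-D vectors stand in for BioLORD-2023-C embeddings, and the rule "a prediction is correct if its similarity to any gold answer is at least 0.9" is one plausible reading of the metric.

```python
import math

# Hypothetical toy graph with placeholder names (not real PrimeKG entries).
DRUG_TO_PROTEINS = {"drug_A": ["protein_1", "protein_2"]}
PROTEIN_TO_DISEASES = {
    "protein_1": ["disease_X", "disease_Y"],
    "protein_2": ["disease_Y", "disease_Z"],
}

def two_hop_answers(drug):
    """Collect every disease reachable through any bridge protein (deduplicated)."""
    answers = set()
    for protein in DRUG_TO_PROTEINS.get(drug, []):
        answers.update(PROTEIN_TO_DISEASES.get(protein, []))
    return sorted(answers)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_match(pred_vec, gold_vecs, threshold=0.9):
    """Assumed rule: correct if close enough to at least one gold answer."""
    return any(cosine(pred_vec, g) >= threshold for g in gold_vecs)

def precision(pred_vecs, gold_vecs, threshold=0.9):
    """Fraction of predictions that match some gold answer."""
    if not pred_vecs:
        return 0.0
    hits = sum(is_match(p, gold_vecs, threshold) for p in pred_vecs)
    return hits / len(pred_vecs)

# Stand-in 2-D vectors; a real setup would embed the answer strings instead.
TOY_EMBEDDINGS = {
    "disease_X": [1.0, 0.0],
    "disease_Y": [0.0, 1.0],
    "disease_Z": [1.0, 1.0],
}

gold = [TOY_EMBEDDINGS[d] for d in two_hop_answers("drug_A")]
predictions = [[0.99, 0.05], [-1.0, 0.2]]  # one near disease_X, one far from all
score = precision(predictions, gold)
print(two_hop_answers("drug_A"), score)
```

In this toy run the first prediction lands close to one gold answer and the second matches none, so half the predictions count as correct. The bridge proteins never appear in the scored output, which is why a model that cannot infer them has nothing to fall back on in the 2-hop setting.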
💻Code:
huggingface.co/datasets/know…
📜Paper:
arxiv.org/abs/2505.22240v1
#BioHopR #LLMs #BiomedicalAI #MultiHopReasoning #BioNLP #KnowledgeGraphs #MedQA