Are Smaller Open-Weight LLMs Closing the Gap to Proprietary Models for Biomedical Question Answering?
1. This study explores whether smaller open-weight large language models (LLMs) can effectively replace larger closed-source models in biomedical question answering. The authors participated in Task 13B Phase B of the BioASQ challenge and compared several open-weight models against top-performing proprietary ones like GPT-4o and Claude 3.5 Sonnet.
2. The researchers used various techniques to enhance question answering capabilities, including retrieving the most relevant snippets based on embedding distance, in-context learning, and structured outputs. For certain submissions, ensemble approaches were utilized to leverage the diverse outputs generated by different models for exact-answer questions.
3. The results show that open-weight LLMs perform comparably to proprietary ones and in some instances even surpass their closed counterparts, particularly when ensembling strategies are applied. This suggests that smaller open-weight models can be competitive in biomedical question answering tasks.
4. The study highlights the importance of utilizing in-context learning and selecting the best snippets for improving the performance of LLMs in biomedical question answering. The authors also experimented with different prompting strategies and found that hand-crafted prompts worked better than automated prompt generation for certain question types.
5. The authors tested multiple models, including Phi-4, Gemma-3-12B, Qwen2.5-14B, and Meditron Phi-4-14B, and found that ensembling methods, especially combining open and closed models, led to improved performance for factoid and list questions. This indicates that integrating diverse LLM families can enhance the overall performance.
6. For summary questions, the open-weight model Phi-4 exhibited promising performance in terms of ROUGE metrics. The authors used a cross-encoder reranking approach to select the best summary from candidate summaries generated by different models, showing the potential of open-weight models in generating high-quality summaries.
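
To make two of the ideas above concrete (picking snippets by embedding distance, and ensembling exact answers from several models), here is a minimal Python sketch. It is not the authors' code: the function names, the toy vectors, and the use of cosine distance and simple majority voting are all assumptions made for illustration. The paper's actual embedding model and ensembling rule may differ.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def top_k_snippets(question_vec, snippets, k=3):
    """Return the k snippet texts closest to the question embedding.

    `snippets` is a list of (text, embedding) pairs; the embeddings would
    come from some embedding model, which is out of scope here.
    """
    ranked = sorted(snippets, key=lambda s: cosine_distance(question_vec, s[1]))
    return [text for text, _ in ranked[:k]]


def majority_vote(answers):
    """Pick the most common answer across models (hypothetical ensembling rule).

    Answers are lowercased and stripped so trivial formatting differences
    do not split the vote. Returns None if no usable answer is given.
    """
    normalized = [a.strip().lower() for a in answers if a and a.strip()]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    snippets = [
        ("snippet A", [1.0, 0.0]),
        ("snippet B", [0.0, 1.0]),
        ("snippet C", [0.9, 0.1]),
    ]
    print(top_k_snippets([1.0, 0.0], snippets, k=2))
    print(majority_vote(["Yes", "yes ", "no"]))
```

In a real pipeline the selected snippets would be placed in the prompt alongside a few in-context examples, and the vote would run over the exact answers returned by each model in the ensemble.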
📜Paper:
arxiv.org/abs/2509.18843
#BiomedicalQuestionAnswering #LargeLanguageModels #OpenWeightLLMs #Ensembling #InContextLearning #BioASQChallenge