Framework Aims to Enhance Evaluation of Language Models in Healthcare
Pittsburgh, October 2024 - A newly proposed framework seeks to enhance the evaluation of large language models (LLMs) in healthcare by emphasizing human evaluation processes. This comprehensive framework, known as QUEST, is designed to address current gaps in the reliability and applicability of these models, which are increasingly used in medical decision-making support and patient education.
The surge in the use of generative artificial intelligence (#GenAI) and LLMs, such as #gpt4, in healthcare necessitates robust evaluation methods to ensure these technologies are safe, accurate, and effective. This study identifies significant gaps in the current evaluation methodologies through a literature review of 142 studies. To bridge these gaps, the researchers propose #QUEST, a framework subdivided into planning, implementation, and adjudication phases, focusing on five core evaluation principles, namely Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence.
The study comes amid the rapid adoption of LLMs in healthcare. #LLMs have the potential to transform patient care by integrating vast medical knowledge into healthcare workflows, acting as clinical decision support systems, and enhancing health literacy. However, existing evaluation practices often rely on automated metrics that lack the depth needed to capture the human-like interactions these models are intended to replicate.
The review examined human evaluation strategies across medical specialties in detail. The findings highlighted a prevalent reliance on automated metrics, underscoring the need for human evaluators to assess key qualities such as empathy, bias, and logical reasoning, which are better captured through human judgment.
The study's authors suggest that QUEST will enable more consistent, high-quality evaluations that align closely with the safety and effectiveness benchmarks required in healthcare. "Adopting standardized human evaluation practices is critical for advancing the use of LLMs in medicine," writes @yanshan_wang, the study's principal investigator. "QUEST provides a structured approach to systematically evaluate these models, ensuring they meet the unique challenges presented by medical applications."
This framework seeks not only to improve current evaluation practices but also to catalyze further research in the burgeoning field of #GenAI and healthcare, aiming to enrich future developments with reliable, reproducible human assessment methodologies.
As healthcare systems increasingly turn to artificial intelligence for support, this study highlights the importance of a comprehensive and practical evaluation framework like #QUEST, which promises to align current LLM evaluations with high standards of patient safety and clinical effectiveness. Future work will explore the integration of this framework across diverse medical domains, driving further innovation at the intersection of technology and healthcare.
Tam, T. Y. C., Sivarajkumar, S., Kapoor, S., Stolyar, A. V., Polanska, K., McCarthy, K. R., Osterhoudt, H., Wu, X., Visweswaran, S., Fu, S., Mathur, P., Cacciamani, G. E., Sun, C., Peng, Y., & Wang, Y. (2024). A framework for human evaluation of large language models in healthcare derived from literature review. npj Digital Medicine, 7(258).
tnyp.me/GsldjHvW @npjDigitalMed @Pub2Post