Beyond Scores: Toward Interaction-Based Evaluation of Large Language Models (Position Paper)
Current evaluation methods for large language models (LLMs) face growing challenges, including benchmark contamination, score saturation, and models that recognize and strategically subvert the...
zenodo.org