PeerRank: Autonomous LLM Evaluation Through Web-Grounded,...
Evaluating large language models typically relies on human-authored benchmarks, reference answers, and human or single-model judgments, approaches that scale poorly, quickly become outdated, and...
arxiv.org