1/5 📣 Excited to share “LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks”!
arxiv.org/abs/2406.18403 We introduce JUDGE-BENCH, a benchmark to investigate to what extent LLM-generated judgements align with human evaluations.
#NLProc