🎙️ Invited Talk at Northeastern University (@Northeastern) on 𝐋𝐋𝐌-𝐚𝐬-𝐚-𝐉𝐮𝐝𝐠𝐞 / 𝐀𝐮𝐭𝐨𝐫𝐚𝐭𝐞𝐫𝐬
➜ Talk:
youtube.com/watch?v=N_DwZR--…
➜ Slides:
prezi.com/view/w4tlHnXl8Byqg…
➜ Primer:
autorater.aman.ai
Thanks, Dr. Divya Chaudhry, for hosting me.
• 𝐀𝐛𝐬𝐭𝐫𝐚𝐜𝐭:
As Large Language Models (LLMs) become central to modern AI systems, evaluating their outputs has emerged as a key challenge. Traditional metrics such as BLEU and ROUGE fail to capture semantic correctness, reasoning quality, and alignment with human judgment, while human evaluation is costly and unscalable. This talk introduces LLM-as-a-Judge, a paradigm that uses LLMs as structured evaluators to produce scores, rankings, and rationales that approximate human preferences.
We show how LLM-as-a-Judge naturally connects to Learning-to-Rank frameworks through pointwise, pairwise, and listwise evaluation, enabling applications such as model benchmarking, dataset filtering, reward modeling, and reranking in production systems. We also cover practical system design, including prompt-based judges, fine-tuned ranking models, and emerging techniques such as prompt optimization and reinforcement learning.
Finally, we examine key limitations -- including bias, prompt sensitivity, and reward hacking -- and discuss mitigation strategies such as judge ensembles and calibration methods. We conclude with a practical framework for deploying reliable and scalable evaluation systems, positioning LLM-as-a-Judge as a foundational component of modern AI pipelines.
• 𝐑𝐞𝐥𝐞𝐯𝐚𝐧𝐭 𝐏𝐚𝐩𝐞𝐫𝐬:
➜ Motivation and Foundations of LLM-as-a-Judge
Zheng et al., 2023, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”:
arxiv.org/abs/2306.05685
Novikova et al., 2017, “Why We Need New Evaluation Metrics for NLG”:
aclanthology.org/D17-1238/
➜ Learning-to-Rank Foundations
Burges et al., 2005, “Learning to Rank Using Gradient Descent”:
doi.org/10.1145/1102351.1102…
Burges, 2010, “From RankNet to LambdaRank to LambdaMART: An Overview”:
microsoft.com/en-us/research…
Cao et al., 2007, “Learning to Rank: From Pairwise Approach to Listwise Approach”:
doi.org/10.1145/1273496.1273…
➜ Neural Ranking Architectures
Nogueira and Cho, 2019, “Passage Re-Ranking with BERT”:
arxiv.org/abs/1901.04085
Nogueira et al., 2019, “Multi-Stage Document Ranking with BERT”:
arxiv.org/abs/1910.14424
Izacard and Grave, 2020, “Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering”:
arxiv.org/abs/2007.01282
Yoon et al., 2024, “ListT5: Listwise Re-Ranking with Fusion-in-Decoder Improves Zero-Shot Retrieval”:
arxiv.org/abs/2402.09317
➜ Specialized LLMs-as-a-Judge
Kim et al., 2023, “Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models”:
arxiv.org/abs/2310.08491
Kim et al., 2024, “Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models”:
arxiv.org/abs/2405.01535
➜ Prompt Optimization
Pryzant et al., 2023, “Automatic Prompt Optimization with ‘Gradient Descent’ and Beam Search”:
arxiv.org/abs/2305.03495
➜ Reinforcement Learning for LLM Judges
Whitehouse et al., 2025, “J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning”:
arxiv.org/abs/2505.10320
Chen et al., 2025, “JudgeLRM: Large Reasoning Models as a Judge”:
arxiv.org/abs/2504.00050
➜ Panels and Multimodal LLM Judges
Verga et al., 2024, “Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models”:
arxiv.org/abs/2404.18796
Liu et al., 2023, “Visual Instruction Tuning” (LLaVA):
arxiv.org/abs/2304.08485
Lee et al., 2024, “Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation”:
arxiv.org/abs/2401.05201
#AI #LLMs #GenAI