I've recently been struggling with the level of noise in LLM evaluations. Inspired by this, I did a deep dive into practical / applied statistics for LLM evals. Tomorrow morning, I'll publish my learnings so far in a long-form writeup on my blog.
Statistics is a huge field, but a lot of the most important concepts for approaching evals rigorously are relatively easy to learn and apply. With a grasp of applied statistics for LLM evals, you can:
1. better interpret results (i.e., understand whether they are meaningful or just caused by noise).
2. design evals in a way that is conducive to drawing more confident conclusions.
Both of these points help us to run faster and more efficient experiments, rather than wasting time and compute chasing noise.
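As a rough illustration of point 1, here is a minimal sketch of checking whether the gap between two models on the same eval questions is distinguishable from noise. It assumes per-question scores (e.g. 0/1 correctness) and uses a normal-approximation (CLT-based) 95% interval on the paired differences; the function names and the toy scores are hypothetical, and with small eval sets this approximation may not be reliable.

```python
import math


def mean_and_stderr(values):
    """Return the sample mean and the standard error of the mean."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a standard error")
    mean = sum(values) / n
    # Unbiased sample variance (divides by n - 1).
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


def paired_difference(scores_a, scores_b, z=1.96):
    """Mean per-question score difference (A - B) with a normal-approximation interval.

    Both lists must hold scores for the same questions in the same order.
    z=1.96 corresponds to a roughly 95% interval under the CLT assumption.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same questions")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean, stderr = mean_and_stderr(diffs)
    return mean, (mean - z * stderr, mean + z * stderr)


# Toy, made-up scores for illustration only (1 = correct, 0 = incorrect).
model_a = [1, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 0, 0, 1, 0, 0, 1, 1]

diff, (low, high) = paired_difference(model_a, model_b)
print(f"mean difference: {diff:.3f}, approx. 95% interval: ({low:.3f}, {high:.3f})")
if low <= 0.0 <= high:
    print("The interval includes 0, so the gap could plausibly be noise.")
else:
    print("The interval excludes 0, so the gap is unlikely to be noise alone.")
```

Pairing the scores per question, rather than comparing two independent averages, usually gives a tighter interval because question difficulty cancels out in the differences.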
Some of my favorite papers so far:
- A statistical approach to LLM evaluations:
arxiv.org/abs/2411.00640
- Don't use the CLT in LLM evals with fewer than a few hundred data points:
arxiv.org/abs/2503.01747
- Quantifying variance in evaluation benchmarks:
arxiv.org/abs/2406.10229
- A framework for reducing uncertainty in LLM evaluation:
arxiv.org/abs/2508.13144