Here is the short version of what I learned.
1. Capability benchmarks are not the same as capability. MMLU tests broad knowledge with multiple-choice questions. HumanEval tests whether a model can write isolated Python functions. SWE-bench tests whether an agent can fix a real GitHub issue end to end. A 90% score on MMLU tells you almost nothing about whether the model will be useful in your stack.
2. The real categories are wider than most people think. Beyond capability, you need to measure safety (DecodingTrust, HarmBench, WMDP, MLCommons AILuminate), agent behavior in realistic environments (AgentBench, WebArena, OSWorld, GAIA), retrieval quality for RAG systems (RAGAS, ARES, TruLens), and embedding quality (MTEB, BEIR). HELM is the best-known framework that tries to evaluate many of these dimensions in one place. It tracks accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency together.