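3. "LLM as judge" is a real tool, but it has known failure modes. Judge models such as Prometheus-2, JudgeLM, and Auto-J can replace expensive human evals at scale, but they bring biases of their own: position bias (favoring an answer because of where it appears in the prompt), length bias (favoring longer answers), style bias, and self-enhancement bias (favoring their own outputs). The fix is not to avoid them. It is to use multiple judges, randomize answer order, and ground the judges in explicit rubrics.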
4. The biggest pitfall is contamination. If a model was pretrained on the public web, it has likely seen the benchmark you are about to test it on. MMLU-CF, SWE-bench Verified, and dynamically generated test sets exist for a reason. Always report the model's training cutoff so readers can tell whether the benchmark was public before it.
5. Production evaluation is a different sport. Frameworks like DeepEval, Promptfoo, LangSmith, Braintrust, Galileo, and Weights & Biases exist because lab benchmarks do not reliably predict production behavior. CLEAR goes further and scores cost, latency, and reliability alongside accuracy.
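
The mitigation named in point 3 (multiple judges, randomized order, rubric grounding) can be sketched in a few lines. This is a minimal illustration, not any particular framework's API: the `Judge` signature, the `RUBRIC` text, and the two stub judges are all hypothetical stand-ins for real model calls.

```python
import random
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge signature (an assumption for this sketch):
# (rubric, question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str, str], str]

# Hypothetical rubric; a real one would be task-specific and more detailed.
RUBRIC = "Prefer the answer that is factually correct and complete. Ignore length and style."


def judge_pair(
    judges: Sequence[Judge],
    question: str,
    answer_a: str,
    answer_b: str,
    rng: random.Random,
    trials: int = 3,
) -> Counter:
    """Collect votes from several judges, randomizing answer order on every call."""
    votes: Counter = Counter()
    for judge in judges:
        for _ in range(trials):
            # Randomly decide which answer the judge sees first.
            swapped = rng.random() < 0.5
            first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
            verdict = judge(RUBRIC, question, first, second)
            if verdict not in ("first", "second", "tie"):
                raise ValueError(f"unexpected verdict: {verdict!r}")
            if verdict == "tie":
                votes["tie"] += 1
            elif (verdict == "first") != swapped:
                # Map the positional verdict back to the underlying answer.
                votes["a"] += 1
            else:
                votes["b"] += 1
    return votes


def winner(votes: Counter) -> str:
    """Return "a", "b", or "tie" by simple majority over all votes."""
    if votes["a"] > votes["b"]:
        return "a"
    if votes["b"] > votes["a"]:
        return "b"
    return "tie"


# Stub judges standing in for real LLM calls (hypothetical).
def always_first(rubric: str, question: str, first: str, second: str) -> str:
    """A purely position-biased judge: always prefers whatever it sees first."""
    return "first"


def content_judge(rubric: str, question: str, first: str, second: str) -> str:
    """A judge that looks at content only (here, a hard-coded correct answer)."""
    if "Paris" in first and "Paris" not in second:
        return "first"
    if "Paris" in second and "Paris" not in first:
        return "second"
    return "tie"
```

With randomized order, the votes of a position-biased judge such as `always_first` split roughly evenly between the two answers, so they tend to cancel out, while a content-sensitive judge keeps voting for the same answer regardless of position. Randomization does nothing for length, style, or self-enhancement bias; those are what the rubric and the use of several different judge models are meant to address.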