The paper argues LLMs as judges are shaky, so their scores should not be trusted without strict checks.
Small wording tweaks can push safety judges to mislabel up to 100% of dangerous outputs as safe.
Large language models as judges, or LLJs, are used to grade answers the way a human would, on criteria such as relevance, accuracy, and fluency.
Validity means the score truly reflects the target quality, and reliability means repeated runs give roughly the same score.
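A rough way to make the reliability half of that definition concrete is to score the same answer several times and look at the spread. The sketch below is not from the paper; the function names and the `max_range` threshold are illustrative assumptions. It also says nothing about validity, which repeated runs cannot establish.

```python
from statistics import mean, pstdev


def reliability_summary(scores):
    """Summarize repeated judge scores for a single item.

    `scores` is a list of numeric scores from re-running the same judge
    on the same input (hypothetical data, not from the paper).
    """
    return {
        "mean": mean(scores),
        "spread": pstdev(scores),            # population standard deviation
        "range": max(scores) - min(scores),  # worst-case disagreement
    }


def is_reliable(scores, max_range=1):
    """Treat a judge as stable on this item if its repeated scores stay
    within `max_range` points of each other (an assumed threshold)."""
    return max(scores) - min(scores) <= max_range
```

A judge can pass this check and still be invalid: scores that are consistently wrong are stable too.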
The paper tests 4 claims (that LLJs mimic humans, evaluate well, scale easily, and cut costs) and finds gaps in each.
Human labels in popular datasets vary a lot, so correlating with them does not prove the judge understands the task.
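One common way to quantify how much two human annotators disagree is Cohen's kappa, which corrects raw agreement for chance. This is a generic sketch, not the paper's analysis; the function name and the toy labels are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Returns 1.0 for perfect agreement and roughly 0.0 for agreement
    no better than chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if both annotators labelled at random
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(freq_a) | set(freq_b)
    )

    if expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined,
        # so report perfect agreement by convention in this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)
```

If kappa between humans is low, a judge that correlates with one set of human labels may just be matching noise.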
As evaluators, LLJs often ignore instructions, conflate criteria, reward verbosity, and fall for position bias or prompt decoration.
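Position bias in particular is easy to probe: show the judge the same pair of answers in both orders and count how often its preference fails to follow the answer. The sketch below assumes a hypothetical `judge(question, answer_a, answer_b)` callable that returns `"A"` or `"B"`; it is not an API from the paper or from any library.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface: returns "A" or "B" for the preferred slot.
Judge = Callable[[str, str, str], str]


def position_inconsistency_rate(
    judge: Judge, pairs: Iterable[Tuple[str, str, str]]
) -> float:
    """Fraction of (question, answer1, answer2) items where the judge's
    choice does not track the same answer after the order is swapped."""
    inconsistent = 0
    total = 0
    for question, answer1, answer2 in pairs:
        first = judge(question, answer1, answer2)   # answer1 sits in slot A
        second = judge(question, answer2, answer1)  # answer1 sits in slot B
        # A consistent judge picks the same underlying answer both times,
        # so the chosen slot must flip when the order flips.
        consistent = (first, second) in {("A", "B"), ("B", "A")}
        if not consistent:
            inconsistent += 1
        total += 1
    return inconsistent / total if total else 0.0
```

A rate near 0 means the verdict follows the content; a rate near 1 means it follows the slot.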
At scale, training with and then testing under related judges leaks preferences, favors shared style, and inflates leaderboard standings.
A lower price per run hides costs such as energy use and long-term data quality risks when human judgment is removed.
Judge scores need proof they reflect the intended skill, otherwise they can quietly misguide progress.
----
Paper – arxiv.org/abs/2508.18076
Paper Title: "Neither Valid nor Reliable? Investigating the Use of LLMs as Judges"