Billions of dollars are being spent to get models to beat benchmarks that are hilariously bad. A story in 7 parts.
MMLU Numerology. This benchmark is the flagship one in ML, used to grade Llama 3, GPT-4, Phi-3 (released today) and pretty much every model in between.
But try these real, quoted-in-full, questions yourself:
Q. The complexity of the theory.,?
"1,2,3,4","1,3,4","1,2,3","1,2,4",
Q. Demand reduction.,?
"1,3,4","2,3,4","1,2,3","1,2,4",
Q. Predatory pricing.,?
"1,2,4","1,2,3,4","1,2","1,4",
Q. Cultural homogenization.,?
"1,3,4","1,2,3","1,2,3,4","2,3,4",
Dozens more like this (from just my own browsing) with the numbered options containing none of the source information.
Answers C, D, D, B (lol)
1/8