The key discovery: small LLMs are MORE confident on wrong answers than right ones.
This is calibration inversion: t=2.28 and t=−3.41 across thousands of iterations.
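
For anyone who wants to check for the same effect, here is a minimal sketch of a two-sample t statistic comparing confidence on wrong vs right answers. Big caveats: the post does not say which test produced the t-values above, so Welch's t-test is an assumption here, and `welch_t`, `conf_wrong`, and `conf_right` are hypothetical names with made-up toy numbers, not our data.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    std_err = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / std_err


# Toy, made-up confidences purely for illustration (assumption, not real results).
conf_wrong = [0.91, 0.88, 0.95, 0.90, 0.93]  # model confidence on wrong answers
conf_right = [0.80, 0.85, 0.78, 0.83, 0.81]  # model confidence on right answers

t_stat = welch_t(conf_wrong, conf_right)
# A positive t here means confidence was higher on the wrong answers.
print(f"t = {t_stat:.2f}")
```

A positive statistic with wrong answers as the first sample is what inversion looks like; a well-calibrated model should give a negative one.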
So we built a BNN selector that exploits exactly this: it ignores confidence and reads entropy.
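
To make the confidence vs entropy distinction concrete, here is a rough sketch, not the selector itself. It assumes "confidence" means the top probability and "entropy" means Shannon entropy of the answer distribution; `max_confidence`, `shannon_entropy`, and the two toy distributions are hypothetical.

```python
import math


def max_confidence(probs):
    """Confidence as the probability of the top-ranked answer."""
    return max(probs)


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Two toy distributions (assumptions for illustration) with the SAME top probability.
two_way_tie = [0.5, 0.5, 0.0, 0.0]
long_tail = [0.5, 1 / 6, 1 / 6, 1 / 6]

for name, dist in [("two_way_tie", two_way_tie), ("long_tail", long_tail)]:
    print(name, max_confidence(dist), round(shannon_entropy(dist), 3))
```

Both distributions report the same confidence, but their entropies differ, so entropy carries information about the rest of the distribution that top-probability confidence throws away.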
Result: a 5–7pp accuracy gain for ~1ms of overhead.