taught a model to catch itself lying — from its own activations, not its words.
a deliberate lie (knew it, caved) leaves a different fingerprint inside than an
honest mistake. holds across qwen, llama, gemma. wired into a loop, the agent
reads itself and undoes the cave — 0.23–0.27 accuracy, ~99% precision.
pre-registered. receipts on the repo. not "solved" — just real. that's the moat.