Surprisingly, video models can solve mazes, but not consistently. We understand little about how they reason, which makes these abilities hard to use reliably.
We investigate the denoising process and find that models commit to a plan early, which lets us screen far more candidates for better performance.
🧵