The second hypothesis: maybe the bottleneck is visual perception, not reasoning.
We ran a small diagnostic. For a few simple Blender tasks where the visible state can be manually transcribed, we fed Gemini a per-frame text transcript instead of the video. Gemini, which struggles on the video versions, solves the same tasks nearly perfectly from the transcript.
Note: This isn't a fix. Most VSTAT tasks, especially real-world ones, can't be hand-transcribed at all. But as a probe, it isolates the bottleneck: perception, not reasoning.
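For concreteness, here is a minimal sketch of what a per-frame transcript could look like. The thread does not specify the actual format, so the state layout (object name mapped to an x, y, z position), the `FrameState` type, and the `frames_to_transcript` helper are all hypothetical, chosen only to illustrate the idea of replacing video frames with text.

```python
from typing import Dict, List, Tuple

# Hypothetical scene state for one frame: object name -> (x, y, z) position.
# The real diagnostic's transcript format is not given in the thread.
FrameState = Dict[str, Tuple[float, float, float]]


def frames_to_transcript(frames: List[FrameState]) -> str:
    """Serialize hand-transcribed per-frame states into one text line per frame."""
    lines = []
    for i, state in enumerate(frames):
        # Sort by object name so the description order is stable across frames.
        objects = "; ".join(
            f"{name} at ({x:.1f}, {y:.1f}, {z:.1f})"
            for name, (x, y, z) in sorted(state.items())
        )
        lines.append(f"Frame {i}: {objects}")
    return "\n".join(lines)


# Illustrative data only: a cube moving along x while a sphere stays put.
frames = [
    {"cube": (0.0, 0.0, 0.0), "sphere": (2.0, 0.0, 0.0)},
    {"cube": (0.5, 0.0, 0.0), "sphere": (2.0, 0.0, 0.0)},
]
transcript = frames_to_transcript(frames)
print(transcript)
```

The resulting text would be sent to the model in place of the video, with the task question unchanged, so any gap in accuracy between the two inputs points at perception rather than reasoning.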
[5/11]