Holy shit… this model might be the closest thing we’ve seen to a true visual reasoning generalist 🤯
OneThinker isn’t another “better VQA model.” It’s a full multimodal reasoning system that handles images and videos in one brain, and it actually works across 10 major visual tasks. Not with separate heads stitched together, but with a single unified reasoning pipeline.
What grabbed me was how they built it.
They didn’t just scale data. They curated a massive 600k-sample multimodal dataset covering grounding, tracking, segmentation, captioning, spatial reasoning, temporal reasoning, and complex multimodal QA. Then they used a strong teacher model to rewrite the entire thing with chain-of-thought explanations, producing a 340k CoT SFT dataset for the cold start.
After that, they hit the real bottleneck: RL breaks when your tasks have wildly different reward structures. Math rewards look nothing like tracking rewards. Detection rewards look nothing like video reasoning rewards.
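To make the mismatch concrete, here is a minimal sketch of two reward functions of the kind commonly used in RL for these tasks. The post doesn’t spell out OneThinker’s actual reward definitions, so the function names, the box format `(x1, y1, x2, y2)`, and the choice of exact match and IoU are all assumptions for illustration.

```python
# Hypothetical reward functions; NOT taken from the OneThinker paper.

def exact_match_reward(prediction: str, answer: str) -> float:
    """Sparse, binary reward typical of math/QA: 1.0 if the answers match, else 0.0."""
    return 1.0 if prediction.strip().lower() == answer.strip().lower() else 0.0


def iou_reward(pred_box, gt_box) -> float:
    """Dense, continuous reward typical of grounding/tracking: box IoU in [0, 1].

    Boxes are assumed to be (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    """
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0
```

One reward only ever takes the values 0 or 1, while the other varies smoothly, so their means and variances across a batch of rollouts can look very different.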
Their fix is clever.
EMA-GRPO keeps a moving-average reward normalization per task, which stabilizes training across heterogeneous tasks. Standard GRPO overweights low-variance tasks. Dr.GRPO removes normalization entirely and gets dominated by sparse-reward tasks. EMA-GRPO threads the needle: balanced, predictable, stable.
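Here is a rough sketch of how the three normalization styles differ. This is not the paper’s exact formula: the class name `EmaRewardNormalizer`, the `decay` value, and the choice to track an EMA of each task’s within-group reward standard deviation are assumptions based only on the description above.

```python
import math


def _mean_std(rewards):
    """Population mean and standard deviation of one rollout group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return mean, std


def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style: divide by the group's own std, so tiny spreads get blown up."""
    mean, std = _mean_std(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr.GRPO-style: no std division, so scale follows the raw reward spread."""
    mean, _ = _mean_std(rewards)
    return [r - mean for r in rewards]


class EmaRewardNormalizer:
    """Hypothetical EMA-GRPO-style normalizer (an assumption, not the paper's code).

    Keeps one exponential moving average of reward std per task and divides
    each group's centered rewards by that task-level running scale.
    """

    def __init__(self, decay=0.99, eps=1e-6):
        self.decay = decay
        self.eps = eps
        self.ema_std = {}  # task name -> running EMA of reward std

    def advantages(self, task, rewards):
        mean, std = _mean_std(rewards)
        # The first group seen for a task initializes its running scale.
        prev = self.ema_std.get(task, std)
        ema = self.decay * prev + (1.0 - self.decay) * std
        self.ema_std[task] = ema
        return [(r - mean) / (ema + self.eps) for r in rewards]


normalizer = EmaRewardNormalizer(decay=0.9)
normalizer.advantages("tracking", [0.0, 1.0])         # wide spread sets the scale
tight = normalizer.advantages("tracking", [0.4, 0.6])  # later, a tight group
```

In this sketch, a single tight group no longer gets inflated to unit scale the way it would under per-group normalization, because the divisor reflects the task’s reward spread over time rather than just the current batch.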
And the results are ridiculous:
• 70.6% on MMMU
• 64.3% on MathVerse
• 84.4 R@0.5 on GOT-10k tracking
• 54.9 J&F on ReasonVOS segmentation
• A clean sweep across 31 benchmarks, covering 10 task families
• Even zero-shot generalization on tasks it wasn’t trained on
The real story is the unified training effect. Remove spatial grounding? Tracking drops. Remove image QA? Video QA weakens. Everything reinforces everything, which is exactly what multimodal reasoning is supposed to look like.
My favorite part: this thing handles long-horizon video reasoning tasks better than Video-R1 and Qwen3-VL. OneThinker scores 79.2% on LongVideo-Reason, which is insane for an 8B model juggling spatial and temporal cues at once.
The paper doesn’t oversell it, but the implications are obvious.
We’re inching toward a single model that can look at a scene, track it over time, describe it, answer questions about it, localize objects, segment them, and reason about events unfolding in motion, all without swapping architectures.
A step closer to AI that actually sees and thinks, not just classifies.
[Image: Research paper front page from The Chinese University of Hong Kong titled "OneThinker: All-in-one Reasoning Model for Image and Video" with authors, diagrams and abstract.]