How do we improve VLM post-training?
Don't just force longer-horizon multimodal reasoning.
Reshape the curriculum to match how VLMs learn. 👀➡️🧠
Excited to share our new #ICML2026 paper, in collaboration with @ucsc @amazon @UWaterloo @VectorInst:
"From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models"
🌐 Project: ucsc-vlaa.github.io/VLM-CapC…
📄 Paper: arxiv.org/abs/2605.20177
💻 Code: github.com/UCSC-VLAA/VLM-Cap…
🤗 HF Collections: UCSC-VLAA/VLM-CapCurriculum
Recent VLM post-training has largely focused on scaling up reasoning through RL and long chain-of-thought traces. But after auditing model failures across visual math, geometry, and diagram reasoning tasks, we found a different bottleneck:
👉 86.9% of Qwen3-VL-8B's errors originate from perception, not reasoning.
Once a model misreads the image, additional reasoning often just reinforces the wrong interpretation.
Our key insight is simple: before a model can reason better, it must first see better.
To achieve this, we introduce a simple new capability curriculum for VLM post-training:
🟦 Visual Perception → 🟩 Textual Reasoning → 🟨 Visual Reasoning
Instead of mixing all capabilities into a single RLVR stage, we train them sequentially, allowing perception to develop as a dedicated foundational capability.
The results are remarkably consistent:
✅ Staged post-training outperforms standard merged training across 4 VLM backbones (#Qwen2.5-VL, #Qwen3-VL, #InternVL3, #InternVL3.5)
✅ Better perception literally lets the model think less! Staged post-training on #Qwen3-VL-8B uses only 79% of the reasoning tokens of the merged model (20.8% shorter traces) while achieving 1.46% higher average accuracy.
✅ We also show that the capability curriculum and the traditional difficulty curriculum are complementary. On #Qwen3-VL-8B, combining both boosts average performance from 58.6 → 63.0, outperforming either curriculum alone.
The broader message is that curriculum learning for VLMs should not only consider how difficult an example is, but also which capability it develops.
📦 We are releasing everything:
1️⃣ A new capability curriculum paradigm, orthogonal to difficulty-based curricula
2️⃣ High-quality perception training data built with a scalable DOCCI-based synthesis and filtering pipeline
3️⃣ 4 staged-trained VLMs: #Qwen2.5-VL-7B, #Qwen3-VL-8B, #InternVL3-8B, and #InternVL3.5-8B
4️⃣ Full training, evaluation, and perception-error analysis code
Led by my PhD student @JJwu41867797, together with an outstanding team of collaborators: @HardyChen266091, @HaoqinT, Xianfeng Tang, @fredahshi, Hui Liu, Hanqing Lu, and @cihangxie