CVPR 2026 Embodied AI Highlight Papers
Active Perception · Visual Foresight · Embodied Cognitive Loops
1. ForeAct (MIT HAN Lab, Zhuoyang Zhang, Shang Yang et al., arXiv:2602.12322, github.com/mit-han-lab/forea…)
ForeAct delivers efficient visual foresight that steers any VLA via atomic visual goal imagination.
It addresses the failure mode where sufficient information already exists, but explicit future grounding is missing.
If SaPaVe (paper 2 below) answers “Do I know enough to act?”, ForeAct answers “Now that I know enough, what exactly should success look like?”
The core argument: existing VLAs are overloaded. They must simultaneously perform semantic reasoning, task decomposition, future prediction, and visuo-motor control.
ForeAct explicitly separates these responsibilities.
This resembles skill-library systems such as ManiSkill in spirit, but with a different abstraction:
ManiSkill decomposes tasks into reusable skills;
ForeAct decomposes tasks into reusable future states.
Unlike Sudo-style systems that reduce VLAs into lightweight coordinators over primitives, ForeAct keeps the VLA intact and steers it via visual foresight.
Closed-loop pipeline:
Qwen3-VL → subtask → ImGen → robot (multi-cam) → VLM monitor / re-plan
(finer granularity than ManiSkill skills; no VLA replacement, unlike Sudo-style coordination layers)
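The closed-loop pipeline above can be sketched as a small control loop. This is only an illustration of the plan → imagine → execute → monitor structure described here, not ForeAct's code: the four callables (plan_subtasks, imagine_goal, execute, is_done) and the retry limit are hypothetical stand-ins for the VLM planner, the image generator, the VLA, and the VLM monitor, and observations are plain strings.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepLog:
    subtask: str
    attempts: int
    succeeded: bool


def run_foresight_loop(
    plan_subtasks: Callable[[str], List[str]],  # hypothetical stand-in for the VLM planner
    imagine_goal: Callable[[str, str], str],    # hypothetical: (subtask, observation) -> goal image
    execute: Callable[[str, str], str],         # hypothetical: (observation, goal image) -> new observation
    is_done: Callable[[str, str], bool],        # hypothetical stand-in for the VLM monitor
    instruction: str,
    observation: str,
    max_attempts: int = 3,                      # assumed retry budget, not from the paper
) -> List[StepLog]:
    logs: List[StepLog] = []
    for subtask in plan_subtasks(instruction):
        attempts, succeeded = 0, False
        while attempts < max_attempts and not succeeded:
            goal = imagine_goal(subtask, observation)  # imagine what success looks like
            observation = execute(observation, goal)   # the policy is steered by the goal image
            attempts += 1
            succeeded = is_done(observation, subtask)  # monitor checks the result
        logs.append(StepLog(subtask, attempts, succeeded))
        if not succeeded:
            break  # a real system would re-plan here; that policy is out of scope
    return logs


# Toy stubs so the sketch runs end to end.
def toy_plan(instruction: str) -> List[str]:
    return ["pick cup", "place cup"]


def toy_imagine(subtask: str, observation: str) -> str:
    return f"goal<{subtask}>"


def toy_execute(observation: str, goal: str) -> str:
    return observation + "|" + goal


def toy_done(observation: str, subtask: str) -> bool:
    return f"goal<{subtask}>" in observation


logs = run_foresight_loop(toy_plan, toy_imagine, toy_execute, toy_done, "put cup away", "start")
```

The point of the structure is that the policy never sees the full instruction, only the current observation and one imagined goal at a time; the planner and monitor sit outside it.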
2. SaPaVe (Mengzhen Liu, Enshen Zhou et al., PKU / Beihang / BAAI, arXiv:2603.12193)
SaPaVe delivers the first end-to-end VLA unifying semantic active perception and manipulation via explicit decoupling.
It addresses insufficient information before action.
I was surprised that the human-like paradigm of “look again, look closer, look left and right” (combining perception and action) was not already well established in VLAs; it is extremely natural for embodied intelligence.
Core insight
SaPaVe targets the regime where the robot lacks occlusion understanding, grasp affordances, articulation state, or certainty that an action will succeed.
Existing VLAs operate under passive perception: fixed camera viewpoints and direct manipulation prediction from static observations.
However, active perception introduces a key coupling problem: moving the camera, manipulating objects, and reorienting objects all change the observations.
Traditional unified action spaces entangle camera-motion objectives with manipulation objectives.
SaPaVe resolves this via explicit decoupling.
Decoupled design
Embodied intelligence becomes a two-branch decision process:
- test information sufficiency
- if sufficient → act; if insufficient → active information acquisition.
SaPaVe and ForeAct together instantiate this loop:
reason → gather info → imagine futures → execute → verify → re-plan
(vs traditional perceive → act)
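The two-branch decision above (act if information is sufficient, otherwise look again) can be written as a few lines of control flow. This is a minimal sketch under stated assumptions: the confidence function, the look action, the threshold of 80, and the look budget are all hypothetical, and the observation is reduced to a single integer "percent of the object visible".

```python
from typing import Callable, List, Tuple


def decide(
    observation: int,
    confidence: Callable[[int], int],  # hypothetical sufficiency score for an observation
    look: Callable[[int], int],        # hypothetical camera move returning a new observation
    threshold: int = 80,               # assumed sufficiency threshold
    max_looks: int = 5,                # assumed budget for active perception
) -> Tuple[List[str], int]:
    trace: List[str] = []
    while confidence(observation) < threshold:
        if len(trace) >= max_looks:
            trace.append("abort")      # never became confident enough to act
            return trace, observation
        observation = look(observation)  # insufficient: acquire more information
        trace.append("look")
    trace.append("act")                # sufficient: hand over to manipulation
    return trace, observation


# Toy run: each look reveals 30 more percent of the object.
trace, final_obs = decide(20, confidence=lambda o: o, look=lambda o: o + 30)
```

With these toy numbers the loop looks twice and then acts; if looking never helps, it stops after the budget instead of acting blind.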
SaPaVe architecture
Camera Action Decoder: 2 DoF (pitch, yaw), embodiment-agnostic semantic viewpoint control, supports commands such as “look left / zoom / inspect behind”
Manipulation Action Decoder: 26 DoF joint positions, dual-arm dexterity
Decoupled heads outperform a unified decoder (71.25% vs a lower unified-decoder baseline).
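To make the decoupling concrete, here is a toy version of two separate action heads over one shared feature vector. Only the output sizes (2 and 26) come from the text above; the linear heads, the feature dimension, and the class name are assumptions for illustration and are far simpler than the actual decoders.

```python
import random
from typing import Dict, List

CAMERA_DOF = 2   # pitch, yaw (from the text)
MANIP_DOF = 26   # joint positions (from the text)


def make_linear(in_dim: int, out_dim: int, rng: random.Random) -> List[List[float]]:
    # One weight row per output dimension.
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights: List[List[float]], x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class DecoupledHeads:
    """Hypothetical stand-in: two independent heads reading the same features."""

    def __init__(self, feature_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.camera = make_linear(feature_dim, CAMERA_DOF, rng)
        self.manip = make_linear(feature_dim, MANIP_DOF, rng)

    def __call__(self, features: List[float]) -> Dict[str, List[float]]:
        return {
            "camera": apply_linear(self.camera, features),
            "manipulation": apply_linear(self.manip, features),
        }


heads = DecoupledHeads(feature_dim=8)
features = [0.5] * 8
out = heads(features)

# Changing the camera head leaves the manipulation output untouched.
heads.camera = [[0.0] * 8 for _ in range(CAMERA_DOF)]
out_after = heads(features)
```

Because the heads share no parameters, a camera objective can be trained or swapped without perturbing manipulation outputs, which is the property a unified action space does not give you.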
Camera / perception modules
Camera Adapter: LoRA on Eagle-2 VLM, <2% trainable parameters, learns semantic active perception priors, preserves base manipulation knowledge
Universal Spatial Encoder (MapAnything): injects depth, intrinsics, extrinsics, and arbitrary geometry; element-wise fused into VLM tokens and the action head during denoising; enforces view-invariant 3D consistency; improves performance by ~15% even on simple tasks.
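For intuition on why a LoRA adapter stays in the low single-digit percent range of trainable parameters, the arithmetic for one weight matrix is short. The layer size (4096 x 4096) and rank (16) below are illustrative assumptions, not Eagle-2's actual configuration; the real fraction depends on which layers are adapted.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA adds two low-rank factors: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    # Trainable share of one adapted layer: adapter / (frozen base + adapter).
    base = d_in * d_out
    added = lora_params(d_in, d_out, rank)
    return added / (base + added)


# Hypothetical layer size and rank, chosen only to illustrate the scale.
fraction = lora_fraction(4096, 4096, 16)
print(f"trainable share of this layer: {fraction:.2%}")
```

The frozen base weights are what preserve the existing manipulation knowledge; only the small low-rank factors are updated.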
3. Long-horizon cognition: WoW (arXiv:2509.22642)
WoW is a 14B embodied world model trained on 2M robot trajectories (not passive video).
Key mechanism: the SOPHIA self-optimizing loop: generate, VLM critique (physical and causal validity), rewrite, regenerate.
This improves consistency, collision reasoning, and causal validity.
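The generate → critique → rewrite → regenerate cycle has a simple shape, sketched below. This is not WoW's implementation: generate, critique, and rewrite are hypothetical callables, the acceptance score and round limit are assumptions, and the "video" is just a string so the loop can run.

```python
from typing import Callable, List, Tuple


def refine(
    prompt: str,
    generate: Callable[[str], str],               # hypothetical world-model rollout
    critique: Callable[[str], Tuple[int, str]],   # hypothetical VLM critic: (score, feedback)
    rewrite: Callable[[str, str], str],           # hypothetical prompt rewriter
    accept: int = 2,                              # assumed acceptance score
    max_rounds: int = 5,                          # assumed round budget
) -> Tuple[str, List[int]]:
    scores: List[int] = []
    video = ""
    for _ in range(max_rounds):
        video = generate(prompt)
        score, feedback = critique(video)
        scores.append(score)
        if score >= accept:
            break                                 # critic is satisfied
        prompt = rewrite(prompt, feedback)        # fold the critique into the prompt
    return video, scores


# Toy stubs: each rewrite adds one "[fix]" marker that the critic counts.
video, scores = refine(
    "push block",
    generate=lambda p: p,
    critique=lambda v: (v.count("[fix]"), "add physical constraint"),
    rewrite=lambda p, feedback: p + " [fix]",
)
```

The returned video is always the last one the critic actually scored, so a caller can read the final score to tell an accepted rollout from an exhausted budget.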
Unlike video-only world models, WoW learns physical dynamics directly from embodied interaction.
It also introduces an inverse dynamics model that outputs executable actions, achieving SOTA in manipulation simulation and on real Franka setups.
Overall implication: embodied pretraining may function as meta-learning for intuitive physics.
4. Agent OS / Robotics orchestration: Maestro (maestro-robot.github.io)
Maestro reframes VLAs as modules inside a robot operating system layer.
This OS layer is responsible for: deciding information sufficiency, invoking SaPaVe / ForeAct / WoW, tracking long-horizon state, selecting primitives / policies, and maintaining task memory across time.
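The responsibilities listed above amount to a dispatcher over task state. The sketch below shows that idea only; it is not Maestro's API. TaskState, the module names, and the fixed ordering are hypothetical, and each "module call" is reduced to flipping a flag.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskState:
    info_sufficient: bool = False
    goal_imagined: bool = False
    executed: bool = False
    memory: List[str] = field(default_factory=list)  # task memory across steps


def next_module(state: TaskState) -> str:
    # Hypothetical routing rule: perceive, then imagine, then act.
    if not state.info_sufficient:
        return "active_perception"   # a SaPaVe-like module
    if not state.goal_imagined:
        return "visual_foresight"    # a ForeAct-like module
    if not state.executed:
        return "policy"              # the VLA or a primitive
    return "done"


def step(state: TaskState) -> str:
    module = next_module(state)
    state.memory.append(module)      # record what was invoked
    if module == "active_perception":
        state.info_sufficient = True
    elif module == "visual_foresight":
        state.goal_imagined = True
    elif module == "policy":
        state.executed = True
    return module


state = TaskState()
while step(state) != "done":
    pass
```

A real orchestrator would let modules fail and reset flags (for example, clearing info_sufficient after the scene changes), which is where long-horizon state tracking starts to matter.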
Pure VLAs remain weak at long-horizon reasoning.
Missing system components (explicit gaps): causal latent learning (MPI-style), diffusion MPC, and tighter integration between generative world models and real-time control.
Related systems (e.g., Dexmate) similarly argue for: representation layers, world models, agentic harnesses, modular execution systems.
The emerging paradigm: robotics as orchestration, not monolithic policy learning.
Conclusion
SaPaVe (information acquisition layer): semantic active perception, embodiment-agnostic camera control, decoupled action modeling, geometry-aware viewpoint reasoning.
ForeAct (future grounding layer): atomic subtask decomposition, visual goal imagination, efficient diffusion-based foresight, plug-and-play steering of existing VLAs.
System stack: above both layers sit embodied world models (WoW), agentic orchestration frameworks (Maestro), and representation-centric architectures (Dexmate).
Likely missing ingredients to close the loop: causal latent representation learning, diffusion-based model predictive control, MPI-style causal world modeling frameworks.
@CVPR @CVPRConf @saturdayrobotic #CVPR2026