🤖 @CVPR 2026 Hot 🔥 Takes on Embodied AI: VLA × World Models × Agentic Loops @CVPRConf
Embodied AI is converging toward a unified stack: VLA policies, world models, and active perception, connected by hierarchical memory, reusable skills, and long-horizon orchestration.
🔹 Trends
• Scenario-level generalization under distribution shift (novel objects, clutter, lighting) without task finetuning.
• Sim-scale pretraining → real-world adaptation.
• Language-conditioned manipulation, hierarchical planning, reusable skills.
• Scaling axes: larger multimodal FMs, recursive refinement loops, test-time compute (reasoning/planning).
• Shift from discrete query-response systems → continuous inference, streaming state maintenance, and full-duplex perception-action loops.
🔹 @sudo_robotics
• Hierarchical VLA: language planner → skill toolbox → actions.
• Real2Sim2Real pipeline with ManiSkill3 and SAPIEN.
• Foundation-model approach: scale simulation, reusable skills, language-promptable robots.
• Generalizes from fish-oil softgels to unseen plush toys across booths with zero task-specific finetuning.
• ViTaMIn-B-style visuo-tactile sensing.
• Clever hardware: multi-monocular cameras outperform stereo depth for hand-object visibility and reduced finger occlusion.
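A minimal sketch of the planner → skill toolbox → actions hierarchy described above. Everything here is an assumption for illustration: the `plan` parser stands in for a language planner, and the `pick`/`place` skills and action strings are hypothetical, not the booth's actual stack.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical skill toolbox: each skill expands to low-level action strings.
def pick(obj: str) -> List[str]:
    return [f"move_to({obj})", "close_gripper()"]

def place(target: str) -> List[str]:
    return [f"move_to({target})", "open_gripper()"]

SKILLS: Dict[str, Callable[[str], List[str]]] = {"pick": pick, "place": place}

def plan(instruction: str) -> List[Tuple[str, str]]:
    """Toy stand-in for the language planner; only parses 'put X in Y'."""
    words = instruction.lower().split()
    obj, target = words[1], words[-1]
    return [("pick", obj), ("place", target)]

def run(instruction: str) -> List[str]:
    """Expand each planned skill call into its low-level actions."""
    actions: List[str] = []
    for skill_name, arg in plan(instruction):
        actions.extend(SKILLS[skill_name](arg))
    return actions

print(run("put softgel in bin"))
```

The point of the split: swapping softgels for plush toys changes only the argument passed to a skill, not the skill itself.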
🔹 @meta_aria
Perception-first embodied engineering:
• Online calibration with temperature-aware compensation.
• Detects minute calibration drift with mm-level precision.
• Pixel-level exposure adaptation for HDR environments.
• Visual-inertial SLAM optimized for localization, not photography.
• Monochrome sensors improve feature extraction and long-term tracking robustness.
🔹 ForeAct (@MIT HAN Lab)
Visual foresight as a plug-and-play module for any VLA.
Pipeline:
Qwen3-VL → subtask decomposition → diffusion-based goal imagination → robot → VLM monitor → replanning.
Key idea:
Separate semantic reasoning, task decomposition, future prediction, and control.
ManiSkill decomposes tasks into skills; ForeAct decomposes tasks into future states.
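The decompose → imagine → execute → monitor → replan loop can be sketched as below. This is a toy under loud assumptions: `decompose`, `imagine_goal`, `monitor`, and `ToyRobot` are hypothetical stand-ins (string labels instead of generated images, a retry instead of real replanning), not ForeAct's code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ToyRobot:
    """Hypothetical robot: subtasks listed in `flaky` fail once, then succeed."""
    flaky: set = field(default_factory=set)
    state: List[str] = field(default_factory=list)

    def execute(self, subtask: str, goal: str) -> None:
        if subtask in self.flaky:
            self.flaky.discard(subtask)  # simulate one failed attempt
            return
        self.state.append(goal)

def decompose(task: str) -> List[str]:
    # Stand-in for the VLM planner: split a comma-separated task.
    return [s.strip() for s in task.split(",")]

def imagine_goal(subtask: str) -> str:
    # Stand-in for diffusion-based goal imagination: a label, not an image.
    return f"done:{subtask}"

def monitor(robot: ToyRobot, goal: str) -> bool:
    # Stand-in for the VLM monitor: did the imagined goal state appear?
    return goal in robot.state

def run(task: str, robot: ToyRobot, max_retries: int = 2) -> List[Tuple[str, Optional[int]]]:
    """Return (subtask, attempts needed) pairs; None means the subtask was abandoned."""
    log: List[Tuple[str, Optional[int]]] = []
    for subtask in decompose(task):
        goal = imagine_goal(subtask)
        for attempt in range(1 + max_retries):
            robot.execute(subtask, goal)
            if monitor(robot, goal):
                log.append((subtask, attempt + 1))
                break
        else:
            log.append((subtask, None))
    return log

print(run("open drawer, grasp cup", ToyRobot(flaky={"grasp cup"})))
```

Because the goal is imagined before execution, the monitor has something concrete to compare against, which is what makes the retry decision possible.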
🔹 SaPaVe (@PKU1898 / Beihang / BAAI)
First end-to-end VLA combining semantic active perception and manipulation.
Key insight:
If information is insufficient, acquire information before acting.
Architecture:
• Camera Action Decoder (2 DoF yaw/pitch semantic viewpoint control).
• Manipulation Decoder (26 DoF dual-arm control).
• Camera Adapter: LoRA on Eagle-2 VLM (<2% trainable params).
• Universal Spatial Encoder (MapAnything) injects depth, intrinsics, extrinsics, arbitrary geometry.
• ~15% performance gain from geometry-aware view-invariant reasoning.
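A toy illustration of "acquire information before acting": move a 2-DoF camera (yaw/pitch) until the target is visible enough, and only then manipulate. The `visibility` score, the 15-degree step, and the 0.9 threshold are all assumptions, and the toy camera policy is handed the target direction, which a learned decoder would have to infer.

```python
from typing import List, Tuple

def visibility(yaw: int, pitch: int, target: Tuple[int, int]) -> float:
    """Toy score in [0, 1]; 1.0 when the camera points straight at the target."""
    err = abs(yaw - target[0]) + abs(pitch - target[1])
    return max(0.0, 1.0 - err / 90.0)

def step_toward(value: int, goal: int, step: int = 15) -> int:
    """Move at most `step` degrees toward `goal`."""
    return value + max(-step, min(step, goal - value))

def act(target: Tuple[int, int], threshold: float = 0.9, max_looks: int = 10) -> List[str]:
    yaw, pitch = 0, 0
    trace: List[str] = []
    for _ in range(max_looks):
        # Information sufficiency check comes before any manipulation.
        if visibility(yaw, pitch, target) >= threshold:
            trace.append("manipulate")
            return trace
        yaw, pitch = step_toward(yaw, target[0]), step_toward(pitch, target[1])
        trace.append(f"look(yaw={yaw}, pitch={pitch})")
    trace.append("abort")  # never saw enough; refuse to act blind
    return trace

print(act((30, -15)))
```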
Together:
SaPaVe = gather information
ForeAct = imagine future outcomes
Loop: reason → inspect → imagine → execute → verify → replan.
🔹 WoW (14B World Model)
• Trained on 2M robot trajectories.
• SOPHIA self-optimization: generate → VLM critique → rewrite → regenerate.
• Improves causal validity, collision reasoning, consistency.
• Learns embodied physics directly from interaction.
• Inverse Dynamics module converts imagined futures into executable actions.
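The generate → critique → rewrite → regenerate cycle, sketched with string stand-ins. Assumptions: `generate` merely echoes the prompt, `critique` checks for two hypothetical cue words, and `rewrite` appends one missing cue per round. None of this is WoW's or SOPHIA's actual interface.

```python
from typing import List, Tuple

REQUIRED = ("gravity", "contact")  # hypothetical physical cues the critic looks for

def generate(prompt: str) -> str:
    # Stand-in for the world model: the "rollout" just echoes the prompt.
    return f"rollout<{prompt}>"

def critique(rollout: str) -> List[str]:
    # Stand-in for the VLM critic: list the cues the rollout is missing.
    return [cue for cue in REQUIRED if cue not in rollout]

def rewrite(prompt: str, issues: List[str]) -> str:
    # Address one issue per round by amending the prompt.
    return prompt + " | respect " + issues[0]

def refine(prompt: str, max_rounds: int = 4) -> Tuple[str, int]:
    """Return the final rollout and the number of rewrites it took."""
    for round_idx in range(max_rounds):
        rollout = generate(prompt)
        issues = critique(rollout)
        if not issues:
            return rollout, round_idx
        prompt = rewrite(prompt, issues)
    return generate(prompt), max_rounds

print(refine("push the cup"))
```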
🔹 Maestro
Robotics OS paradigm:
VLAs become modules inside an orchestration layer.
Responsibilities:
• Information sufficiency assessment.
• Invoke SaPaVe / ForeAct / WoW.
• Maintain long-horizon task memory.
• Policy/primitive selection.
• State tracking across time.
Emerging view:
Robotics is orchestration, not monolithic policy learning.
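One way to picture the orchestration layer: a small dispatcher that checks information sufficiency, picks a module, and logs every call as task memory. The `Orchestrator` class, its selection rules, and the three lambda modules are hypothetical placeholders, not Maestro's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Orchestrator:
    modules: Dict[str, Callable[[dict], dict]]
    memory: List[str] = field(default_factory=list)  # long-horizon task log

    def select(self, state: dict) -> str:
        # Information sufficiency first, then foresight, then control.
        if not state.get("target_visible"):
            return "perceive"
        if "goal_image" not in state:
            return "imagine"
        return "act"

    def step(self, state: dict) -> dict:
        name = self.select(state)
        self.memory.append(name)
        return {**state, **self.modules[name](state)}

# Placeholder modules; real ones would wrap perception, foresight, and policy models.
modules = {
    "perceive": lambda s: {"target_visible": True},
    "imagine": lambda s: {"goal_image": "cup_on_shelf"},
    "act": lambda s: {"done": True},
}

def run(orchestrator: Orchestrator, state: dict, max_steps: int = 10) -> dict:
    for _ in range(max_steps):
        if state.get("done"):
            break
        state = orchestrator.step(state)
    return state

orch = Orchestrator(modules)
print(run(orch, {}), orch.memory)
```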
🔹 @NVIDIAAI Cosmos3 Discussion: Always-On World Models @NVIDIARobotics
Hypothesis:
Future intelligence emerges from continuous prediction-reality mismatch correction.
Architecture:
• Persistent latent memory.
• Self-monologue dreaming loops.
• Continuous VLM auditing.
• Automatic memory pruning.
• Test-time learning as a first-class capability.
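A deliberately tiny sketch of mismatch-driven updating: a scalar predictor that corrects itself on every observation, keeps only surprising events, and prunes old ones through a bounded buffer. The class name, learning rate, and surprise threshold are assumptions chosen for illustration, not anything presented in the discussion.

```python
from collections import deque

class AlwaysOnModel:
    def __init__(self, lr: float = 0.5, memory_size: int = 3, surprise: float = 1.0):
        self.estimate = 0.0
        self.lr = lr
        self.surprise = surprise
        # Bounded buffer: the oldest entry is dropped automatically when full.
        self.memory = deque(maxlen=memory_size)

    def step(self, observation: float) -> float:
        error = observation - self.estimate  # prediction-reality mismatch
        if abs(error) > self.surprise:
            self.memory.append(observation)  # remember only surprising events
        self.estimate += self.lr * error     # test-time update
        return error

model = AlwaysOnModel()
errors = [model.step(x) for x in [4.0, 4.0, 4.0, 4.0]]
print(errors, model.estimate, list(model.memory))
```

On a constant stream the error halves each step, so the model stops writing to memory once it is no longer surprised.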
Inference scaling may have 3 orthogonal axes:
1️⃣ Larger multimodal models.
2️⃣ Recursive latent compression/folding.
3️⃣ Test-time rollout, search, self-consistency, continuous refinement.
Data bottleneck:
Egocentric trajectories, YouTube-scale multi-view video, action-conditioned interaction logs.
Potentially ~50× more high-quality action data needed for the next phase transition.
🔹 From Tokens to Robots Fireside
• VLAs and LLMs are both sequence models; robot tokens correspond to actions, states, and trajectories.
• Action spaces become robotics' version of function calling.
• World models optimize action-conditioned transition prediction rather than behavior imitation.
• RL adds critics/value functions for selecting among imagined futures.
• Failure trajectories remain valuable training data.
• Calibration may matter more than raw accuracy.
• Contact-rich interaction remains robotics' hardest challenge.
• Robotics lacks a Chinchilla-style scaling law relating data, model size, compute, and downstream performance.
• World models may become evaluation engines before policy engines.
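The "critic selects among imagined futures" idea from the list above, as a toy: enumerate short action sequences, roll each through an action-conditioned transition model, and keep the one the critic scores highest. The 1-D `world_model`, the distance-based `critic`, and exhaustive search are assumptions; real systems would use learned models and sampled rollouts.

```python
from itertools import product
from typing import Tuple

ACTIONS = (-1, 0, 1)

def world_model(state: int, action: int) -> int:
    # Toy action-conditioned transition: next state = state + action.
    return state + action

def critic(state: int, goal: int) -> int:
    # Toy value function: closer to the goal is better.
    return -abs(goal - state)

def plan(state: int, goal: int, horizon: int = 3) -> Tuple[Tuple[int, ...], int]:
    """Score every imagined future of length `horizon`; return the best sequence and value."""
    best_seq, best_value = None, float("-inf")
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = world_model(s, a)  # imagine, do not execute
        value = critic(s, goal)
        if value > best_value:     # strict '>' keeps the first best sequence found
            best_seq, best_value = seq, value
    return best_seq, best_value

print(plan(0, 3))
```

The same loop doubles as an evaluation engine: feed it a fixed policy's action sequence instead of a search, and the critic scores it without touching a robot.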
🎯 Takeaway
Active Perception (SaPaVe) → Visual Foresight (ForeAct) → World Models (WoW) → Agentic Orchestration (Maestro)
with continuous loops of:
Perceive ↔ Imagine ↔ Predict ↔ Act ↔ Revise
The open challenge remains unifying perception, memory, planning, control, causal representation learning, diffusion MPC, and action-conditioned world modeling into a stable long-horizon embodied intelligence stack that follows a predictable scaling law.