🌌 At @saturdayrobotic Saturday Robotics Research Night, we hosted @mli0603 Zhaoshuo Li (Robotics & World Model Tech Lead, @NVIDIAAI Cosmos) for a lightning talk on Cosmos 3.
Cosmos 3 is a unified omnimodal world model built on a Mixture-of-Transformers (MoT) backbone, with parallel Autoregressive + Diffusion pathways connected via cross-attention. One model jointly understands & generates Language, Image, Video, Audio, and Action with flexible I/O.
It effectively subsumes:
👁️ VLMs
🎥 Video Generators
🔊 Audio Generators
🌍 World Simulators
🤖 World-Action Models
🎮 Robot Policy Models
Single backbone supports:
• Vision Reasoning
• Image Generation
• Audio-Visual Generation
• Robot Policy Control
• Forward Dynamics
• Inverse Dynamics
Vision reasoning grounds language in spatial relations, temporal evolution, object states, and actions.
Forward Dynamics:
(obs + controls) → future video rollouts for planning, evaluation, and synthetic data generation.
Inverse Dynamics:
(video) → trajectories/actions explaining observed state transitions.
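To make the two directions concrete, here is a minimal toy sketch in Python. It is not the Cosmos 3 API: the names `forward_dynamics` and `inverse_dynamics`, the 2D point state, and the additive dynamics are all illustrative assumptions standing in for video observations and robot controls.

```python
from typing import List, Tuple

State = Tuple[float, float]   # toy 2D position standing in for an observation (assumption)
Action = Tuple[float, float]  # toy 2D displacement standing in for a control (assumption)


def forward_dynamics(obs: State, controls: List[Action]) -> List[State]:
    """(obs + controls) -> rollout of future states, starting with obs itself."""
    rollout = [obs]
    for dx, dy in controls:
        x, y = rollout[-1]
        rollout.append((x + dx, y + dy))
    return rollout


def inverse_dynamics(states: List[State]) -> List[Action]:
    """Observed states -> the actions that explain each state transition."""
    return [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(states, states[1:])
    ]


controls = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]
rollout = forward_dynamics((0.0, 0.0), controls)
recovered = inverse_dynamics(rollout)
```

In this toy setting the two functions are exact inverses, so `recovered` matches `controls`. With real video the inverse problem is only approximately solvable, which is why it takes a learned model.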
🍿 Popcorn demo:
0.3–3.4s pick cup
3.4–14.8s stabilize cup → insert scoop → scoop twice → transfer popcorn while maintaining alignment
14.8–18.7s place cup → return scoop → retract arms
Not frame captioning: the model temporally segments the manipulation into physically meaningful subgoals.
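One way to picture that output is as a list of timed segments rather than per-frame captions. The sketch below is a hypothetical representation: the `Segment` class and its field names are assumptions, and only the timestamps and subgoal labels come from the demo.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start_s: float
    end_s: float
    subgoals: List[str]

    @property
    def duration_s(self) -> float:
        # Round to the 0.1 s resolution used by the demo timestamps.
        return round(self.end_s - self.start_s, 1)


popcorn_demo = [
    Segment(0.3, 3.4, ["pick cup"]),
    Segment(3.4, 14.8, [
        "stabilize cup",
        "insert scoop",
        "scoop twice",
        "transfer popcorn while maintaining alignment",
    ]),
    Segment(14.8, 18.7, ["place cup", "return scoop", "retract arms"]),
]


def is_contiguous(segments: List[Segment]) -> bool:
    """True when each segment starts exactly where the previous one ends."""
    return all(a.end_s == b.start_s for a, b in zip(segments, segments[1:]))


durations = [seg.duration_s for seg in popcorn_demo]
```

The three segments tile the clip with no gaps or overlaps, and the middle one, which holds most of the manipulation, takes up most of the clip.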
Forward Dynamics demo:
camera observation (blue point-cloud-like representation) + hand pose (green skeletal hands) → physically plausible future interaction rollouts respecting object dynamics.
Inverse Dynamics demo:
robot manipulation video → articulated 3D trajectories recovered from observed pixel changes.
🔥 Most impressive: Cosmos 3 Omni Block.
Prompt:
“pick the Cosmos 3 Omni block from bottom drawer and place it on counter”
The model first performs explicit spatial grounding:
gripper(514,769)
block(471,780)
drawer(400,760)
counter(460,310)
while identifying distractors:
forklift, white truck, white SUV, quadruped robot, Physical AI Builder figure.
It then generates structured reasoning + pixel-space action outputs:
[514,769] approach block
[507,783] grasp block
[500,471] lift from drawer
[464,278] move to counter
[460,275] place on counter
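For downstream use, output like this would need to be parsed into structured waypoints. The sketch below assumes the text format is exactly as shown on the slide (`[x,y] label`, one waypoint per line). The parser, the regex, and the names `parse_waypoints` and `GROUNDING` are illustrative, not part of any released Cosmos tooling. Whether the coordinates are raw pixels or normalized image coordinates was not stated, so they are treated as opaque integers.

```python
import re
from typing import List, Tuple

Waypoint = Tuple[int, int, str]  # (x, y, label); coordinate convention is an assumption

# Grounding and action outputs copied from the demo.
GROUNDING = {
    "gripper": (514, 769),
    "block": (471, 780),
    "drawer": (400, 760),
    "counter": (460, 310),
}

ACTION_OUTPUT = """\
[514,769] approach block
[507,783] grasp block
[500,471] lift from drawer
[464,278] move to counter
[460,275] place on counter
"""

LINE_RE = re.compile(r"^\[(\d+),\s*(\d+)\]\s+(.+)$")


def parse_waypoints(text: str) -> List[Waypoint]:
    """Parse '[x,y] label' lines into (x, y, label) tuples, skipping other lines."""
    waypoints = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            x, y, label = match.groups()
            waypoints.append((int(x), int(y), label))
    return waypoints


waypoints = parse_waypoints(ACTION_OUTPUT)

# The first waypoint coincides with the grounded gripper location.
starts_at_gripper = waypoints[0][:2] == GROUNDING["gripper"]

# Per-step displacement between consecutive waypoints.
steps = [
    (x1 - x0, y1 - y0)
    for (x0, y0, _), (x1, y1, _) in zip(waypoints, waypoints[1:])
]
```

The trajectory starts at the grounded gripper position, and the largest single displacement is the lift out of the drawer.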
A second, far more cluttered scene (multiple robot arms, excavators, vehicles, and the same drawer) gets the identical prompt, and the model produces analogous trajectories after grounding the relevant objects and free-space regions.
Cosmos 3 positions omnimodal world models as a scalable foundation for embodied agents, jointly performing understanding, generation, simulation, reasoning, and control inside a single architecture.
It achieves SoTA across diverse understanding & generation benchmarks, and NVIDIA is releasing the full stack: code, checkpoints, curated synthetic datasets, and evaluation benchmarks.
Cosmos 3 = a unified world-action engine.