📖Robotics World Model Reading Club #01 Summary
@BostonDynamics,
@Stanford,
@AGIBOTofficial,
@intbotai,
@BytedanceTalk,
@Google,
@moonlake,
@Rivian,
@Meta,
@Samsung,
@UCBerkeley,
@Cruise,
@encord_team,
@ManycoreTech,
@OpenGraph_Labs,
@neuralmotion,
@AMD,
@nvidia,
@oysterecosystem,
@Zoom,
@FusionFundVC,
@BoostVC,
@yzilabs...
Policy learning→world model (WM)
VLA (vision-language-action): observation→action
WAM (world action model): latent world→future trajectory→controllable action
→Shift=reactive mapping→controllable simulation
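A toy sketch of the contrast above, assuming a 1-D point robot. Every name here (`reactive_policy`, `world_model`, `wam_plan`) is hypothetical and the "world model" is a hand-written integrator, not a learned latent model. It only shows the structural difference: a reactive map returns an action directly, while a WAM-style planner imagines futures first and then picks an action.

```python
import random

def reactive_policy(obs, goal):
    """VLA-style: map the current observation straight to a clipped action."""
    return max(-1.0, min(1.0, goal - obs))

def world_model(state, action):
    """Hypothetical dynamics (assumption): plain 1-D integration."""
    return state + 0.5 * action

def wam_plan(state, goal, horizon=5, n_candidates=64, seed=0):
    """WAM-style: roll candidate action sequences through the world model,
    score each imagined trajectory, return the best first action and its cost."""
    rng = random.Random(seed)
    best_cost, best_seq = float("inf"), None
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, cost = state, 0.0
        for a in seq:
            s = world_model(s, a)      # imagined next state
            cost += (s - goal) ** 2    # distance to goal along the rollout
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq[0], best_cost
```

Random shooting is used only because it is the shortest planner to write; any trajectory optimizer could sit in its place.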
@nvidia GR00T (7B, high memory efficiency on Thor)≈DreamDojo-style WAM. Bottleneck is NOT scale, but a missing unified interface across perception–geometry–physics–action.
🧠 Representation
Pixel space is redundant & non-geometric.
Trend→Explicit 3D backbone:
point cloud/mesh
object/sub-object representations
geometry-aware tracking (contact, affordance)
Point-flow pipeline:
detect→sample keypoint→track→dynamic graph
Core tradeoff=which points & at what density (chosen by motion saliency/affordance attention)
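A minimal sketch of the last two stages of the pipeline above (select points, build a dynamic graph), assuming detection and tracking already produced per-point 3D tracks. The function names, the displacement-based saliency score, and the fixed-radius edge rule are all illustrative assumptions, not a method from the session.

```python
import math

def motion_saliency(track):
    """Total path length of one tracked point across frames."""
    return sum(math.dist(a, b) for a, b in zip(track, track[1:]))

def select_keypoints(tracks, k):
    """Keep the k points that move the most (assumed saliency criterion)."""
    ranked = sorted(tracks, key=lambda pid: motion_saliency(tracks[pid]),
                    reverse=True)
    return ranked[:k]

def dynamic_graph(tracks, ids, radius):
    """One edge set per frame: connect points closer than `radius`."""
    n_frames = len(tracks[ids[0]])
    graphs = []
    for t in range(n_frames):
        edges = {(a, b)
                 for i, a in enumerate(ids) for b in ids[i + 1:]
                 if math.dist(tracks[a][t], tracks[b][t]) <= radius}
        graphs.append(edges)
    return graphs

# Made-up tracks: a handle moving toward a lid, plus a static table point.
tracks = {
    "handle": [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)],
    "lid":    [(0.3, 0.0, 0.0), (0.3, 0.0, 0.01), (0.3, 0.0, 0.02)],
    "table":  [(2.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
}
ids = select_keypoints(tracks, k=2)
graphs = dynamic_graph(tracks, ids, radius=0.25)
```

Here `k` and `radius` are the two knobs behind the tradeoff: more points and a larger radius give a denser graph at higher cost.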
🌍 4D Reconstruction→Unified Latent
@GoogleDeepMind D4RT encodes video→temporally consistent latent field:
geometry, motion, visibility unified
Outputs: point clouds, 3D tracks, full reconstruction (300× faster)
❗Gap: no shared latent across:
vision/geometry/semantics/action/physics
⚙️ Physics Gap (Sim2Real)
Gap=physics, not vision:
discontinuous contact
deformable objects (∞ DoF)
non-differentiable friction
Engineering failure modes: brittle collision meshes, unstable contact
Solutions:
learned physics proxy
hybrid pipeline
convex decomposition (geometry → collision proxy, ~5× speedup)
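A sketch of the simplest piece of the geometry→collision proxy idea: replacing one part's point set with its convex hull. This is a 2-D monotone-chain hull in plain Python, chosen for brevity. It is not a full convex decomposition, which would first split a concave shape into several parts and hull each one; the function names are illustrative.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make a right turn or lie on a straight edge.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# Six mesh vertices collapse to a four-vertex collision proxy.
proxy = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)])
```

The proxy has fewer vertices than the source geometry and is convex, which is what makes collision queries cheaper and more stable. The price is that a single hull overestimates concave shapes, hence the decomposition step.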
🎥 Video Pretrain≠Interaction
Video=strong prior but no counterfactuals
Missing: force, depth, tactile, proprioception
→can't answer: what if the robot had acted differently?
⏱️ Control≠Inference
Real world=high-freq loop
action chunking
latent action
FastWAM (train with rollout, infer without)
KV-cache (AutoGaze)
👉control selects feasible trajectory, not full future modeling
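A minimal sketch of action chunking as listed above, assuming a slow policy that returns several future actions per call and a fast control loop that consumes one action per tick. `run_control_loop` and `toy_policy` are hypothetical names, and the policy is a stand-in, not a real model.

```python
def run_control_loop(policy, n_steps, chunk_size):
    """Call the slow policy only when the action buffer is empty;
    execute exactly one buffered action per control tick."""
    executed, buffer, inference_calls = [], [], 0
    for t in range(n_steps):
        if not buffer:
            buffer = list(policy(t, chunk_size))  # one inference, many actions
            inference_calls += 1
        executed.append(buffer.pop(0))
    return executed, inference_calls

def toy_policy(t, chunk_size):
    """Stand-in policy (assumption): the 'action' is just the tick index."""
    return [t + i for i in range(chunk_size)]

executed, calls = run_control_loop(toy_policy, n_steps=10, chunk_size=4)
```

Ten control ticks cost three inference calls here. Larger chunks lower inference load but make the robot react later to new observations, which is the mismatch this section describes.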
Thor is good, but LLM scaling≠robotics scaling
📉 Data
No “robotics internet”:
sim/video/teleop/factory logs fragmented
no unified labeling or metrics
Reality:
factories use fixed primitives
generalization often unnecessary
Bitter lesson: data flywheel>pipelines (but robotics lacks one)
🦾 Embodiment Gap
manipulation→full-body intelligence
loco-manipulation + gaze coordination
Need cross-embodiment alignment (space, action, kinematics)
🔁 Sim2Real Pipeline
human data→semantics→geometry→collision proxy→sim→fine-tuning
Unsolved: deformables, contact stability, long horizon
🧩 Papers
VQVAE (discrete latent)
VL-JEPA (predictive alignment)
token pruning (efficiency)
recursive models (depth reuse)
multi-path exploration (GRPO)
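For the discrete-latent item in the list above, a sketch of the quantization step only: a continuous encoder output is snapped to its nearest codebook entry, and the entry's index becomes the discrete token. The codebook values are made up, `quantize` is an illustrative name, and the training side of a VQ-VAE (straight-through gradients, commitment loss) is left out.

```python
def quantize(z, codebook):
    """Return (index, vector) of the codebook entry nearest to z (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))
    return idx, codebook[idx]

codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]  # made-up entries
token, z_q = quantize((0.9, 1.2), codebook)
```

A downstream sequence model then operates on `token` indices instead of raw continuous features.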
⚡ Infra→SLM
Real-time stack (LLM infra too slow)
→WM must compress into SLMs
Future=small, domain-specialized, grounded models
🧪Bottlenecks
no unified representation
no data flywheel
inference–control mismatch
physics
fragmented embodiment
Reality can't be scraped like internet.
It must be sensed, interacted with, simulated.
👉 Goal: jointly optimize representation, simulation, and action under physics constraints
💡minimal sufficient representation?
can video DiT become WAM?
vertical SLM inevitable?
robotics ImageNet moment?