Associate Professor @UTCompSci | Director @NVIDIAAI Co-Leading GEAR | CS PhD @Stanford | Building generalist robot autonomy in the wild | Opinions are my own

Joined August 2008
68 Photos and videos
Exciting news on GR00T: NVIDIA announces our first open humanoid robot platform, featuring Unitree H2 Plus and Sharpa hands, to accelerate academic research and facilitate cross-institutional collaboration. R&D in humanoid robotics needs broader participation. Open science is how we build the future faster, together.
NVIDIA announces the first open humanoid robot reference design built for robotics research. The NVIDIA Isaac GR00T Reference Humanoid Robot combines the @UnitreeRobotics H2 humanoid robot, @SharpaRobotics Wave five-fingered hands for dexterous manipulation, Jetson Thor onboard compute, and Isaac GR00T open software and models, giving researchers a full-stack platform from data capture to model deployment. Read the #NVIDIAGTC Taipei announcement: nvda.ws/4ef9VOr
3
14
109
14,606
I will be in Vienna in two weeks to give a keynote at #ICRA2026. I'll share our recent progress on building generalist humanoid robots and show some of the latest results. Check out my talk on June 3: 2026.ieee-icra.org/program/k…
3
6
101
5,506
Yuke Zhu retweeted
Now you can use GR00T N1.7 and SONIC together to enable tasks that require TRUE whole-body coordination!! Including simultaneous precise hand and foot placement, like opening a trash can with the foot pedal and throwing an object inside! Try it yourself, it is so fun!
Open-sourcing the whole package here! The last piece of our SONIC open-source, data collection, gr00t VLA post-training, inference just hit the repo! Train your Autonomous policies on G1 Whole-body with SONIC and gr00t N1.7! 🧑‍💻Code: github.com/NVlabs/GR00T-Whol… 📑Docs: nvlabs.github.io/GR00T-Whole…
10
35
9,049
Yuke Zhu retweeted
GR00T-VisualSim2Real is now open source! VIRAL and DoorMan are now available with training code, simulation assets, and the full recipe for bringing visual sim-to-real loco-manipulation skills to your own humanoids. Repo: github.com/NVlabs/GR00T-Visu…
20 Nov 2025
Zero teleoperation. Zero real-world data. ➔ Autonomous humanoid loco-manipulation in reality. Introducing VIRAL: Visual Sim-to-Real at Scale. We achieved 54 autonomous cycles (walk, stand, place, pick, turn) using a simple recipe: 1. RL 2. Simulation 3. GPUs Website: viral-humanoid.github.io/ Arxiv: arxiv.org/abs/2511.15200 Deep dive with me: 🧵
6
97
618
116,823
Yuke Zhu retweeted
What is missing to bring real-time motion research into AAA games and real-world robotics? We present MotionBricks, a step toward bridging this gap with two key components: - a single generative latent motion backbone covering 350,000 motion skills, running at 15,000 FPS with 2 ms latency and substantially improved quality and reliability. - a unified smart primitive interface for locomotion, object / scene interaction, with fine-grained control over generated behaviors. Webpage: nvlabs.github.io/motionbrick… Code: github.com/NVlabs/GR00T-Whol… Paper: arxiv.org/abs/2604.24833 (ACM TOG / SIGGRAPH 2026)
27
150
1,197
151,738
Yuke Zhu retweeted
🤖Co-training is everywhere (sim↔real[e.g. GR00T, LBM], human↔robot[e.g. PI, EgoScale], even non-robot data[e.g. PI, LBM). But why does it work? How can we improve it further? Taking sim-and-real imitation learning in diffusion/ flow-based models as the test bed, we performed a rigorous mechanistic analysis, drawing on theoretical insights and multi-layered experiments. 😮Key insight: it’s all about representations. - Alignment → enables transfer - Discernibility → enables adaptation ⚖️Both are necessary — it's better to have more aligned representations, but the model must be able to discern the domains. We term this as structured representation alignment. ⬇️Let’s take a deep dive into that: Paper: arxiv.org/pdf/2604.13645 Website: science-of-co-training.githu…
5
66
385
62,625
We've just open-sourced the SONIC training code, the training data, the algorithms used to generate the data, and more to come. You now have the full recipe for building SONIC whole-body controllers for your own humanoids. Enjoy! Code: github.com/NVlabs/GR00T-Whol… Web Demo: nvlabs.github.io/GEAR-SONIC/…
SONIC is now open-source! Generalist whole-body teleoperation for EVERYONE! Our team has long been building comprehensive pipelines for whole-body control, kinematic planner, and teleoperation, and they will all be shared. This will be a continuous update; inference code model already there, training code and gr00t integration coming soon! Code: github.com/NVlabs/GR00T-Whol… Docs: nvlabs.github.io/GR00T-Whole… Site: nvlabs.github.io/GEAR-SONIC/
5
27
174
16,822
Yuke Zhu retweeted
Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵
20
126
632
168,796
Yuke Zhu retweeted
Catastrophic forgetting has long been a challenge in continual learning. However, our new study found that pretrained Vision-Language-Action (VLA) models are surprisingly resistant to forgetting! Zero forgetting, or even positive backward transfer, is possible with simple experience replay. arxiv.org/abs/2603.03818
18
108
667
53,671
Today, we publicly released RoboCasa365, a large-scale simulation benchmark for training and systematically evaluating generalist robot models. Built upon our original RoboCasa framework, it offers: • 2,500 realistic kitchen environments; • 365 everyday tasks (basic skills long-horizon mobile manipulation); • Over 3,200 objects with many articulated fixtures/appliances. All are designed for fully controlled, reproducible benchmarking of robotic policies. Progress in robotic foundation models is real. But it’s still hard to answer basic questions like: How close are we to general-purpose autonomy? What factors drive generalization? What are the model/data scaling curves like? Real-world eval is slow and noisy, and existing sims (like LIBERO, which we built 3 years ago) often lack sufficient task and scene diversity. This benchmark comes with 2,200 hours of demonstrations and 500K trajectories to support studies of multi-task training, pretraining, and continual learning at scale. Check it out at robocasa.ai
13
63
340
23,644
CoRL is coming to Austin, TX this November! As General Chair, I'm thrilled to welcome the robot learning community. 2026 feels like a pivotal year as AI-powered robotic systems begin deploying at scale for real-world tasks. This year, I hope CoRL will be the forum that connects cutting-edge research with industrial practice. Please submit your best work and join us in Austin. DM me what you'd love to see CoRL do better! corl.org/
Calling all researchers! 🤖The CoRL 2026 website is officially live at corl.org with key dates for your submissions: 🗓 May 25: Abstract Submission 🗓 May 28: Full Paper Submission 🗓 Nov 9-12: Conference in Austin, TX Send us your coolest work! #RobotLearning
5
7
165
15,837
Yuke Zhu retweeted
We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, fold/roll shirts, all learned primarily from 20,000 hours of egocentric human video with no robot in the loop. Humans are the most scalable embodiment on the planet. We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate. Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans. Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution. Our recipe is called "EgoScale": - Pre-train GR00T N1.5 on 20K hours of human video, mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks. - Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency. - Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30% gains over training on G1 data alone. The scalable path to robot dexterity was never more robots. It was always us. Deep dives in thread:
150
283
1,772
291,298
Yuke Zhu retweeted
Announcing DreamDojo: our open-source, interactive world model that takes robot motor controls and generates the future in pixels. No engine, no meshes, no hand-authored dynamics. It's Simulation 2.0. Time for robotics to take the bitter lesson pill. Real-world robot learning is bottlenecked by time, wear, safety, and resets. If we want Physical AI to move at pretraining speed, we need a simulator that adapts to pretraining scale with as little human engineering as possible. Our key insights: (1) human egocentric videos are a scalable source of first-person physics; (2) latent actions make them "robot-readable" across different hardware; (3) real-time inference unlocks live teleop, policy eval, and test-time planning *inside* a dream. We pre-train on 44K hours of human videos: cheap, abundant, and collected with zero robot-in-the-loop. Humans have already explored the combinatorics: we grasp, pour, fold, assemble, fail, retry—across cluttered scenes, shifting viewpoints, changing light, and hour-long task chains—at a scale no robot fleet could match. The missing piece: these videos have no action labels. So we introduce latent actions: a unified representation inferred directly from videos that captures "what changed between world states" without knowing the underlying hardware. This lets us train on any first-person video as if it came with motor commands attached. As a result, DreamDojo generalizes zero-shot to objects and environments never seen in any robot training set, because humans saw them first. Next, we post-train onto each robot to fit its specific hardware. Think of it as separating "how the world looks and behaves" from "how this particular robot actuates." The base model follows the general physical rules, then "snaps onto" the robot's unique mechanics. It's kind of like loading a new character and scene assets into Unreal Engine, but done through gradient descent and generalizes far beyond the post-training dataset. A world simulator is only useful if it runs fast enough to close the loop. We train a real-time version of DreamDojo that runs at 10 FPS, stable for over a minute of continuous rollout. This unlocks exciting possibilities: - Live teleoperation *inside* a dream. Connect a VR controller, stream actions into DreamDojo, and teleop a virtual robot in real time. We demo this on Unitree G1 with a PICO headset and one RTX 5090. - Policy evaluation. You can benchmark a policy checkpoint in DreamDojo instead of the real world. The simulated success rates strongly correlate with real-world results - accurate enough to rank checkpoints without burning a single motor. - Model-based planning. Sample multiple action proposals → simulate them all in parallel → pick the best future. Gains 17% real-world success out of the box on a fruit packing task. We open-source everything!! Weights, code, post-training dataset, eval set, and whitepaper with tons of details to reproduce. DreamDojo is based on NVIDIA Cosmos, which is open-weight too. 2026 is the year of World Models for physical AI. We want you to build with us. Happy scaling! Links in thread:
81
176
1,234
208,690
We have seen rapid progress in humanoid control — specialist robots can reliably generate agile, acrobatic, but preset motions. Our singular focus this year: putting generalist humanoids to do real work. To progress toward this goal, we developed SONIC (nvlabs.github.io/GEAR-SONIC/), a Behavior Foundation Model for real-time, whole-body motion generation that supports teleoperation and VLA inference for loco-manipulation. Today, we’re open-sourcing SONIC on GitHub. We are excited to see what the community builds upon SONIC and to collectively push humanoid intelligence toward real-world deployment at scale. 🌐 Paper: arxiv.org/abs/2511.07820 📃 Code: github.com/NVlabs/GR00T-Whol…
11
67
353
67,356
Yuke Zhu retweeted
🚀 Introducing CHIP: Adaptive Compliance for Humanoid Control through Hindsight Perturbation! Current humanoids face a trade-off: they are either Agile & Stiff OR Slow & Soft. CHIP breaks this barrier. We enable on-the-fly switching between Compliant (wiping 🧼, collaborative holding 📦) and Stiff (lifting dumbbells 🏋️, opening doors 🚪💪) behaviors—all while maintaining agile skills like running! 🏃💨 Website: nvlabs.github.io/CHIP/ Join me for a deep dive on how CHIP enables adaptive control for complex tasks. 🧵↓
10
51
216
25,453
📢 New paper from GEAR team @NVIDIARobotics We released DreamZero, a World Action Model that turns video world models into zero-shot robot policies. Built on a pretrained video diffusion backbone, it jointly predicts future video frames and actions. 🌐 dreamzero0.github.io/

Introducing DreamZero 🤖🌎 from @nvidia > A 14B “World Action Model” that achieves zero-shot generalization to unseen tasks & few-shot adaptation to new robots > The key? Jointly predicting video & actions in the same diffusion forward pass Project Page: dreamzero0.github.io/ 🧵 (1/10)
8
102
8,402
Yuke Zhu retweeted
Reality of robotics: humanoid kung fu is solved before they can open doors with RGB. Here we are. Introducing the frontier of sim2real at NVIDIA GEAR. 100% sim data. RGB input only. Code name: 𝗗𝗼𝗼𝗿𝗠𝗮𝗻. We are opening the sim-to-real door. doorman-humanoid.github.io 🧵
14
80
511
370,478
20 Nov 2025
Robustness is key for real-world robot deployment — and RL is key to robustness. Proud of our work scaling GPU-based simulations and vision-based sim2real to build robust policies for humanoid loco-manipulation tasks.
20 Nov 2025
Zero teleoperation. Zero real-world data. ➔ Autonomous humanoid loco-manipulation in reality. Introducing VIRAL: Visual Sim-to-Real at Scale. We achieved 54 autonomous cycles (walk, stand, place, pick, turn) using a simple recipe: 1. RL 2. Simulation 3. GPUs Website: viral-humanoid.github.io/ Arxiv: arxiv.org/abs/2511.15200 Deep dive with me: 🧵
3
10
74
12,100
1 Nov 2025
Had a blast visiting @CMU_Robotics and gave a talk at the RI Seminar today, where I briefly mentioned our new work, PLD, on self-improving VLAs. It achieved 99.2% on LIBERO and a one-hour continuous execution of GPU assembly with a 100% success rate. Check this out!
31 Oct 2025
What if robots could improve themselves by learning from their own failures in the real-world? Introducing 𝗣𝗟𝗗 (𝗣𝗿𝗼𝗯𝗲, 𝗟𝗲𝗮𝗿𝗻, 𝗗𝗶𝘀𝘁𝗶𝗹𝗹) — a recipe that enables Vision-Language-Action (VLA) models to self-improve for high-precision manipulation tasks. PLD couples real-world residual reinforcement learning with standard supervised fine-tuning — letting robots discover, recover, and distill their own data flywheel. Quick 🧵
5
27
224
25,442