Jim Fan

Tweets

Pinned Tweet

Jim Fan

@DrJimFan

May 8

I promise this will be the best 20 min you spend today! Robotics: Endgame, the sequel to last year's Sequoia AI Ascent talk, "Physical Turing Test". I laid out the roadmap for solving Physical AGI as a simple parallel to the LLM success story. Be a good scientist, copy homework ;) And stay till the end, more easter eggs and predictions for your polymarket!

00:30 DGX-1 origin story at OpenAI, I was there in 2016 signing with Jensen and Elon. Heading to the Computer History Museum!
01:42 The Great Parallel
03:31 Robotics, the Endgame
03:39 Why VLAs fall short
04:32 Video world models as the 2nd pretraining paradigm
06:09 World Action Models (WAM)
07:46 Strategies for robot data collection and the FSD equivalent to physical data flywheel for robot manipulation
11:06 EgoScale and the Dexterity Scaling Law we discovered recently
14:00 Physical RL: bridging the last mile
15:39 DreamDojo: an end-to-end neural physics engine for scaling RL in silico
17:00 Civilizational Technology Tree and my predictions for the near future. Spoiler: it's closer than you think.

Thanks to my friends at Sequoia for inviting me back to AI Ascent this year! I had a blast! Last year's talk is attached in the thread if you missed it.

Jim Fan

@DrJimFan

Jun 5

NitroGen just won CVPR Best Paper Honorable Mention!! We are making strides towards general-purpose embodied agents that master not only real-world physics, but also all possible physics across a multiverse of simulations. It’s been 4 years since MineDojo, our first embodied agent in Minecraft, won NeurIPS Best Paper. Congrats to everyone on the team!!

Jim Fan

@DrJimFan

Jun 5

Check out the NVIDIA blog!

NVIDIA

@nvidia

Jun 3

This week at #CVPR2026, NVIDIA Research is presenting three papers across physical AI that offer groundbreaking solutions for training at scale across diverse applications:

→ GraspGen-X: the first foundation model for zero-shot grasping, trained on billions of simulated grasps
→ LCDrive: a model that replaces expensive text-based reasoning with compact latent representations
→ NitroGen: a generalized gameplay AI foundation model that harnesses NVIDIA Isaac GR00T to help train embodied agents

Learn more: nvda.ws/4ubwjgk

Jim Fan retweeted

CyberRobo

@CyberRobooo

May 9

Mark:

1/ First milestone: the Physical Turing Test. You literally can’t tell if a human or robot is doing the task.
2/ Next: Physical API. A fleet of robots, configured like software via APIs & CLI.
3/ Final stop: Physical Auto Research. Robots design, improve, and build the next generation of themselves--far beyond human capability.

--

If you believe in robotics, robotics will believe in you.

Jim Fan

@DrJimFan

May 8

Jim Fan retweeted

Alfred Lin

@Alfred_Lin

May 8

Jim is always a crowd favorite at AI Ascent. His ability to simplify the latest research into a clear "what and why it matters" while adding humor along the way is unmatched. If you're interested in physical AI, this 20 minutes is a must watch.

Jim Fan

@DrJimFan

May 8

Jim Fan retweeted

Sonya Huang 🐥

@sonyatweetybird

May 8

Our crowd favorite from last year’s AI Ascent is back for round 2… this time: Robotics The Endgame ♟️ thank you for dazzling us @DrJimFan ! You can see the forest from the trees and are quite the entertaining speaker — a mini Jensen in the making :)

Jim Fan

@DrJimFan

May 8

Jim Fan

@DrJimFan

May 8

The Physical Turing Test, May 2025 at Sequoia AI Ascent youtube.com/watch?v=_2NijXqB…

The Physical Turing Test: Jim Fan on Nvidia's Roadmap for Embodied AI

Nvidia's Director of AI Jim Fan introduces the concept of the Physi...

youtube.com

Jim Fan

@DrJimFan

May 8

Robotics: Endgame on YouTube youtube.com/watch?v=3Y8aq_of…

Jim Fan

@DrJimFan

Apr 1

The power of the Claw, in the palm of a robot hand. Agentic robotics is here! Today, we open-source CaP-X: vibe agents, alive in the physical world. They incarnate as robot arms and humanoids with a rich set of perception APIs and actuation APIs, and they auto-synthesize skill libraries as they go. CaP-X is a strict superset of our old stack, because policies like VLAs are “just” API calls as well. It solves many tasks zero-shot that a learned policy would struggle with.

And we are doing much more than vibing. CaP-X is our most systematic, scientific study on agentic robotics so far:

- We build a comprehensive agentic toolkit: perception (SAM3 segmentation, Molmo pointing, depth, point cloud), control (IK solvers, grasp planner, navigation), and visualization (EEF, mask overlays) that work across different robots.
- CaP-Gym: LLM’s first Physical Exam! 187 manipulation tasks across RoboSuite, LIBERO-PRO, and BEHAVIOR. Tabletop, bimanual, mobile manipulation. Sim and real. Can’t wait to see the gradients flow from CaP-Gym to the next wave of frontier LLM releases.
- CaP-Bench: we benchmark 12 frontier LLMs/VLMs (Gemini, GPT, Opus, Qwen, DeepSeek, Kimi, and more) across 8 evaluation tiers. We systematically vary API abstraction level, agentic harness, and visual grounding methods. Lots of insights in our paper.
- CaP-Agent0: a training-free agentic harness that matches or exceeds human expert code on 4 out of 7 tasks without task-specific tuning.
- CaP-RL: if you get a gym, you get RL ;). A 7B OSS model jumps from 20% to 72% success after only 50 training iterations. The synthesized programs transfer to real robots with minimal sim-to-real gap.

3 years ago, our team created Voyager, one of the earliest agentic AIs that plays and learns in Minecraft continuously. Its key ideas (skill libraries, self-reflection loops, and in-context planning) have since influenced many modern agentic designs. Today, the agent graduates from Minecraft and gets a real job.

It’s April Fool’s, but this Claw is getting its hands dirty for real! Link in thread:
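To make the "skill library" idea concrete, here is a minimal sketch of how perception and actuation calls could be registered as plain functions and then composed into a new skill at runtime. Every name in it (`SkillLibrary`, `detect`, `move_to`, `close_gripper`, `pick`) is hypothetical and chosen for illustration; this is not the CaP-X API, and the primitives are stubs that only record what they were asked to do.

```python
from typing import Callable, Dict, List


class SkillLibrary:
    """Hypothetical registry: every capability is just a named callable."""

    def __init__(self) -> None:
        self._skills: Dict[str, Callable[..., object]] = {}

    def register(self, name: str, fn: Callable[..., object]) -> None:
        self._skills[name] = fn

    def call(self, name: str, *args, **kwargs):
        return self._skills[name](*args, **kwargs)

    def names(self) -> List[str]:
        return sorted(self._skills)


log: List[str] = []  # stands in for real robot side effects
lib = SkillLibrary()

# Stub primitives (assumptions, not real perception or control code).
lib.register("detect", lambda obj: {"object": obj, "xyz": (0.4, 0.0, 0.1)})
lib.register("move_to", lambda xyz: log.append(f"move_to{xyz}"))
lib.register("close_gripper", lambda: log.append("close_gripper"))


def pick(obj: str) -> str:
    """A composed skill, as an agent might synthesize it from primitives."""
    target = lib.call("detect", obj)
    lib.call("move_to", target["xyz"])
    lib.call("close_gripper")
    return target["object"]


# The new skill goes back into the library and is callable like any primitive.
lib.register("pick", pick)
picked = lib.call("pick", "red_cube")
```

In this framing a learned policy would be one more entry in the registry, which is one way to read the claim that policies are "just" API calls.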

Jim Fan

@DrJimFan

Apr 1

As usual, we open-source everything, MIT license: capgym.github.io
Code: github.com/capgym/cap-x
Paper: arxiv.org/abs/2603.22435

CaP-X is brought to you by NVIDIA, Berkeley, Stanford, and CMU. I'd like to thank the legend @Ken_Goldberg who co-advised the work, and the team who poured their hearts into it!

Jim Fan

@DrJimFan

Apr 1

Please check out lead author @letian_fu's deep dive thread! x.com/letian_fu/status/20393…

Max Fu

@letian_fu

Apr 1

Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵

Jim Fan

@DrJimFan

Mar 24

This is pure nightmare fuel. Identity theft of the past would be nothing compared to what vibe agents can do. Sending credentials is too obvious and for rookies. They could easily spread contaminations across ~/.claude, **/skills/*, or even just a PDF your agent visits periodically in /morning-brief. Your entire filesystem is the new distributed codebase. Every file that could go into context adds to the attack surface. Every text can be a base64 virus.

In the new world of on-demand software, I try to minimize dependencies. People rarely need all the APIs supported in LiteLLM, so you might as well build a custom router with only what you need on the fly (which I did in one of my late-night claude sessions).

Unfortunately, there is very little middle ground between "pressing yes mindlessly for every edit" and "--dangerously-skip-permissions". There will be a full blooming industry for "de-vibing": dampening the slop and putting guardrails/accountability around agentic frameworks. They are the boring old, audited Software 1.0 that watches over the rebellious adolescents of Software 3.0. Claws need shells. Probably many layers of nested shells.

Daniel Hnyk @hnykda

Mar 24

LiteLLM HAS BEEN COMPROMISED, DO NOT UPDATE. We just discovered that LiteLLM PyPI release 1.82.8 has been compromised. It contains litellm_init.pth with base64-encoded instructions to send all the credentials it can find to a remote server and self-replicate. Link below
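For context on why a `.pth` file is dangerous: Python's `site` module executes any line in a `.pth` file that starts with `import` every time the interpreter starts. The sketch below is a rough heuristic scanner for such lines and for long base64-looking runs. The function name, the 80-character threshold, and the demo files are assumptions made for illustration; it is not a substitute for a real audit and it will produce false positives (for example on very long paths).

```python
import re
import tempfile
from pathlib import Path

# Heuristic: 80+ consecutive base64-alphabet characters (threshold is arbitrary).
B64_RUN = re.compile(r"[A-Za-z0-9+/=]{80,}")


def suspicious_pth_lines(directory):
    """Return (file name, line number, reason) for lines worth a manual look."""
    findings = []
    for pth in sorted(Path(directory).glob("*.pth")):
        text = pth.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), 1):
            # The site module runs lines that begin with "import" + space/tab.
            executable = line.startswith(("import ", "import\t"))
            if executable or B64_RUN.search(line):
                reason = "exec" if executable else "base64-like"
                findings.append((pth.name, lineno, reason))
    return findings


# Demo on throwaway files, so nothing on the real system is touched.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "plain.pth").write_text("/some/extra/path\n")
    Path(tmp, "odd.pth").write_text("import os; os.getcwd()\n" + "QUJD" * 30 + "\n")
    findings = suspicious_pth_lines(tmp)
```

To check a real environment, one option is to call `suspicious_pth_lines` on each directory returned by `site.getsitepackages()`, bearing in mind that some legitimate packages also ship executable `.pth` lines.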

Jim Fan

@DrJimFan

Mar 23

Teleop is so 2025. Ever since we unveiled EgoScale and the dexterity scaling law, it's been clear to us and the ecosystem that behavior cloning directly from humans is the way to break the curse of teleop. 2026 is all about scaling robot learning without robots.

Danfei Xu

@danfei_xu

Mar 23

Introducing EgoVerse: an ecosystem for robot learning from egocentric human data. Built and tested by 4 research labs and 3 industry partners, EgoVerse enables both science and scaling.

1300 hrs, 240 scenes, 2000 tasks, and growing.

Dataset design, findings, and ecosystem 🧵

Jim Fan

@DrJimFan

Feb 25

We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, fold/roll shirts, all learned primarily from 20,000 hours of egocentric human video with no robot in the loop. Humans are the most scalable embodiment on the planet.

We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate.

Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans. Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion and retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution.

Our recipe is called "EgoScale":

- Pre-train GR00T N1.5 on 20K hours of human video, mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks.
- Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency.
- Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30% gains over training on G1 data alone.

The scalable path to robot dexterity was never more robots. It was always us. Deep dives in thread:
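As a rough illustration of what a log-linear scaling law and its R² mean, the sketch below fits loss = a + b · log10(hours) by least squares and computes the coefficient of determination. The data points are made up for the example and are not from the EgoScale paper; the function name and the choice of base-10 logarithm are also assumptions.

```python
import numpy as np


def fit_log_linear(hours, loss):
    """Least-squares fit of loss = a + b * log10(hours); returns (a, b, r2)."""
    x = np.log10(np.asarray(hours, dtype=float))
    y = np.asarray(loss, dtype=float)
    b, a = np.polyfit(x, y, 1)  # polyfit returns slope first, then intercept
    pred = a + b * x
    ss_res = float(np.sum((y - pred) ** 2))      # unexplained variation
    ss_tot = float(np.sum((y - y.mean()) ** 2))  # total variation
    return float(a), float(b), 1.0 - ss_res / ss_tot


# Hypothetical measurements: a clean log-linear trend plus small noise.
hours = [1_000, 2_000, 5_000, 10_000, 20_000]
noise = [0.002, -0.001, 0.001, -0.002, 0.001]
loss = [0.90 - 0.10 * np.log10(h) + e for h, e in zip(hours, noise)]

a, b, r2 = fit_log_linear(hours, loss)
```

A negative slope `b` means each tenfold increase in video hours lowers the loss by a roughly constant amount, and an R² close to 1 means the straight line in log space explains almost all of the variation.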

Jim Fan

@DrJimFan

Feb 25

This is a huge team effort at NVIDIA Robotics. Check out @ruijie_zheng12's deep dive:

- Website: research.nvidia.com/labs/gea…
- Paper: arxiv.org/abs/2602.16710

x.com/ruijie_zheng12/status/…

Ruijie Zheng @ruijie_zheng12

Feb 25

Proud to introduce EgoScale: We pretrained a GR00T VLA model on 20K hours of egocentric human video and discovered that robot dexterity can be scaled, not with more robots, but with more human data. A thread on 🧵what we learned. 👇

Jim Fan

@DrJimFan

Feb 25

We would also like to thank our dexterous hand hardware provider, Sharpa, for their great support!

Jim Fan

@DrJimFan

Feb 24

What can half of GPT-1 do? We trained a 42M transformer called SONIC to control the body of a humanoid robot. It takes a remarkable amount of subconscious processing for us humans to squat, turn, crawl, sprint. SONIC captures this "System 1" - the fast, reactive whole-body intelligence - in a single model that translates any motion command into stable, natural motor signals. And it's all open-source!!

The key insight: motion tracking is the one, true scalable task for whole body control. Instead of hand-engineering rewards for every new skill, we use dense, frame-by-frame supervision from human mocap data. The data itself encodes the reward function: "configure your limbs in any human-like position while maintaining balance".

We scaled humanoid motion RL to an unprecedented scale: 100M mocap frames and 500,000 parallel robots across 128 GPUs. NVIDIA Isaac Lab allows us to accelerate physics at 10,000x faster tick, giving robots many years of virtual experience in only hours of wall clock time. After 3 days of training, the neural net transfers zero-shot to the real G1 robot with no finetuning. 100% success rate across 50 diverse real-world motion sequences.

One SONIC policy supports all of the following:

- VR whole-body teleoperation
- Human video. Just point a webcam to live stream motions.
- Text prompts. "Walk sideways", "dance like a monkey", "kick your left foot", etc.
- Music audio. The robot dances to the beat, adapting to tempo and rhythm.
- VLA foundation models. We plugged in GR00T N1.5 and achieved 95% success on mobile tasks.

We open-source the code and model checkpoints!! Deep dive in thread:
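One way to picture "the data itself encodes the reward function" is a per-frame tracking reward that is highest when the robot's pose matches the mocap reference. The sketch below uses an exponential of the squared joint error, a form common in the motion-imitation literature. It is an assumption for illustration, not SONIC's actual reward, and the function name, `sigma` value, and toy clip are all hypothetical.

```python
import numpy as np


def tracking_reward(joint_pos, ref_joint_pos, sigma=0.5):
    """Dense per-frame reward in (0, 1]; equals 1.0 on a perfect pose match."""
    err = np.asarray(joint_pos, dtype=float) - np.asarray(ref_joint_pos, dtype=float)
    return float(np.exp(-np.sum(err ** 2) / sigma ** 2))


# Toy reference clip: 3 frames of a 4-joint pose (made-up numbers).
ref_clip = np.zeros((3, 4))

# Following the reference exactly earns the maximum reward at every frame.
perfect = [tracking_reward(frame, ref) for frame, ref in zip(ref_clip, ref_clip)]

# A pose that is 0.5 rad off on every joint earns much less.
off = tracking_reward(np.full(4, 0.5), ref_clip[0])
```

Because the same formula applies to every frame of every clip, adding a new skill under this scheme means adding reference motion rather than writing a new reward.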

Jim Fan

@DrJimFan

Feb 24

Website: nvlabs.github.io/GEAR-SONIC/
Codebase and weights: github.com/NVlabs/GR00T-Whol…
Whitepaper: arxiv.org/abs/2511.07820

Check out @zhengyiluo's post: x.com/zhengyiluo/status/2024…

Zhengyi “Zen” Luo

@zhengyiluo

Feb 20

SONIC is now open-source! Generalist whole-body teleoperation for EVERYONE! Our team has long been building comprehensive pipelines for whole-body control, kinematic planner, and teleoperation, and they will all be shared. This will be a continuous update; inference code and model are already there, training code and gr00t integration coming soon!

Code: github.com/NVlabs/GR00T-Whol…
Docs: nvlabs.github.io/GR00T-Whole…
Site: nvlabs.github.io/GEAR-SONIC/

Jim Fan

@DrJimFan

Feb 24

And @yukez 's announcement: x.com/yukez/status/202463942…

Yuke Zhu @yukez

Feb 20

We have seen rapid progress in humanoid control — specialist robots can reliably generate agile, acrobatic, but preset motions. Our singular focus this year: putting generalist humanoids to do real work. To progress toward this goal, we developed SONIC (nvlabs.github.io/GEAR-SONIC/), a Behavior Foundation Model for real-time, whole-body motion generation that supports teleoperation and VLA inference for loco-manipulation. Today, we’re open-sourcing SONIC on GitHub. We are excited to see what the community builds upon SONIC and to collectively push humanoid intelligence toward real-world deployment at scale. 🌐 Paper: arxiv.org/abs/2511.07820 📃 Code: github.com/NVlabs/GR00T-Whol…

Jim Fan

@DrJimFan

Feb 20

Announcing DreamDojo: our open-source, interactive world model that takes robot motor controls and generates the future in pixels. No engine, no meshes, no hand-authored dynamics. It's Simulation 2.0. Time for robotics to take the bitter lesson pill.

Real-world robot learning is bottlenecked by time, wear, safety, and resets. If we want Physical AI to move at pretraining speed, we need a simulator that adapts to pretraining scale with as little human engineering as possible.

Our key insights: (1) human egocentric videos are a scalable source of first-person physics; (2) latent actions make them "robot-readable" across different hardware; (3) real-time inference unlocks live teleop, policy eval, and test-time planning *inside* a dream.

We pre-train on 44K hours of human videos: cheap, abundant, and collected with zero robot-in-the-loop. Humans have already explored the combinatorics: we grasp, pour, fold, assemble, fail, retry, across cluttered scenes, shifting viewpoints, changing light, and hour-long task chains, at a scale no robot fleet could match.

The missing piece: these videos have no action labels. So we introduce latent actions: a unified representation inferred directly from videos that captures "what changed between world states" without knowing the underlying hardware. This lets us train on any first-person video as if it came with motor commands attached. As a result, DreamDojo generalizes zero-shot to objects and environments never seen in any robot training set, because humans saw them first.

Next, we post-train onto each robot to fit its specific hardware. Think of it as separating "how the world looks and behaves" from "how this particular robot actuates." The base model follows the general physical rules, then "snaps onto" the robot's unique mechanics. It's kind of like loading a new character and scene assets into Unreal Engine, but done through gradient descent and generalizes far beyond the post-training dataset.

A world simulator is only useful if it runs fast enough to close the loop. We train a real-time version of DreamDojo that runs at 10 FPS, stable for over a minute of continuous rollout. This unlocks exciting possibilities:

- Live teleoperation *inside* a dream. Connect a VR controller, stream actions into DreamDojo, and teleop a virtual robot in real time. We demo this on Unitree G1 with a PICO headset and one RTX 5090.
- Policy evaluation. You can benchmark a policy checkpoint in DreamDojo instead of the real world. The simulated success rates strongly correlate with real-world results - accurate enough to rank checkpoints without burning a single motor.
- Model-based planning. Sample multiple action proposals → simulate them all in parallel → pick the best future. Gains 17% real-world success out of the box on a fruit packing task.

We open-source everything!! Weights, code, post-training dataset, eval set, and whitepaper with tons of details to reproduce. DreamDojo is based on NVIDIA Cosmos, which is open-weight too.

2026 is the year of World Models for physical AI. We want you to build with us. Happy scaling! Links in thread:
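The "sample, simulate, pick the best" loop described under model-based planning can be sketched with a simple random-shooting planner. Everything here is a stand-in: `toy_model` is a two-line linear dynamics function rather than a learned video world model, `toy_score` is a hand-written distance to a goal, and the proposals are rolled out one by one rather than in parallel. None of the names come from the DreamDojo codebase.

```python
import numpy as np


def plan_by_sampling(world_model, score, state,
                     num_proposals=64, horizon=5, action_dim=2, seed=0):
    """Sample action sequences, roll each through the model, keep the best."""
    rng = np.random.default_rng(seed)
    proposals = rng.uniform(-1.0, 1.0, size=(num_proposals, horizon, action_dim))
    best_score, best_actions = -np.inf, None
    for actions in proposals:          # a real system would batch this step
        s = np.array(state, dtype=float)
        for a in actions:              # imagined rollout, no real robot involved
            s = world_model(s, a)
        value = score(s)
        if value > best_score:
            best_score, best_actions = value, actions
    return best_actions, best_score


# Toy stand-ins (assumptions): linear dynamics and a distance-to-goal score.
goal = np.array([1.0, -1.0])
toy_model = lambda s, a: s + 0.2 * a
toy_score = lambda s: -float(np.linalg.norm(s - goal))

actions, best = plan_by_sampling(toy_model, toy_score, state=[0.0, 0.0])
```

In practice only the first action of the winning sequence is usually executed before replanning, which is why the model has to run fast enough to close the loop.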

Jim Fan

@DrJimFan

Feb 20

- Project website: dreamdojo-world.github.io/
- Paper: arxiv.org/abs/2602.06949
- Code repo and model ckpts: github.com/NVIDIA/DreamDojo

This is a huge team effort at NVIDIA. All credit goes to the wonderful teams who poured their hearts into it!

Jim Fan

@DrJimFan

Feb 20

Check out @ShenyuanGao's technical deep dive: x.com/ShenyuanGao/status/202…

Shenyuan Gao

@ShenyuanGao

Feb 20

🤖 How can we enable zero-shot generalization to unseen scenarios for robot world models? Thrilled to share DreamDojo 🌎 — an interactive robot world model pretrained on 44K hours of human egocentric videos, the largest and most diverse dataset to date for robot world model learning. Our model not only excels in generalization, but also supports real-time interaction at 10 FPS after distillation. It enables several important applications, including live teleoperation, policy evaluation, and model-based planning at test time. 🔗 Project: dreamdojo-world.github.io/ 📰 Paper: arxiv.org/abs/2602.06949 🤗 Code & models & datasets: github.com/NVIDIA/DreamDojo #WorldModels #Robotics #EmbodiedAI #RL #AI #NVIDIA Sharing more details in the thread 🧵
