World models might need more compute than LLMs.. and LLMs already triggered one of the largest infrastructure buildout in history 🏭
LLMs learn the structure of language. World models learn the structure of causality: how objects collide, how fluids flow, how crowds behave, how a scene changes after an action.
This changes the compute math: Text is sparse. Video is dense, temporal, redundant, and expensive to curate. A book can fit in hundreds of kilobytes. A minute of useful video can be hundreds of megabytes before you even ask whether it contains the right objects, actions, camera angles, contacts, failures, and edge cases.
The trick is compression. World models do not predict every pixel in a 4K frame. They compress video into latent representations and learn to predict how those latent states evolve under perturbations: actions, camera motion, text prompts, or environmental change.
Even with this compression, the compute requirements are staggering. NVIDIA’s Cosmos world-model trained on 20M hrs of video with 10,000 H100 GPUs for roughly three months (10,000×90×24≈21.6M GPU-hours). At GPU-hour rates, that is ~ $100M of training compute. Owning comparable capacity outright would be a $400M capex decision 💸
But compute is not the only constraint... Data may be the harder one.
The internet gave LLMs massive corpus of text. Robotics has no equivalent internet-scale corpus of action-conditioned experience. Passive video is not enough. That data has to be generated through teleoperation, real robots, simulation, synthetic worlds, or some combination of all four.
And this is just training.
Inference may be an even bigger bottleneck. A robot does not just answer a prompt. It has to plan, simulate possible futures, handle uncertainty, and act safely in real time. World models shift part of that burden into learned representations. Instead of hand-coding every corner of physics, world models amortize some of that complexity into a neural network. The "stochastic messiness of reality"* gets baked into the weights.
That is the real shift... LLMs taught machines to read the internet. World models will teach machines to operate in reality.
The compute buildout for that may be much larger than people expect.
* I read this somewhere and liked how it captured the challenges of the real world.