Incredible work by @lucasmaes_ and the team! Stable, end-to-end JEPA training from raw pixels is a massive leap forward.
I took the 15M-parameter LeWorldModel, trained it from scratch, and pushed the local inference speed to the absolute limit.
The Engineering Pipeline:
• Cloud Training: 100k steps on the 43GB PushT dataset using an H100.
• The Bottleneck: For local MPC planning on a consumer gaming GPU, the native PyTorch Epps-Pulley math for the SIGReg loss was memory-bound.
• I wrote a custom warp-reduction CUDA kernel that fuses the mean, variance, and hinge-loss calculations, keeping the intermediate values in SM registers.
• The Result: Crushed execution latency from 6.6ms down to 92µs (a 72.4x speedup) and achieved an 86.0% zero-shot success rate.
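To make the warp-reduction idea concrete, here is a minimal CPU sketch in Python/NumPy. It is not the kernel from the fork: `warp_reduce_sum` only simulates the shuffle-down tree reduction that `__shfl_down_sync` performs across 32 lanes, and `fused_mean_var_hinge`, `target_std`, and `eps` are hypothetical names. The hinge is assumed to be a VICReg-style penalty on the standard deviation, since the post does not give its exact form.

```python
import numpy as np

WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def warp_reduce_sum(lane_vals):
    """CPU simulation of a shuffle-down tree reduction over one warp."""
    vals = np.asarray(lane_vals, dtype=np.float64).copy()
    offset = WARP_SIZE // 2
    while offset > 0:
        # Lane i adds the value held by lane i + offset. Lanes without a
        # partner add 0 here; only lane 0's final value is used.
        partner = np.zeros_like(vals)
        partner[:-offset] = vals[offset:]
        vals += partner
        offset //= 2
    return vals[0]


def fused_mean_var_hinge(x, target_std=1.0, eps=1e-4):
    """One pass over x: per-lane partial sums, then two warp reductions.

    The hinge form max(0, target_std - std) is an assumption, not taken
    from the post.
    """
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    # Each lane accumulates a strided slice, as a thread would in registers.
    lane_sum = np.array([x[lane::WARP_SIZE].sum() for lane in range(WARP_SIZE)])
    lane_sq = np.array([(x[lane::WARP_SIZE] ** 2).sum() for lane in range(WARP_SIZE)])
    total = warp_reduce_sum(lane_sum)
    total_sq = warp_reduce_sum(lane_sq)
    mean = total / n
    var = max(total_sq / n - mean * mean, 0.0)  # E[x^2] - E[x]^2
    hinge = max(0.0, target_std - float(np.sqrt(var + eps)))
    return mean, var, hinge
```

The point of the pattern is that the sum and the sum of squares are gathered in the same pass over the data, so the mean, variance, and hinge can all be derived without writing intermediates back to global memory.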
What This Means for Robotics:
Giving a robot a "World Model" (a brain that can actually imagine and predict the physics of its environment before making a move) used to require massive, expensive compute. By optimizing the underlying CUDA math, I've shown that these forward-thinking robotic brains can run smoothly in real time on a standard consumer laptop. Fast, cheap, and accessible robotic AI is here.
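As a rough illustration of "imagining before making a move", here is a minimal random-shooting MPC sketch in Python/NumPy. Everything in it is an assumption for illustration: `toy_world_model` is a hand-written stand-in for a learned dynamics model (not LeWorldModel), and `plan_random_shooting` and its parameters are hypothetical names. The planner used in the actual project may differ.

```python
import numpy as np


def toy_world_model(state, action):
    # Hypothetical stand-in for a learned dynamics model:
    # a 2-D point that is nudged by the action at each step.
    return state + 0.1 * action


def plan_random_shooting(state, goal, horizon=10, n_candidates=256, seed=0):
    """Sample action sequences, roll each out in the model, keep the best."""
    rng = np.random.default_rng(seed)
    # Candidate action sequences, each action in [-1, 1]^2.
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, 2))
    states = np.repeat(np.asarray(state, dtype=np.float64)[None, :], n_candidates, axis=0)
    for t in range(horizon):
        # "Imagine" one step ahead for every candidate at once.
        states = toy_world_model(states, actions[:, t])
    costs = np.linalg.norm(states - np.asarray(goal), axis=1)
    best = int(np.argmin(costs))
    # MPC executes only the first action, then replans from the new state.
    return actions[best, 0], float(costs[best])
```

Every candidate requires `horizon` model evaluations, which is why the per-step latency of the model dominates planning time.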
Custom CUDA kernel & local-to-cloud pipeline available in my fork here:
github.com/Kars07/le-wm-cuda…
JEPAs are finally easy to train end-to-end without any tricks!
Excited to introduce LeWorldModel: a stable, end-to-end JEPA that learns world models directly from pixels, no heuristics.
15M params, 1 GPU, and full planning <1 second.
le-wm.github.io