With NVIDIA releasing its first-ever open-weights autopilot model, Alpamayo-R1, a key frontier is native 4D understanding: not just "describe the video" but reasoning about depth, motion, and 3D interactions over time, even for a specific region.
This paper introduces 4D-RGPT (also from NVIDIA!), which tackles this by distilling 4D perception (depth and motion cues) from a frozen expert into an MLLM during training. It combines latent-feature and explicit-signal distillation with timestamp positional encodings, and the added modules are training-only, so there is no extra inference cost.
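To make the training recipe a bit more concrete, here is a rough sketch of how a two-part distillation objective with timestamp encodings could look. It is not the paper's implementation: the function names, the sinusoidal form of the encoding, the choice of MSE and L1 as loss terms, and the weights `w_latent` and `w_explicit` are all assumptions made for illustration.

```python
import math


def timestamp_encoding(t_seconds, dim=8, base=10000.0):
    """Sinusoidal encoding of a frame timestamp.

    Assumed form for illustration; the paper's encoding may differ.
    """
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        enc.append(math.sin(t_seconds * freq))
        enc.append(math.cos(t_seconds * freq))
    return enc


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def l1(a, b):
    """Mean absolute error between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def distillation_loss(student_latent, teacher_latent,
                      student_depth, teacher_depth,
                      task_loss, w_latent=1.0, w_explicit=1.0):
    """Combine the usual task loss with two distillation terms.

    - Latent term: pull the student's (projected) features toward the
      frozen expert's features.
    - Explicit term: a training-only head predicts a signal such as depth,
      which is compared with the expert's output.
    The teacher values are treated as constants (the expert is frozen).
    """
    latent_term = mse(student_latent, teacher_latent)
    explicit_term = l1(student_depth, teacher_depth)
    return task_loss + w_latent * latent_term + w_explicit * explicit_term


# Toy usage: one 2-d feature vector and one predicted depth value.
loss = distillation_loss([0.0, 1.0], [0.5, 1.0], [2.0], [2.5], task_loss=1.0)
```

Because both distillation terms only appear in the loss, the projection and prediction heads that feed them can be dropped after training, which is how a setup like this avoids extra inference cost.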
On R4D-Bench, a new benchmark the authors built for region-level understanding, this approach achieves a 4.3% improvement across the benchmark! It also tops non-region-based 3D and 4D benchmarks by 5.3%.