The internet is full of video. So why can't novel view synthesis just scale on it?
Real-world video is simultaneously unposed, messy, and dynamic, breaking self-supervised NVS.
We fixed that. RayDer learns static-scene NVS from dynamic internet video, scaling like an LLM. A🧵