Nick Stracke

Tweets

Pinned Tweet

Nick Stracke

@rmsnorm

Apr 14

Video diffusion models learn motion indirectly through pixels. But motion itself is much lower-dimensional. We introduce 64× temporally compressed motion embeddings that directly capture scene dynamics. This enables efficient planning -> 10,000× faster than video models. 🧵👇

0:19

337

48,604

Nick Stracke retweeted

Vai Viswanathan

@vai_viswanathan

Another crazy CVPR 2026 world model result. “Envisioning the Future” forecasts where points in a scene will move, step by step, from a single image. No dense video needed.
- 3,000x faster than video models, 10x fewer parameters, and 5x more accurate under a fixed compute budget.
- An autoregressive diffusion model rolls sparse point trajectories forward through short, predictable steps, modeling uncertainty as it grows.
- Why it matters: Rollouts get cheap enough to simulate thousands of futures and plan over them, hitting 78% billiard planning accuracy vs 16% for the best dense video baseline.

829

Nick Stracke retweeted

F. Güney @ftm_guney

Jun 8

second runner up is a super cool paper from Björn Ommer’s group at LMU. they first encode point trajectories into a latent using a VAE. then they learn to denoise future trajectories conditioned on the past. they do planning experiments on LIBERO by training a small decoder to convert denoised latents to robot actions. I joked that this is latent stable diffusion all over again by Björn but this time with point trajectories 😄

1,398

Nick Stracke retweeted

clankr

@clankrmedia

Jun 7

ZipMo hands a robot its plan by daydreaming how the scene should move. No pixels, no rendering, no hour-long video to sit through. 10,000x faster than a top video model! Just a fast read on where everything's headed, and the robot runs with it.

Give it a start frame and a goal in plain language. It predicts how the arm and the objects around it should move to get there. A thin policy head reads that motion plan and turns it into the next arm command.

The head never sees the task, only the predicted motion. So all the reasoning lives in the motion model. The head is pure inverse dynamics, motion in, action out.

On the LIBERO benchmark it replans on every new frame, and it beats the trajectory-based policy methods it lines up against: ATM, Tra-MoE, Amplify.

It pulls this off by never touching pixels. It learns a compact motion space from tracked trajectories, then generates motion straight inside that space. And stranger still, compressing that space 64x makes the motion sharper, not worse.

Congrats to the team, lovely work. @rmsnorm, @KoljaBauer, @StefanABaumann

0:20

Nick Stracke

@rmsnorm

Jun 7

Come visit our poster today (Sunday) 3:30 pm at CVPR poster 595! We'll also show some new results on translating the generated motion embeddings back to video 👇

1,459

Nick Stracke

@rmsnorm

Apr 14

0:19

5,271

Nick Stracke

@rmsnorm

Jun 7

Here using LTX 2! This allows us to efficiently explore many possible futures and render only the most relevant ones back into video.

0:05

167

Nick Stracke retweeted

Stefan Baumann

@StefanABaumann

Jun 5

You don't need a video model to predict motion. We'll talk about how to be 3000× faster at CVPR Poster 634, this morning at 10:45. Drop by and check it out!

Stefan Baumann

@StefanABaumann

Apr 13

You don't imagine the future by mentally rendering a movie. You trace how things move -- abstractly, sparsely, step by step. We built a model that does exactly this. It predicts motion, not pixels -- and it's 3,000× faster than video world models. Myriad, accepted at @CVPR 2026

5,109

Nick Stracke retweeted

jo.schb ✈️CVPR @jo_schb

Jun 4

⚠️ Standard first stages are not sufficient for safety-critical applications! The most extreme weather events are often the hardest to decode. One latent → many plausible reconstructions. Deterministic decoders hide that uncertainty. Meet FREUD 🧵👇

0:02

1,269

Nick Stracke

@rmsnorm

Jun 1

Check out our work on how to scale NVS on internet-scale data! We provide fixes to the unsupervised NVS pipeline (RayZer) and also obtain more interpretable pose estimations while simplifying the overall setup.

Stefan Baumann

@StefanABaumann

Jun 1

The internet is full of video. So why can't novel view synthesis just scale on it? Real-world video is simultaneously unposed, messy, and dynamic, breaking self-supervised NVS. We fixed that. RayDer learns static-scene NVS from dynamic internet video, scaling like an LLM. A🧵

0:06

730

Nick Stracke

@rmsnorm

Apr 27

💡 Training with differently noised patches increases overall image gen performance, as the model learns a better underlying representation. This holds even for plain Euler sampling, but their sampler increases the gap even more!

jo.schb ✈️CVPR @jo_schb

Apr 27

Diffusion models treat every part of an image equally. → Same number of steps. Same compute. But images aren’t uniform. 🤔 Some regions are easy, others are hard. So why force the model to treat them the same? 🧵

0:05

4,379

Nick Stracke retweeted

Simo Ryu

@cloneofsimo

Apr 15

Cool stuff

Nick Stracke

@rmsnorm

Apr 14

0:19

7,716

Nick Stracke retweeted

Nick Stracke

@rmsnorm

Apr 14

Stop predicting motion step-by-step. Model the whole motion in a compact representation for efficient planning. 📄 Paper: arxiv.org/abs/2604.11737 💻 Models: compvis.github.io/long-term-… Joint work with @KoljaBauer, @StefanABaumann, @itsbautistam, Josh Susskind, and Björn Ommer.

Learning Long-term Motion Embeddings for Efficient Kinematics Generation

Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible...

arxiv.org

2,731

Nick Stracke retweeted

Miguel Angel Bautista

@itsbautistam

Apr 14

Amazing work led by @rmsnorm @KoljaBauer and our collaborators at LMU, to be presented at @CVPR! Personally, I find this question of "what's the right level of abstraction for planning in physical space?" to be very intriguing. Pixels over time are very low SNR (i.e., the argument behind JEPA) but motion/trajectories carry a lot of information while being extremely compressible. I believe there's a lot more to uncover from this direction. Very glad to be part of this one!

Nick Stracke

@rmsnorm

Apr 14

0:19

1,881

Nick Stracke retweeted

Brian Roemmele

@BrianRoemmele

Apr 14

A massive step forward for AI video!

Nick Stracke

@rmsnorm

Apr 14

0:19

5,144

Nick Stracke retweeted

atharva ☆

@k7agar

Apr 14

I have been saying

Nick Stracke

@rmsnorm

Apr 14

0:19

5,297

Nick Stracke retweeted

Nick Stracke

@rmsnorm

Apr 14

Replying to @KoljaBauer @StefanABaumann @itsbautistam

1️⃣x.com/neerjathakkar/status/2… Also, shoutout to two other recent works that explore how to use point tracks for world modeling. 👇...

Neerja Thakkar

@neerjathakkar

Apr 2

What’s the right representation for a world model? 3D, pixels, or something else? Excited to release our new paper “Forecasting Motion in the Wild” where we propose point tracks as tokens for generating complex non-rigid motion and behavior. From @GoogleDeepmind @Berkeley_AI @TTIC_Connect

2,185

Nick Stracke

@rmsnorm

Apr 14

2️⃣x.com/StefanABaumann/status/…

Stefan Baumann

@StefanABaumann

Apr 13

1,056

Nick Stracke retweeted

Kolja Bauer @KoljaBauer

Apr 14

Do we really need pixel generation to model motion? 🤔 We show how directly representing motion in a compact space enables efficient, scalable planning. 10,000× faster than video models, enabling planning and reasoning in open-world and robotics settings. Check it out ⬇️

Nick Stracke

@rmsnorm

Apr 14

0:19

3,657