So these researchers figured out you can basically hallucinate 3D cities into existence using just satellite photos & a diffusion model.
The problem's pretty straightforward: satellites only see rooftops. Building facades? Invisible. Street-level detail? Doesn't exist. But people want flyable 3D environments, which means you need all that occluded geometry.
When I worked on Google Maps photogrammetry, we could only use satellite-based 3D for isolated stuff like the pyramids - anything city-scale required airplane flyovers. Which is fine until you hit aerial-denied regions where you literally can't fly. Huge chunks of the world are just unavailable.
Their trick is honestly kind of beautiful. They train Gaussian splats on satellite views, but as the camera descends toward ground level, the renders turn to absolute garbage - artifacts everywhere. Instead of fighting this, they just treat those nightmare renders as the input to a diffusion model. Basically - "hey FLUX, fix this mess."
Then here's where it gets clever: they generate multiple diffusion samples per view instead of committing to one. Because any single denoising path is probably wrong in 3D space, but if you generate a couple and let the GS optimization find consensus across them, you get actual geometric consistency.
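To make the consensus idea concrete, here's a toy sketch (mine, not the paper's code). It swaps the real thing for something tiny: each "diffusion sample" is just the true image plus independent noise, and "GS optimization" is plain gradient descent on a mean squared error against all samples at once. Every name and number here (`fit_to_samples`, the noise level, four samples) is made up for illustration.

```python
import random

def mse(a, b):
    """Mean squared error between two equally sized pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def fit_to_samples(samples, steps=200, lr=0.1):
    """Gradient descent on the average squared error against every sample.

    Toy stand-in for the 3D optimization: one shared 'render' has to
    explain all samples at once, so it settles on what they agree on.
    """
    n_pix = len(samples[0])
    render = [0.0] * n_pix
    for _ in range(steps):
        for i in range(n_pix):
            grad = sum(2.0 * (render[i] - s[i]) for s in samples) / len(samples)
            render[i] -= lr * grad
    return render

rng = random.Random(0)
truth = [rng.random() for _ in range(64)]  # pretend ground-truth view
# Hypothetical: four refined samples, each wrong in its own independent way.
samples = [[t + rng.gauss(0.0, 0.2) for t in truth] for _ in range(4)]

consensus = fit_to_samples(samples)
single_errors = [mse(s, truth) for s in samples]
consensus_error = mse(consensus, truth)
avg_single_error = sum(single_errors) / len(single_errors)
```

Under this squared-error loss the shared render converges to the per-pixel mean of the samples, so `consensus_error` comes out below `avg_single_error`. The real pipeline is optimizing splats across many views, not averaging pixels, but the intuition is the same: independent mistakes wash out, shared structure survives.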
They do this in episodes, curriculum style - start high, gradually descend (hence the name Skyfall-GS!). With each iteration the ground-level views get less fucked. By the end you've got real-time flyable cities that look surprisingly real, and the geometry still matches the satellite input.
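The episode loop is easy to sketch too. This is a rough outline under my own assumptions, not their implementation: the linear schedule, the elevation angles, and the `render` / `refine` / `fit` callables are all hypothetical placeholders for the splat renderer, the diffusion model, and the splat optimizer.

```python
from typing import Callable, List

def descent_schedule(start_deg: float, end_deg: float, episodes: int) -> List[float]:
    """Camera elevation angles, evenly spaced from high to low (assumed linear)."""
    if episodes < 2:
        raise ValueError("need at least two episodes to descend")
    step = (start_deg - end_deg) / (episodes - 1)
    return [start_deg - i * step for i in range(episodes)]

def run_curriculum(scene, elevations: List[float],
                   render: Callable, refine: Callable, fit: Callable,
                   samples_per_view: int = 2):
    """One episode per elevation: render, refine several times, refit the scene."""
    for elevation in elevations:
        degraded = render(scene, elevation)  # artifact-heavy render
        targets = [refine(degraded) for _ in range(samples_per_view)]
        scene = fit(scene, elevation, targets)  # consensus across samples
    return scene

# Toy stand-ins so the loop runs end to end; the 'scene' is just a log here.
elevations = descent_schedule(80.0, 20.0, 4)
history = run_curriculum(
    scene=[],
    elevations=elevations,
    render=lambda scene, elev: f"render@{elev}",
    refine=lambda image: image + "+refined",
    fit=lambda scene, elev, targets: scene + [(elev, len(targets))],
)
```

The point of the structure is that each episode renders from the scene the previous episode produced, so the lower views start from something already partly repaired instead of from raw satellite-only garbage.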
No 3D training data. No street-level photos. Just satellite imagery and diffusion doing what it does best - filling in the blanks. It's like neural scene completion but actually practical, and it unlocks basically the entire world.