This beautiful work by @cut_pow suggests diffusion models have at minimum learned to reproduce static monocular depth cues. Not clear if that's the same as learning 3D geometry. But if not, MiDaS has no 3D knowledge either; it uses the same cues. FYI @jon_barron
1.2 It relies heavily on the quality of the depth maps and assumes that SD has implicit knowledge of the scene geometry in an image. It can therefore plausibly inpaint missing parts without explicit knowledge of the scene's underlying 3D meshes.
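To make the dependence on depth quality concrete, here is a minimal sketch of the general warp-then-inpaint idea, not the author's actual pipeline. It assumes a simple horizontal camera shift, and `warp_with_depth`, `shift`, and `focal` are hypothetical names chosen for illustration. Each pixel is moved by a disparity inversely proportional to its depth; target pixels that receive no source pixel become holes, which is the region an inpainting model such as SD would be asked to fill.

```python
def warp_with_depth(image, depth, shift, focal=1.0):
    """Forward-warp a 2D image horizontally using a per-pixel depth map.

    Toy model (an assumption, not the original method): disparity in pixels
    is focal * shift / depth, so nearer pixels move further.
    Returns (warped, holes), where holes[y][x] is True for unfilled pixels.
    """
    h, w = len(depth), len(depth[0])
    warped = [[None] * w for _ in range(h)]
    zbuf = [[float("inf")] * w for _ in range(h)]  # nearest depth seen per target pixel
    for y in range(h):
        for x in range(w):
            z = depth[y][x]
            nx = x + round(focal * shift / z)
            # Keep the nearest surface when several source pixels collide.
            if 0 <= nx < w and z < zbuf[y][nx]:
                zbuf[y][nx] = z
                warped[y][nx] = image[y][x]
    # Pixels nothing landed on are disocclusions to be inpainted.
    holes = [[v is None for v in row] for row in warped]
    return warped, holes
```

An error in `depth` changes where pixels land and which pixels are marked as holes, so the inpainting model is handed the wrong region to fill. That is one way to read the reliance on depth-map quality noted above.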