Diffusion transformers waste compute by treating every pixel equally, regardless of content complexity. ELIT (Elastic Latent Interface Transformer) fixes this with a simple idea: insert a variable-length set of latent tokens that learn where to spend computation.
Two lightweight cross-attention layers (Read/Write) route information between spatial tokens and latents. The model learns an importance ordering during training by randomly dropping tail latents, so earlier latents capture global structure while later ones handle fine details.
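A minimal NumPy sketch of the Read/Write idea with random tail dropping, for illustration only. Everything here is an assumption rather than the authors' implementation: the function names (`cross_attend`, `drop_tail`, `elastic_block`), single-head attention, the residual connections, the direction of each layer (Read = latents query spatial tokens, Write = spatial tokens query latents), and the uniform sampling of the kept prefix length.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` gather information from `context`."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def drop_tail(latents, rng, min_keep=1):
    """Keep a random-length prefix, so early latents must carry the most information."""
    keep = int(rng.integers(min_keep, latents.shape[0] + 1))  # upper bound is exclusive
    return latents[:keep]

def elastic_block(spatial, latents, params, rng=None):
    """Hypothetical block: pass `rng` during training to drop tail latents."""
    if rng is not None:
        latents = drop_tail(latents, rng)
    # Read (assumed direction): latents query the spatial tokens.
    latents = latents + cross_attend(latents, spatial, *params["read"])
    # The backbone's transformer blocks would process `latents` here (omitted).
    # Write (assumed direction): spatial tokens query the latents.
    spatial = spatial + cross_attend(spatial, latents, *params["write"])
    return spatial, latents

rng = np.random.default_rng(0)
d = 16  # hypothetical token width
params = {
    name: tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    for name in ("read", "write")
}
spatial = rng.normal(size=(64, d))  # e.g. an 8x8 grid of spatial tokens
latents = rng.normal(size=(32, d))  # full latent budget

train_spatial, train_latents = elastic_block(spatial, latents, params, rng)
full_spatial, full_latents = elastic_block(spatial, latents, params)
print(train_spatial.shape, train_latents.shape, full_latents.shape)
```

The spatial output keeps its shape no matter how many latents survive, which is what would let the latent count act as a compute knob at inference time.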
Results on ImageNet-1K at 512px: 35.3% better FID, 39.6% better FDD, and ~33% cheaper classifier-free guidance. It works across DiT, UViT, HDiT, and MMDiT architectures with no changes to the training objective.
By Moayed Haji Ali, @vislang (Rice University), @SergeyTulyakov, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov and team at @Snap.
Accepted at CVPR 2026.