🎯 Prior normalizing flows (NFs) such as STARFlow add random noise to VAE latents as augmentation, leading to complex pipelines with extra noising and denoising steps.
End-to-end training is appealing, but previous attempts (e.g. REPA-E) observed latent collapse when naively training VAEs and generative models together.
💡 We propose SimFlow — a simple, end-to-end training framework for NFs. The key idea is surprisingly simple: fix the variance predicted by the VAE encoder to a constant (e.g. 0.5^2).
- Simple. The encoder outputs a broader latent distribution, and the decoder learns to reconstruct clean images directly — no extra noise or denoising design (unlike STARFlow).
- End-to-end. Fixed variance simplifies the ELBO and stabilizes joint training of the NF and the VAE.
- Effective. SimFlow sets a new state of the art among NFs on ImageNet 256×256 and 512×512.
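To make the fixed-variance idea concrete, here is a minimal sketch in plain Python, assuming a diagonal Gaussian posterior N(mu, sigma^2 I) with sigma = 0.5 as in the example above. The names `sample_latent` and `gaussian_entropy` are hypothetical and are not taken from the SimFlow code. The sketch illustrates one likely reading of "simplifies the ELBO": the entropy of the posterior does not depend on mu, so with a constant variance that term is a constant and only the encoder mean is learned.

```python
import math
import random

FIXED_STD = 0.5  # constant std, i.e. variance 0.5^2 as in the post's example


def sample_latent(mu, rng, std=FIXED_STD):
    """Reparameterized sample z = mu + std * eps, where std is a constant
    rather than an encoder output (assumption: diagonal Gaussian posterior)."""
    return [m + std * rng.gauss(0.0, 1.0) for m in mu]


def gaussian_entropy(dim, std=FIXED_STD):
    """Entropy in nats of N(mu, std^2 I) in `dim` dimensions.
    It does not depend on mu, so it is constant when std is fixed."""
    return 0.5 * dim * math.log(2.0 * math.pi * math.e * std ** 2)


rng = random.Random(0)
mu = [0.0, 1.0, -2.0, 3.0]  # stand-in for the encoder's predicted mean
z = sample_latent(mu, rng)   # noisy latent that the NF and decoder would see
```

In a real training loop the decoder would reconstruct the clean image from `z` while the NF models the distribution of `z`; those components are omitted here.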
🙌 Joint work with Guangting Zheng, Tao Yang, Rui Zhu, Xingjian Leng (@xingjian_leng), Stephen Gould (@sgould_au), and Liang Zheng (@LiangZheng_06).
🙏 Huge thanks to other colleagues at ByteDance and Hanhong Zhao for many inspiring discussions.
🔗 Project: qinyu-allen-zhao.github.io/S…
⭐ Code: github.com/ByteDance-Seed/Si…
🤖 Models: huggingface.co/QinyuZhao1116…
📄 Preprint: arxiv.org/abs/2512.04084