In recent years, academic and industry work in generative modeling have drifted so far apart that the two are playing entirely different games, and techniques that work in academia may not transfer to industry problems.
The divide isn't just about scale: the different tasks in academia and industry lead to fundamentally different challenges.
Academic work focuses on class-conditional ImageNet generation. This task has a very weak conditioning signal (a single categorical label) and is very data-constrained, with all SOTA methods training for hundreds of epochs. The main challenge in this regime is combating overfitting.
Industry work on image or video generation usually has a much richer conditioning signal (e.g. very long captions, input images, etc.), and the models are almost always underfitting, since data can be scaled to absurd degrees. Overfitting (at least for pretraining) isn't a concern; instead we want to fit the complex data distribution *as fast as possible*.
We hope that GPIC is approachable on the academic budgets people are already spending on ImageNet, while posing problems closer to the industry-scale challenges in generative modeling.