Recommending an article shared by Zhihu contributor tomsheep: PKU & ByteDance just open-sourced Helios, a long video generation model that reaches 19.5 FPS real-time inference on a single H100 GPU, crossing the "real-time video generation" threshold!
Only downside: 14B params are too heavy for consumer GPUs. Let's break down its key tech insights👇
🎬 Current State of Video Generation
Diffusion-based video models (Sora, Kling, Wan 2.x) produce stunning short clips but remain stuck in offline, short-video mode. The ultimate goal is a continuous, interactive "world model" (AI NPC visuals, real-time game rendering, endless video streams), which requires minute-level long-range consistency and ultra-low latency, not just brute-force compute.
🛣️ 3 Core Paradigms Compared
1️⃣ Pure Diffusion Video
• Logic: Treat video as a 3D spatiotemporal tensor, learn global joint distribution
• Pros: Strong global consistency, smooth dynamic transitions (models: Wan 2.x, Sora)
• Cons: Extremely high compute complexity, easy memory explosion; global synchronous denoising, no streaming support
2️⃣ Pure AR Video
• Logic: Split video into time-axis tokens, frame-by-frame causal prediction
• Pros: Native KV-Cache support, theoretically unlimited long video (models: CogVideo, VideoPoet)
• Cons: Extreme compression causes detail loss; token-by-token generation is slow, error-prone
• Formula: P(x_{1:T}) = \prod_{t=1}^{T} P(x_t \mid x_{<t}), where x_t is the t-th token
3️⃣ AR-Diffusion (Helios Core)
• Logic: Time progression handled by AR, spatial/local dynamics by Diffusion
• Pros: Aligns with physical laws, supports streaming, balances long consistency & top visual quality, solving pain points of both prior paradigms
• Formula: P(x_{1:T}) = \prod_{t=1}^{T} P(x_t \mid x_{<t}), where x_t is the t-th temporal segment and each conditional is modeled by diffusion
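To make the AR-Diffusion split concrete, here is a toy sketch of the control flow only: an autoregressive outer loop over time, with a few denoising steps inside each segment. Everything in it is an assumption for illustration (the names `denoise_segment` and `stream_video`, the scalar "frames", the fake denoiser); it is not Helios code.

```python
import random

def denoise_segment(history, seg_len, num_steps, rng):
    """Toy stand-in for a diffusion sampler (hypothetical, not Helios code).

    Starts from Gaussian noise and, over num_steps steps, pulls the segment
    toward a target derived from the clean history.
    """
    target = (history[-1] + 1.0) if history else 0.0
    segment = [rng.gauss(0.0, 1.0) for _ in range(seg_len)]
    for step in range(num_steps):
        # Remove a growing share of the remaining noise; the last step removes all of it.
        alpha = 1.0 / (num_steps - step)
        segment = [x + alpha * (target - x) for x in segment]
    return segment

def stream_video(num_segments, seg_len=4, num_steps=3, seed=0):
    """Autoregressive outer loop: yield each segment as soon as it is denoised."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_segments):
        segment = denoise_segment(history, seg_len, num_steps, rng)
        history.extend(segment)  # the finished segment becomes clean context
        yield segment

for i, seg in enumerate(stream_video(3)):
    print(i, [round(x, 3) for x in seg])
```

Because `stream_video` is a generator, a consumer can display segment t while segment t+1 is still being computed, which is the streaming property pure diffusion lacks.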
🔧 Helios Core Design (Based on Wan-2.1-T2V-14B)
Instead of training from scratch, Helios performs "architectural surgery" on a top bidirectional diffusion model, inheriting its spatial generation capability and transforming it into a streaming autoregressive engine:
1️⃣ Unified History Injection 🧩
Ditches complex causal masking and instead uses Guidance Attention to strictly separate clean historical frames from noisy current frames, so noise from the current frames cannot contaminate the history. A single architecture natively supports T2V/I2V/V2V tasks.
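The post does not spell out how Guidance Attention works internally, so the following is only one plausible reading of "strictly separate", not the actual Helios layer: queries come from the noisy current tokens alone, while clean history tokens serve purely as keys and values and are never rewritten. The function name `one_way_attention` and the tiny vectors are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_way_attention(current, history):
    """Single-head attention sketch (assumption, not the Helios implementation).

    current, history: lists of float vectors of equal dimension.
    Only current tokens produce queries; history is read-only context.
    """
    context = history + current          # keys/values: clean history + current
    dim = len(current[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in current:                    # no query is ever issued for history
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, context))
                        for d in range(dim)])
    return outputs

history = [[1.0, 0.0], [0.0, 1.0]]       # clean past tokens
current = [[0.5, 0.5]]                   # noisy tokens being denoised
print(one_way_attention(current, history))
```

Since history tokens never act as queries, their representations cannot pick up noise from the current frames, and no causal mask is needed to enforce that.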
2️⃣ Lightweight Anti-Drift Mechanism 🚫💨
3 zero-overhead strategies to fix long-video error accumulation:
• Relative RoPE: Solves position drift & repeated actions
• First-Frame Anchor: Retains the first frame as a global anchor, fixing color/identity drift
• Frame-Aware Corrupt: Adds noise/adjusts exposure to historical frames during training, boosting long-sequence robustness
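A rough sketch of two of these ideas, under loud assumptions: `build_context` keeps frame 0 as a permanent anchor next to the most recent frames, and `corrupt_frame` perturbs a history frame with a random exposure gain plus Gaussian noise, as a training-time augmentation. The function names, the window size, and the noise/gain ranges are all hypothetical; the post gives no concrete values.

```python
import random

def build_context(frames, window):
    """Keep the first frame as a global anchor plus the most recent frames."""
    if window < 2:
        raise ValueError("window must hold the anchor and at least one recent frame")
    if len(frames) <= window:
        return list(frames)
    return [frames[0]] + frames[-(window - 1):]

def corrupt_frame(frame, rng, noise_std=0.05, max_gain=0.2):
    """Training-time perturbation of one history frame (pixels in [0, 1])."""
    gain = 1.0 + rng.uniform(-max_gain, max_gain)     # exposure change
    return [min(1.0, max(0.0, gain * p + rng.gauss(0.0, noise_std)))
            for p in frame]

frame_ids = list(range(10))
print(build_context(frame_ids, window=4))             # [0, 7, 8, 9]

rng = random.Random(0)
print(corrupt_frame([0.0, 0.5, 1.0], rng))
```

The intent of the corruption step is that the model sees imperfect history during training, so its own small errors at inference time look familiar instead of compounding.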
⚡ Inference Acceleration: Deep Flow Compression (Key to 19.5 FPS)
1️⃣ Token Perspective: Spatiotemporal Extraction 🗜️
• Multi-Term Memory Patchification: Divides history into short/medium/long windows and compresses older history more heavily, keeping the input token count constant to avoid memory explosion
• Multi-Scale Denoising: Low-res for global structure early on, full-res for fine details later, doubling throughput without quality loss
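The constant-token claim for Multi-Term Memory Patchification can be checked with simple arithmetic. The window sizes, compression factors, and tokens-per-frame figure below are made-up numbers for illustration, not values from Helios, and the sketch assumes frames older than the last window are simply dropped.

```python
def history_tokens(num_history_frames, tokens_per_frame=1024,
                   windows=((2, 1), (8, 4), (32, 16))):
    """Count history tokens under tiered compression (hypothetical settings).

    windows: (frames in window, compression factor), newest window first.
    """
    remaining = num_history_frames
    total = 0
    for size, factor in windows:
        used = min(remaining, size)
        total += used * tokens_per_frame // factor
        remaining -= used
    return total  # anything older than the last window contributes nothing

print(history_tokens(5))      # 2816
print(history_tokens(42))     # 6144
print(history_tokens(1000))   # 6144, same budget however long the video runs
print(42 * 1024)              # 43008 tokens if the same 42 frames were kept uncompressed
```

With these toy settings each window contributes the same 2048 tokens, so the attention cost per generated segment stops growing once the history fills all three windows.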
2️⃣ Step Perspective: Extreme Compression 🚀
Combines Distribution Matching Distillation (DMD) with Adversarial Post-Training, compressing denoising from dozens of steps to 3 and breaking the quality ceiling of traditional distillation to reach single-card real-time speed.
🎯 Key Helios Breakthroughs
✅ Paradigm fusion: AR-Diffusion brought to "single-card real-time, infinite-length" engineering scale
✅ Clean history injection: Guidance Attention perfectly decouples history & current frame computation
✅ Lightweight anti-drift: 3 simple designs eliminate long-video error accumulation
✅ Constant token context: Avoids 3D attention memory explosion
✅ 3-step inference: Distillation plus adversarial post-training delivers both real-time speed and quality
🔮 Future Outlook
14B params are still unfriendly to consumer GPUs — looking forward to a lightweight version! Helios is a major breakthrough for real-time AI video, game AI, and world models. The era of real-time long video generation is here🌊
📖 Full article:
zhihu.com/question/201509245…
#AI #VideoGeneration #DiffusionModel #Helios #ByteDance #PKU