Aaron Huang

@AaronWeiHuang

May 26

🚀 LongLive 2.0 gets another speed boost! We further optimized the NVFP4 inference path, improving overall throughput by 18.6%.

🎬 A 64s video now takes just 30.6s end-to-end, including VAE decoding.
⚡ That’s over 2x real-time generation.

🔧 Highlights:
🧩 Fused Triton kernels
⚙️ In-place quantized KV-cache updates
⚡ Faster FP4 KV dequantization
📌 Pinned VAE transfers
🛡️ Safer LoRA-before-quantization setup

🔗 Code: github.com/NVlabs/LongLive

#LongVideoGeneration #VideoGeneration #Realtime #AIInfra #EfficientAI #FP4 #NVIDIA
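The "over 2x real-time" figure follows directly from the two numbers in the post. A quick check is sketched below. The "implied previous time" line is only an estimate: it assumes the 18.6% throughput gain applies uniformly to end-to-end wall-clock time, which the post does not state. The variable names are illustrative.

```python
# Numbers quoted in the post.
video_seconds = 64.0      # length of the generated video
wall_seconds = 30.6       # end-to-end generation time, including VAE decoding
throughput_gain = 0.186   # reported overall throughput improvement

# Real-time factor: seconds of video produced per second of wall-clock time.
realtime_factor = video_seconds / wall_seconds

# ASSUMPTION: the throughput gain applies uniformly to end-to-end time.
# Under that assumption, the same video would previously have taken:
implied_previous_seconds = wall_seconds * (1.0 + throughput_gain)

print(f"real-time factor: {realtime_factor:.2f}x")
print(f"implied previous time: {implied_previous_seconds:.1f}s")
```

A factor above 1.0 means video is produced faster than it plays back.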

156

Yukang Chen

@yukangchen_

May 25

🚀 LongLive 2.0 just got faster! Since last week’s release, we further optimized the NVFP4 inference path and improved the overall throughput by 18.6%.

🔥 Now, generating a 64s video takes only 30.6s end-to-end, including VAE decoding.
⚡⚡ That’s over 2× real-time generation.

🛠️ What changed under the hood?
• Fused Triton RoPE / adaLN kernels
• Reduced KV-cache synchronization overhead
• In-place quantized KV-cache updates
• Faster FP4 KV dequantization
• Pinned VAE transfers
• Safer LoRA-before-quantization setup

🎬 LongLive 2.0 is our open-source 4-bit long-video generation infra for both training and inference.

🚀 We are continuing to push long-video generation toward faster, lighter, and more practical deployment.

🔗 Code: github.com/NVlabs/LongLive

#LongVideoGeneration #VideoGeneration #Realtime #AIInfra #EfficientAI #FP4 #Parallel #NVIDIA
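For readers unfamiliar with the quantized KV-cache items in the list above, here is a minimal pure-Python sketch of block-wise 4-bit (E2M1) quantization with an in-place cache write. It is not LongLive's code. The names `quantize_block`, `dequantize_block`, and `QuantKVCache` are hypothetical. The 16-element block size follows the published NVFP4 format, but as a simplification the per-block scale is a plain Python float rather than an FP8 value, and codes are stored one per byte instead of packed two per byte.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # elements sharing one scale (simplified NVFP4-style micro-block)


def quantize_block(values):
    """Return (codes, scale): 4-bit codes (sign bit + 3-bit grid index) and a float scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest magnitude to 6.0
    codes = []
    for v in values:
        mag = abs(v) / scale
        idx = min(range(8), key=lambda i: abs(FP4_E2M1_GRID[i] - mag))  # nearest grid point
        codes.append((8 if v < 0 else 0) | idx)
    return codes, scale


def dequantize_block(codes, scale):
    """Inverse of quantize_block, up to rounding error."""
    return [(-1.0 if c & 8 else 1.0) * FP4_E2M1_GRID[c & 7] * scale for c in codes]


class QuantKVCache:
    """Preallocated cache of FP4 codes; write() overwrites a slot in place."""

    def __init__(self, num_slots, dim):
        assert dim % BLOCK == 0, "dim must be a multiple of the block size"
        self.dim = dim
        self.codes = bytearray(num_slots * dim)            # allocated once, never resized
        self.scales = [1.0] * (num_slots * dim // BLOCK)   # one scale per block

    def write(self, slot, vector):
        assert len(vector) == self.dim
        for b in range(self.dim // BLOCK):
            codes, scale = quantize_block(vector[b * BLOCK:(b + 1) * BLOCK])
            start = slot * self.dim + b * BLOCK
            self.codes[start:start + BLOCK] = bytes(codes)  # same-length slice: in-place
            self.scales[start // BLOCK] = scale

    def read(self, slot):
        out = []
        for b in range(self.dim // BLOCK):
            start = slot * self.dim + b * BLOCK
            out.extend(dequantize_block(self.codes[start:start + BLOCK],
                                        self.scales[start // BLOCK]))
        return out


cache = QuantKVCache(num_slots=4, dim=32)
vector = [0.1 * i - 1.5 for i in range(32)]
cache.write(2, vector)
restored = cache.read(2)
```

Because the buffer is allocated once and each write touches only its own slot, an update needs no reallocation or copy of the rest of the cache. That is the general idea behind an in-place quantized update, though a real GPU implementation would do this inside fused kernels.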

3,013

Yukang Chen

@yukangchen_

May 19

🚀 Excited to release LongLive 2.0!

🎬 An end-to-end infrastructure for long video generation, with FP4 and parallelism at the core of both training and inference.

⚡ 45.7 FPS generation speed on 5B model ⚡

✨ LongLive 2.0 supports real-video training, few-step distillation, multi-shot training/inference, sequence-parallel acceleration, NVFP4 KV cache, and async VAE decoding deployment.

🧩 To our knowledge, this is the first open-source 4-bit long video generation infra that covers both training and inference.

🙌 Welcome to check it out, try it, and share feedback!

🔗 Code: github.com/NVlabs/LongLive
📰 Paper: huggingface.co/papers/2605.1…
🎥 Demo: nvlabs.github.io/LongLive/Lo…

#LongVideoGeneration #VideoGeneration #Realtime #AIInfra #EfficientAI #FP4 #Parallel #NVIDIA

237

57,218

Yukang Chen

@yukangchen_

16 Oct 2025

The Convergence of “Understanding × Generation” in Long Video: Attention Sink ✨🎬🧠

We recently open-sourced two works related to long videos: long-video understanding StreamingVLM (github.com/mit-han-lab/strea…) and long-video generation LongLive (github.com/NVlabs/LongLive). Both papers validated the effectiveness of Attention Sink (originating from StreamingLLM - arxiv.org/pdf/2309.17453) through experiments and adopted it as a core component. As a co-author on both works, I’d like to briefly introduce how Attention Sink is used in long-video understanding and generation, and how this differs from its usage in LLMs. 🔗📄

1. Why are long videos hard? 🤔⏳
Long video is an “ultra-long context” scenario. Whether for StreamingVLM’s understanding or LongLive’s generation, we deal with millions of tokens. Full attention makes computation explode: training and inference costs become prohibitive, and real-time/interactive use is essentially impossible. We therefore need an approach that preserves quality while remaining efficient. ⚖️⚡

2. What is Attention Sink? 🧲🧩
Attention Sink was first proposed in the LLM setting by StreamingLLM: insert a set of “anchor” tokens (sink tokens) early in the attention sequence and increase their salience (e.g., larger key norms or special embeddings) so that tokens at any later position can reliably attend back to these global-memory anchors. Combined with Window Attention, the model’s logits are less likely to collapse when prompts change, yielding more stable behavior. The extra overhead is negligible because the number of sink tokens is fixed. 🧮✅

3. On the “understanding” side: How does StreamingVLM use it? 🧐🎥
Attention Sink + Sliding Window. The sink serves as a global prior for long-video understanding, persistently retaining information that does not quickly become outdated (e.g., players in a sports broadcast), improving stability across shots. 📈

4. On the “generation” side: How does LongLive use it? 🎨⚙️
Attention Sink + Window Attention + KV-recache. The sink acts as a global prior in long-video generation, maintaining stylistic and narrative consistency during generation; KV-recache refreshes the cache at prompt switch points to ensure smooth transitions. 🔁🎞️

5. Same hammer, different nails 🔨🔩
• In long-video understanding, the sink functions like a retrieval prior, helping the model stay on the main storyline. 🧭
• In long-video generation, the sink acts like a visual metronome, keeping overall style from drifting. 🎼

6. How it differs from Attention Sink in StreamingLLM 🔍📚
In both long-video understanding and generation, the usage barrier is higher than in LLMs.
• On one hand, in LLMs it can be used inference-only without training; in StreamingVLM and LongLive, we need fine-tuning to adapt the model to this mechanism. 🛠️
• On the other hand, there are more sink tokens: for example, in LongLive we construct sink tokens in the first 3 frames, leading to more sinks than in StreamingLLM. 📦

One reason is that pure text models are trained on corpora with natural anchors like BOS, paragraph openings, and titles, so early-position signals are already strong in attention statistics. Video data lacks a stable “global-anchor paradigm” (frames are homogeneous streams and scenes vary widely), so injecting sinks at inference time can easily mismatch. Hence the need for fine-tuning to “teach” the model how to use them. 🎯

#LongVideoGeneration #LongVideoUnderstanding #RealTimeGeneration #Multimodal
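The "sink plus window" cache policy described in the thread can be sketched in a few lines of Python. This is an illustrative toy, not the StreamingLLM, StreamingVLM, or LongLive implementation. The class name `SinkWindowCache` and its parameters are hypothetical, and cache entries are plain integers standing in for per-token (or per-frame) key/value tensors. KV-recache and the fine-tuning step are not modeled.

```python
from collections import deque


class SinkWindowCache:
    """Keep the first `num_sink` entries forever, plus the most recent `window` entries."""

    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.sink = []                       # anchors: never evicted
        self.recent = deque(maxlen=window)   # sliding window: oldest entry drops automatically

    def append(self, entry):
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)
        else:
            self.recent.append(entry)

    def visible(self):
        """Entries the next token may attend to: sinks first, then the recent window."""
        return self.sink + list(self.recent)


cache = SinkWindowCache(num_sink=3, window=4)
for step in range(10):
    cache.append(step)

print(cache.visible())  # [0, 1, 2, 6, 7, 8, 9]
```

The cache size is bounded by `num_sink + window` no matter how long the stream runs, which is why the overhead of the sinks stays fixed.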

GitHub - mit-han-lab/streaming-vlm: StreamingVLM: Real-Time Understanding for Infinite Video Streams

11,685