Kwai Keye-VL-2.0: Open-Source MoE Model for Ultra-Long Video Understanding [LLMS] MoE model processes ultra-long videos. Why it matters: This model introduces a novel architectural approach to overcome computational and contextual challenges in processing hour-long videos. Its open-source nature and agentic intelligence capabilities could significantly advance multimodal AI applications, particularly in areas requiring deep temporal understanding. Follow DailyAIWire for the full brief. 🤔 How will the efficient processing of ultra-long video contexts reshape the development of autonomous AI agents? #MultimodalAI #MixtureOfExperts #LongVideoUnderstanding #OpenSourceAI #AgenticAI
🎥🧠 Video will be the next battlefield after text and voice in the large foundation model competition. Currently MLLM agents are improving at long video reasoning, yet they still perceive passively by processing hours of content without knowing what matters. We introduce Active Video Perception (AVP), a framework that treats the video as an interactive environment and actively acquires compact, query-relevant evidence. @SFResearch AVP runs a Plan-Observe-Reflect loop: a planner targets promising segments based on the query, an observer extracts timestamped visual evidence directly from specific video parts, and a reflector evaluates whether the current evidence already supports an answer. 📊 Results on Long Video Understanding ✅ 5.7 percentage point gain over the strongest agentic baseline, achieving best performance across five challenging LVU benchmarks (MINERVA, LVBench, Video-MME, MLVU, LongVideoBench). ✅ >5× faster inference speed, using only 12.4% of tokens and 18.4% of inference time through targeted evidence seeking instead of dense captioning. 📄 Paper: arxiv.org/abs/2512.05774 🌐 Project: activevideoperception.github… 🎥 YouTube (2-min video): youtu.be/15SxSE1A0Ow #LongVideoUnderstanding #VideoAgents #ComputerVision
8 Dec 2025
🚨 Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding 🚨 Introducing Active Video Perception: an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence. 🎬 Key Highlights: 🧠 Human-Inspired Active Perception: AVP mimics how humans watch video by first skimming for global context, then focusing on a few critical moments. It treats the video as an interactive environment. 🔄 Iterative Evidence Seeking: AVP runs a Plan–Observe–Reflect loop, dynamically querying video parts for fine-grained evidence and continually assessing whether it has enough information or needs to look deeper. 🚀 Efficiency Breakthrough: High accuracy meets low cost. AVP outperforms the best agentic approach by 5.7 percentage points in accuracy while using just 12.4% of tokens and 18.4% of inference time. How does AVP transform passive video processing into active, agentic exploration? Dive into the details below! 🧵
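The Plan–Observe–Reflect loop described above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the AVP implementation: the video is assumed to be a list of `(timestamp, caption)` pairs, and `plan`, `observe`, `reflect`, and `active_video_perception` are hypothetical names. The planner here simply walks fixed windows in order and the observer does keyword matching, where the real system would use an MLLM for both.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    timestamp_s: float
    note: str


def plan(duration_s, visited, window_s=60.0):
    """Hypothetical planner: return the next unvisited window, or None.
    A real planner would rank segments by relevance to the query."""
    t = 0.0
    while t < duration_s:
        segment = (t, min(t + window_s, duration_s))
        if segment not in visited:
            return segment
        t += window_s
    return None


def observe(video, segment, query):
    """Hypothetical observer: collect timestamped captions inside the
    segment that share at least one word with the query."""
    start, end = segment
    query_words = set(query.lower().split())
    return [
        Evidence(ts, text)
        for ts, text in video
        if start <= ts < end and query_words & set(text.lower().split())
    ]


def reflect(evidence, needed=1):
    """Hypothetical reflector: stop once enough evidence is collected."""
    return len(evidence) >= needed


def active_video_perception(video, duration_s, query, max_rounds=10):
    visited, evidence = set(), []
    for _ in range(max_rounds):
        segment = plan(duration_s, visited)
        if segment is None:
            break  # nothing left to inspect
        visited.add(segment)
        evidence.extend(observe(video, segment, query))
        if reflect(evidence):
            break  # evidence already supports an answer
    return evidence, len(visited)


# Toy 10-minute "video" represented as (timestamp in seconds, caption).
video = [
    (30.0, "crowd cheering"),
    (130.0, "goal scored by striker"),
    (400.0, "final whistle"),
]
evidence, windows_inspected = active_video_perception(video, 600.0, "goal")
print(evidence, windows_inspected)
```

In this toy run the loop stops after the third of ten windows, which is the point of the design: the reflector ends the search as soon as the evidence suffices, so the rest of the video is never processed.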
8 Dec 2025
Great collaboration with Honglu Zhou @zhou_honglu, Shijie Wang @ShijieWang20, Junnan Li @LiJunnan0409, Caiming Xiong @CaimingXiong, Silvio Savarese @silviocinguetta, Mohit Bansal @mohitban47, Michael S. Ryoo @ryoo_michael, Juan Carlos Niebles @jcniebles! 🤜🤛✨ @SFResearch @unc_ai_group @unccs Want to know more? 📄 Paper: arxiv.org/pdf/2512.05774 🌐 Project: activevideoperception.github… 🎥 YouTube (2-min video): youtu.be/15SxSE1A0Ow #LongVideoUnderstanding #VideoReasoning #ComputerVision

The Convergence of “Understanding × Generation” in Long Video: Attention Sink ✨🎬🧠 We recently open-sourced two works related to long videos: long-video understanding StreamingVLM (github.com/mit-han-lab/strea…) and long-video generation LongLive (github.com/NVlabs/LongLive). Both papers validated the effectiveness of Attention Sink (originating from StreamingLLM - arxiv.org/pdf/2309.17453) through experiments and adopted it as a core component. As a co-author on both works, I’d like to briefly introduce how Attention Sink is used in long-video understanding and generation, and how this differs from its usage in LLMs. 🔗📄
1. Why are long videos hard? 🤔⏳ Long video is an “ultra-long context” scenario. Whether for StreamingVLM’s understanding or LongLive’s generation, we deal with millions of tokens. Full attention makes computation explode: training and inference costs become prohibitive, and real-time/interactive use is essentially impossible. We therefore need an approach that preserves quality while remaining efficient. ⚖️⚡
2. What is Attention Sink? 🧲🧩 Attention Sink was first proposed in the LLM setting by StreamingLLM: insert a set of “anchor” tokens (sink tokens) early in the attention sequence and increase their salience (e.g., larger key norms or special embeddings) so that tokens at any later position can reliably attend back to these global-memory anchors. Combined with Window Attention, the model’s logits are less likely to collapse when prompts change, yielding more stable behavior; the extra overhead is negligible because the number of sink tokens is fixed. 🧮✅
3. On the “understanding” side: How does StreamingVLM use it? 🧐🎥 Attention Sink + Sliding Window. The sink serves as a global prior for long-video understanding, persistently retaining information that does not quickly become outdated (e.g., players in a sports broadcast), improving stability across shots. 📈
4. On the “generation” side: How does LongLive use it? 🎨⚙️ Attention Sink + Window Attention + KV-recache. The sink acts as a global prior in long-video generation, maintaining stylistic and narrative consistency during generation; KV-recache refreshes the cache at prompt switch points to ensure smooth transitions. 🔁🎞️
5. Same hammer, different nails 🔨🔩
• In long-video understanding, the sink functions like a retrieval prior, helping the model stay on the main storyline. 🧭
• In long-video generation, the sink acts like a visual metronome, keeping overall style from drifting. 🎼
6. How it differs from Attention Sink in StreamingLLM 🔍📚 In both long-video understanding and generation, the usage barrier is higher than in LLMs.
• On one hand, in LLMs it can be used inference-only without training; in StreamingVLM and LongLive, we need fine-tuning to adapt the model to this mechanism. 🛠️
• On the other hand, there are more sink tokens: for example, in LongLive we construct sink tokens in the first 3 frames, leading to more sinks than in StreamingLLM. 📦
One reason is that pure text models are trained on corpora with natural anchors like BOS, paragraph openings, and titles, so early-position signals are already strong in attention statistics. Video data lacks a stable “global-anchor paradigm” (frames are homogeneous streams and scenes vary widely), so injecting sinks at inference time can easily mismatch, hence the need for fine-tuning to “teach” the model how to use them. 🎯 #LongVideoGeneration #LongVideoUnderstanding #RealTimeGeneration #Multimodal
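The sink-plus-window idea in this post can be illustrated with a toy cache that keeps the first few entries permanently and only the most recent entries otherwise. This is a minimal sketch of the retention policy alone, assuming integer token ids stand in for key/value tensors; `SinkWindowCache` is a hypothetical name, and nothing here reflects the actual StreamingVLM or LongLive code, their fine-tuning, or KV-recache.

```python
from collections import deque


class SinkWindowCache:
    """Toy KV cache: the first `num_sink` entries are kept forever,
    plus a sliding window over the most recent `window` entries."""

    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.sink = []                      # permanent anchor entries
        self.recent = deque(maxlen=window)  # oldest entry dropped when full

    def append(self, kv):
        if len(self.sink) < self.num_sink:
            self.sink.append(kv)
        else:
            self.recent.append(kv)

    def visible(self):
        """Entries a new token would be allowed to attend to."""
        return self.sink + list(self.recent)


cache = SinkWindowCache(num_sink=2, window=3)
for token_id in range(10):  # integers stand in for key/value tensors
    cache.append(token_id)

kept = cache.visible()
print(kept)  # [0, 1, 7, 8, 9]
```

The cache never holds more than `num_sink + window` entries regardless of stream length, which is why the cost of the sink stays fixed as the video grows.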
📢☑️Video-RAG: Training-Free Retrieval for Long-Video LVLMs In this week’s deep dive, we implement Video-RAG as a training-free, single-pass pipeline and integrate it with LLaVA-Video-7B (Qwen2, 32K context), without APE, to keep things reproducible on today’s stacks. We enable OCR + ASR retrieval using EasyOCR + Whisper with Contriever + FAISS for search, and show how to tune thresholds, frame counts, and context to get reliable gains on narrative/identity questions. 👉🏼What’s Covered? ☑️The Video-RAG intuition: replacing excess frames with visually aligned text ☑️Building OCR/ASR stores (EasyOCR, Whisper) and indexing with Contriever + FAISS ☑️Integration with LLaVA-Video-7B ☑️No-APE implementation: why APE is disabled ☑️Step-by-step setup + modified files (Download → replace → run) for fast reproducibility ☑️Compute & storage planning: model sizes, disk budgets, and running on RunPod / Lightning Studios (Python 3.10) ☑️Troubleshooting the usual suspects This blog post is a practical, open-source path to long-video comprehension (no fine-tuning, no proprietary agents), focused on OCR + ASR Video-RAG you can deploy today. 👉🏼Read More: learnopencv.com/video-rag-fo… #VideoRAG #LVLM #RetrievalAugmentedGeneration #LongVideoUnderstanding #OCR #ASR #LLaVA #Qwen2 #FAISS #Whisper #CLIP #SigLIP #ComputerVision #MultimodalAI
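The retrieval step, where a similarity threshold decides which OCR/ASR snippets reach the model's context, can be sketched as follows. This is a hedged toy, not the blog's pipeline: plain-Python cosine similarity over hand-written 3-dimensional vectors stands in for Contriever embeddings and a FAISS index, and the names `cosine`, `retrieve`, and `store` as well as the snippet texts are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, store, threshold=0.5, top_k=2):
    """Return up to `top_k` snippets whose similarity to the query is at
    least `threshold`, most similar first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    hits = [(score, text) for score, text in scored if score >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in hits[:top_k]]


# Toy store of (snippet, embedding); the vectors are invented stand-ins
# for what a text encoder would produce.
store = [
    ("ASR 00:12 'welcome back to the final'", [0.9, 0.1, 0.0]),
    ("OCR 00:15 'SCORE 2-1'", [0.1, 0.9, 0.1]),
    ("ASR 03:40 'thanks for watching'", [0.0, 0.1, 0.9]),
]
query_vec = [0.2, 0.95, 0.0]  # pretend embedding of "what is the score?"

context = retrieve(query_vec, store)
print(context)
```

Raising `threshold` trades recall for a shorter, cleaner context, and `top_k` caps how much retrieved text competes with video frames for the context window; these are the two knobs the post refers to when it mentions tuning thresholds and context.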
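jiqizhixin.com/articles/2025… - HKU and Baidu have jointly developed VideoRAG, an engine for understanding extremely long videos. - A single RTX 3090 can process hundreds of hours of video. - VideoRAG builds a multimodal knowledge indexing framework that condenses videos into a structured knowledge graph. - The team built the LongerVideos benchmark dataset, which contains more than 160 videos; the longest is a game walkthrough of Black Myth: Wukong (about 21 hours). - VideoRAG outperforms other existing models across multiple evaluations. #VideoRAG #LongVideoUnderstanding #AI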
