13. VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Keywords: Long video understanding, multi-modal large language models, VideoDeepResearch, agentic systems, reasoning model
Category: Multi-Modal Learning
Research Objective:
To challenge the common assumption that long video understanding (LVU) requires multi-modal large language models (MLLMs) with extended context windows and specialized capabilities.
Research Methods:
Introduced VideoDeepResearch, a framework utilizing a text-only large reasoning model (LRM) paired with a modular multi-modal toolkit to approach LVU tasks through reasoning and selective content access.
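To make the idea of "reasoning plus selective content access" concrete, here is a minimal Python sketch of an agent loop. It is not the paper's implementation: the toy video, the `retrieve_clips` and `perceive` tools, and the rule-based `toy_reasoner` are all hypothetical stand-ins, and real frames are replaced by short text captions so the example runs without any model.

```python
import re
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Clip:
    start: int    # clip start, in seconds
    end: int      # clip end, in seconds
    caption: str  # stand-in for what a visual tool would report about the clip


# Toy "long video": three pre-captioned clips instead of real frames (assumption).
VIDEO = [
    Clip(0, 60, "a chef chops onions in a kitchen"),
    Clip(60, 120, "the chef fries the onions in a pan"),
    Clip(120, 180, "a dog runs across a park"),
]


def tokens(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_clips(query: str, video: List[Clip], top_k: int = 1) -> List[Clip]:
    """Hypothetical retriever tool: rank clips by word overlap with the query."""
    query_words = tokens(query)
    ranked = sorted(
        video,
        key=lambda clip: len(query_words & tokens(clip.caption)),
        reverse=True,
    )
    return ranked[:top_k]


def perceive(clip: Clip) -> str:
    """Hypothetical perception tool: turn one clip into a text observation."""
    return f"[{clip.start}-{clip.end}s] {clip.caption}"


def toy_reasoner(question: str, evidence: List[str]) -> Tuple[str, str]:
    """Stand-in for the text-only reasoning model: choose the next action.

    With no evidence it asks for retrieval; otherwise it answers with the
    most recent observation. A real system would prompt a language model here.
    """
    if not evidence:
        return ("retrieve", question)
    return ("answer", evidence[-1])


def run_agent(question: str, video: List[Clip], max_steps: int = 4) -> str:
    """Alternate between reasoning and tool calls until an answer is produced."""
    evidence: List[str] = []
    for _ in range(max_steps):
        action, argument = toy_reasoner(question, evidence)
        if action == "answer":
            return argument
        # Only the retrieved clips are inspected, never the whole video.
        for clip in retrieve_clips(argument, video):
            evidence.append(perceive(clip))
    return "no answer"


if __name__ == "__main__":
    print(run_agent("Which animal runs across a park?", VIDEO))
```

The point of the sketch is the division of labour: the reasoner only ever sees text, and the tools decide which small part of the video is turned into text for it.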
Research Conclusions:
- VideoDeepResearch outperformed existing MLLM baselines, with gains of 9.6%, 6.6%, and 3.9% on the MLVU, LVBench, and LongVideoBench benchmarks, respectively.
- The results indicate that agentic systems hold significant potential for addressing challenges in long video understanding.
Paper link:
huggingface.co/papers/2506.1…