✨ Introducing a new #SOTA action recognition large multimodal language model: #LLaVAction!
Understanding human behavior requires recognizing actions, a challenging task given the complexity of behavior. Large multimodal language models (#MLLMs) offer a promising path forward, but how well do they perform in action recognition?
In our latest work - by @shaokaiyeah, Haozhe Qi, @TrackingPlumes and me | @EPFL_en - we rigorously evaluate and enhance MLLMs for action recognition in a challenging real-world setting: egocentric views in the kitchen! 🧑‍🍳🔪🧽🤖
👀 We find that a multi-question-answer (#MQA) task is a valuable intermediate step in training (and evaluating) MLLMs for action understanding. Namely, we introduce EPIC-KITCHENS-100-MQA, a reformulation of the highly challenging EPIC-KITCHENS-100 dataset into a video multiple-choice question-answering task, which allows for rigorous benchmarking of MLLMs on action recognition.
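For a feel of what such a reformulation looks like, here is a minimal sketch of turning one labeled clip into a multiple-choice question. Everything in it is an assumption for illustration: the function name `build_mqa_item`, the prompt wording, the toy action vocabulary, and above all the random choice of distractors. The actual EPIC-KITCHENS-100-MQA construction may differ; see the paper for details.

```python
import random

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def build_mqa_item(true_action, action_vocab, num_choices=5, seed=0):
    """Turn one labeled clip into a multiple-choice question (hypothetical format)."""
    rng = random.Random(seed)
    # Candidate wrong answers: every known action except the correct one.
    distractors = [a for a in action_vocab if a != true_action]
    if len(distractors) < num_choices - 1:
        raise ValueError("not enough distinct actions to fill the options")
    # Assumption: distractors are drawn uniformly at random.
    options = rng.sample(distractors, num_choices - 1) + [true_action]
    rng.shuffle(options)
    lines = [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    prompt = "What action is the person performing?\n" + "\n".join(lines)
    answer = LETTERS[options.index(true_action)]
    return {"prompt": prompt, "options": options, "answer": answer}

# Toy vocabulary, made up for this example.
vocab = ["cut onion", "wash plate", "open fridge",
         "pour water", "stir pan", "close drawer"]
item = build_mqa_item("cut onion", vocab, num_choices=5, seed=0)
print(item["prompt"])
print("Correct option:", item["answer"])
```

A model is then scored on whether it picks the letter stored in `item["answer"]`.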
Next, we propose methods that substantially improve MLLM performance, even achieving state-of-the-art results 🏆 (#SOTA) on the EPIC-KITCHENS-100 validation set 🔥✨.
Our approach also outperforms GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA and demonstrates improvements across other action-related video benchmarks, including #VideoMME, #PerceptionTest, and #MVBench.
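To make "points in accuracy" concrete, here is a small sketch of how a multiple-choice accuracy gap can be computed. The answer letters and model outputs below are made-up toy data, not results from the paper, and the `accuracy` helper is a hypothetical name, not part of any released code.

```python
def accuracy(predictions, answers):
    """Percentage of questions answered with the correct option letter."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

# Toy data for illustration only (not results from the paper).
answers = ["A", "C", "B", "D", "E"]
model_a = ["A", "C", "B", "D", "A"]  # 4 of 5 correct
model_b = ["A", "C", "E", "A", "A"]  # 2 of 5 correct

# A gap "in points" is the difference between two accuracy percentages.
gap_in_points = accuracy(model_a, answers) - accuracy(model_b, answers)
print(f"Gap: {gap_in_points:.1f} points")
```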
Our #LLaVAction-7B and -0.5B models can do #MQA and, critically, video captioning! 🙏🚀
As MLLMs become central to AI-driven video understanding, ensuring their robustness in real-world tasks is critical. Excited to push the boundaries of multimodal AI further! 💪
🇨🇭 We could not have done this without the amazing support of #SwissAI: the Swiss AI Initiative & the Swiss National Supercomputing Centre (#CSCS). @EPFL_AI_Center
#ProjectPage: mmathislab.github.io/llavact…
📝 #arXivPaper: arxiv.org/abs/2503.18712
👩‍💻💻 GitHub code & Google #ColabDemo: github.com/AdaptiveMotorCont…
🤗 Hugging Face models (use with transformers): huggingface.co/MLAdaptiveInt…
#AI #MultimodalLearning #ActionRecognition #EPICKITCHENS100 #MLLM #LLaVAction #VideoCaptioning #VLMs