We will present “Seeing without Pixels” sites.google.com/view/seeing… at ExHall A, Poster #248, Sunday (today) from 15:30 to 17:30 @CVPR !! This work just won a CVPR Compute Transparency Champion award! Come by and say hi! #CVPR2026 #CVPR
Excited to share our latest work! Grateful for the guidance from all my collaborators, and special thanks to Tengda for being such an amazing mentor during my internship @GoogleDeepMind 😊
Fun fact: the apartment we rented in Oxford Summertown from 2021-2023 has accommodated @elliottszwu (CVPR20 best paper awardee) and @jianyuan_wang (CVPR25 best paper awardee) as subletters. Now also @ChuhanZhang5 - new CVPR26 best paper awardee. Grateful to be at @Oxford_VGG !!
Huge congrats to the team! D4RT is a team effort, and all the authors have worked very hard on it over the past year. Very well deserved. 🍻 And thank you to the Award Committee members for the recognition.
Human perception is active: we move around to see, and we see with intention. In our latest work "Seeing without Pixels", we find "how you see" (how the camera moves) roughly reveals "what you do" or "what you observe" -- and this connection can be easily learned from data.
Humans learn from unique data -- everyone's OWN life -- but our visual representations eventually align. In our recent work "Unique Lives, Shared World" @GoogleDeepMind, we train models with "single-life" videos from distinct sources, and study their alignment and generalisation.
I’m looking for PhD students in Audio & Video for a Summer 2026 internship at Google DeepMind!
⚠️ Requirement: Prior publication in this area.
To apply, tell me the most critical research gap in AV understanding to see if we are a match! docs.google.com/forms/d/1qTv…
A SOTA model on 4D reconstruction from @GoogleDeepMind! Amazing work from @ChuhanZhang5 and the team! It was so satisfying to see these reconstruction results, and I've had a great experience using it.
A SINGLE encoder-decoder for all the 4D tasks!
We release 🎯 D4RT (Dynamic 4D Reconstruction and Tracking).
📍 A simple, unified interface for 3D tracking, depth, and pose
🌟 SOTA results on 4D reconstruction & tracking
🚀 Up to 100x faster pose estimation than prior works
Future AI models will learn predominantly post-deployment, to do the tasks of interest to each user. This will happen throughout an individual “life”. In a new paper arxiv.org/pdf/2512.04085 we lay the groundwork for this type of capability in the wild from a visual standpoint.
Human perception is active: we move around to see, and we see with intention. In our latest work "Seeing without Pixels", we find "how you see" (how the camera moves) roughly reveals "what you do" or "what you observe" -- and this connection can be easily learned from data.
Can you tell which action corresponds to which camera trajectory in the video above? Check out our paper for answers! Work done by our great intern Sherry Xue @sherryx90099597 at @GoogleDeepMind, and with Kristen Grauman, @dimadamen and Andrew Zisserman.
arxiv.org/abs/2511.21681
Animated movies are effortlessly understood by young minds but appear to be challenging for video-language models. Why? The key problem is the huge diversity of animated characters -- their appearance ranges from human-like faces to cars, fish, blobs, etc.
A belated post for our ACMMM paper: we recognize and track animated characters for movie understanding tasks. Great work from Zhongrui Gui, also with @JunyuXieArthur, @WeidiXie and Andrew Zisserman from @Oxford_VGG.
Project page with code and dataset: robots.ox.ac.uk/~vgg/researc…