Our paper “Learning Situated Awareness in the Real World” has been accepted to #ICML2026 as a Spotlight (top 2.2%)! Congrats to @_Chuhan_Li and the team!
We introduce SAW-Bench, a real-world egocentric benchmark for evaluating situated (first-person) spatial reasoning in multimodal models, uncovering a large gap between humans and current systems. All videos were recorded with Meta Ray-Ban 2 glasses!
Human perception is inherently situated – we understand the world relative to our own body, viewpoint, and motion.
To deploy multimodal foundation models (MFMs) in embodied settings, we ask:
“Can these models reason in the same observer-centric way?”
We study this through SAW-Bench, a novel benchmark for observer-centric situated awareness, comprising:
- 786 real-world egocentric videos
- 2,071 human-annotated QA pairs
Across all tasks, we evaluate 24 state-of-the-art MFMs:
📉 Best model: 53.9%
🧑 Humans: 91.6%
Models systematically:
❌ Confuse head rotation with physical movement
❌ Collapse under multi-turn trajectories
❌ Fail to maintain persistent world-state memory
👉 These failures suggest that maintaining a stable observer-centric representation remains challenging for current models.
As MFMs are increasingly integrated into embodied agents, situated awareness becomes essential for reliable real-world interaction.
We release SAW-Bench and encourage further research toward improving observer-centric reasoning in multimodal foundation models.