Computer Science PhD student @uwcse, Student Researcher @allen_ai

Joined September 2022
Mahtab Bigverdi retweeted
What if VLMs could imagine before answering? IPT supervises visual intermediate states for spatial reasoning:
1. Path tracing → side view
2. Perspective taking → new viewpoint
3. Multiview counting → top-down map
Paper: arxiv.org/abs/2606.03988
Mahtab Bigverdi retweeted
What if VLMs could imagine visually before answering spatial questions? New paper: Imaginative Perception Tokens (IPT) teach multimodal LMs to reason about hidden 3D structure — without generating images at inference time. Paper: arxiv.org/abs/2606.03988
Picture your living room. If you sat on the sofa, would the TV be on your right or left? You didn't reason in words, you placed yourself in the scene. Imagining in visual space, not text. Exactly what VLMs can't do. Our new paper tackles this with Imaginative Perception Tokens (IPT) 🧵
(8) Couldn't have done this without my amazing co-authors @LINJIEFUN & @weikaih04, my advisors Linda Shapiro & @RanjayKrishna, and all my collaborators. So grateful to have worked with this team.
Mahtab Bigverdi retweeted
Molmo2 poster session happening today! Stop by to chat about fully open models for video understanding and grounding 👋 📍 Sun Jun 7, 11:45 AM – 1:45 PM Poster #3
Stop by and chat with the team! 👋
Ablate to validate poster at @cvpr today in Exhibit Hall A #51
Introducing Ablate-to-Validate (ATV) 🔬: a new diagnostic framework for vision-language models with visual reasoning tokens. VLMs increasingly emit and consume latent tokens to reason, but are those tokens actually being used? Let's take a closer look. 🧵 [1/8]
Some really cool slides from the Bitter Lessons Workshop speakers at @CVPR.
"Bitter Lessons" workshop tomorrow (Jun 3rd) starting from 08:45 am in Room 3A-3D at #CVPR2026. Full schedule below 👇
8:45 - Opening Remarks
9:00 - Bill Freeman: Bitter & non-bitter lessons
9:20 - Alyosha Efros: TBD
9:40 - @georgiagkioxari: The Bittersweet Lessons of Recognition
10:00 - Panel: Alyosha, Bill, Georgia & @dimadamen (mod: @unnatjain2010)
☕ 10:20 - Coffee Break
10:30 - @jon_barron: How bitter-lesson'ed is all of 3D vision?
10:50 - @vincesitzmann: TBD
11:10 - @BharathHarihar3: Mid-level vision is dead; long live mid-level vision!
11:30 - Panel: Jon, Vincent, Bharath (mod: @anand_bhattad)
11:50 - @ShenlongWang: Modeling the World After the Bitter Lesson
12:10 pm - @DerekHoiem_UofI: Advice to research by, and the next bitter lesson
12:30 pm - Panel: Shenlong, Derek & @ev4n3sce (mod: Lana Lazebnik)
If you're at #CVPR, stop by the exhibit hall and check out Poster #150 at the MUSI Workshop! We'll be presenting our paper on Imaginative Perception Tokens. Project website: mahtabbigverdi.github.io/Ima…
After our Perception Tokens paper, we asked: are models truly reasoning over perception tokens or do they just benefit from extra reasoning budget? And why discrete tokens instead of richer continuous ones? With @JackZhang970191 we answer both in Ablate-to-Validate
Mahtab Bigverdi retweeted
This is such an amazing use case of Molmo2 and MolmoWeb! 💕 It showcases again vision is crucial to various applications, and it’s rewarding to see the community builds on top of their strong perceptual capabilities for such real-world applications: “Works chose MolmoWeb because it's purpose-built for visual pointing on web pages…PointCheck then sends the same region to Molmo with a direct query about what's on the screen…The result is that PointCheck can confirm focus indicators are visually present on screen–not just defined somewhere in a stylesheet.” I also remember @rockpang6 mentioned seeing the failure case of MolmoWeb on some “difficult” websites, and upon a closer look it turned out that some websites are not just inaccessible to MolmoWeb bc they’re OOD but they actually appear inaccessible to us human users too (eg having to click on a very tiny part of a UI element) — so it’s a very cool full circle moment to see Works actually built out this accessibility checker with MolmoWeb! 🫡
May 21
Brendan Works is a product manager focused on paratransit services in Seattle. See how he built PointCheck, a website accessibility checker powered by our open Molmo, MolmoWeb, & Olmo 3 models. 👇
Mahtab Bigverdi retweeted
Excited to release MolmoAct 2, a fully open robot foundation model for real-world deployment! 🤖 We're shipping the full stack, including
- Training data (MolmoAct2 Datasets)
- Action tokenizer (MolmoAct2-FAST)
- Architecture (action expert with per-layer KV conditioning)
- Embodied reasoning pretraining checkpoint (MolmoAct2-ER)
- Adaptive-depth reasoning checkpoint (MolmoAct2-Think)
- Multiple embodiment-specific checkpoints (DROID, Bimanual YAM, SO100/101)
It was great to be part of the incredible team at @allen_ai for making this happen! Check out the full thread for more 👇
May 5
Robotics models often struggle outside controlled environments. Ours is built to work in real ones. Today we're launching MolmoAct 2, which can assist with a host of chores & lab tasks, plus the MolmoAct 2-Bimanual YAM dataset—the largest open robotics dataset of its kind. 🧵
Mahtab Bigverdi retweeted
Remember action recognition? The days of trying to climb on Kinetics?👻 Announcing VideoNet, a CVPR 2026 Highlight 🎉 which revitalizes action recognition in the VLM era Explore our data with this fun, interactive demo: tanu.sh/videonet/data (1/8) 🧵