Tweets

Mahtab Bigverdi retweeted

Weikai Huang

@weikaih04

Jun 12

What if VLMs could imagine before answering? IPT supervises visual intermediate states for spatial reasoning:
1. Path tracing → side view
2. Perspective taking → new viewpoint
3. Multiview counting → top-down map
Paper: arxiv.org/abs/2606.03988

Mahtab Bigverdi retweeted

Weikai Huang

@weikaih04

Jun 10

What if VLMs could imagine visually before answering spatial questions? New paper: Imaginative Perception Tokens (IPT) teach multimodal LMs to reason about hidden 3D structure — without generating images at inference time. Paper: arxiv.org/abs/2606.03988

Mahtab Bigverdi @MahtabBg

Jun 9

Picture your living room. If you sat on the sofa, would the TV be on your right or left? You didn't reason in words, you placed yourself in the scene. Imagining in visual space, not text. Exactly what VLMs can't do. Our new paper tackles this with Imaginative Perception Tokens (IPT) 🧵

Mahtab Bigverdi @MahtabBg

Jun 9

(8) Couldn't have done this without my amazing co-authors @LINJIEFUN & @weikaih04, my advisors Linda Shapiro & @RanjayKrishna, and all my collaborators. So grateful to have worked with this team.

Mahtab Bigverdi @MahtabBg

Jun 9

(9) 🖇 Paper: arxiv.org/pdf/2606.03988 📊 Code/Data: mahtabbigverdi.github.io/Ima…

Mahtab Bigverdi retweeted

Jae Sung Park

@jjaesungpark

Jun 7

Molmo2 poster session happening today! Stop by to chat about fully open models for video understanding and grounding 👋 📍 Sun Jun 7, 11:45 AM – 1:45 PM Poster #3

Jae Sung Park

@jjaesungpark

Jun 3

Stop by and chat with the team! 👋

Mahtab Bigverdi @MahtabBg

Jun 4

Ablate-to-Validate poster at @cvpr today in Exhibit Hall A, #51

Jack Zhang @JackZhang970191

May 24

Introducing Ablate-to-Validate (ATV) 🔬: a new diagnostic framework for vision-language models with visual reasoning tokens. VLMs increasingly emit and consume latent tokens to reason, but are those tokens actually being used? Let's take a closer look. 🧵 [1/8]

Mahtab Bigverdi @MahtabBg

Jun 4

Some really cool slides from the Bitter Lessons Workshop speakers at @CVPR.

Anand Bhattad

@anand_bhattad

Jun 2

"Bitter Lessons" workshop tomorrow (Jun 3rd) starting from 08:45 am in Room 3A-3D at #CVPR2026. Full schedule below 👇
8:45 - Opening Remarks
9:00 - Bill Freeman: Bitter & non-bitter lessons
9:20 - Alyosha Efros: TBD
9:40 - @georgiagkioxari: The Bittersweet Lessons of Recognition
10:00 - Panel: Alyosha, Bill, Georgia & @dimadamen (mod: @unnatjain2010)
☕ 10:20 - Coffee Break
10:30 - @jon_barron: How bitter-lesson'ed is all of 3D vision?
10:50 - @vincesitzmann: TBD
11:10 - @BharathHarihar3: Mid-level vision is dead; long live mid-level vision!
11:30 - Panel: Jon, Vincent, Bharath (mod: @anand_bhattad)
11:50 - @ShenlongWang: Modeling the World After the Bitter Lesson
12:10 pm - @DerekHoiem_UofI: Advice to research by, and the next bitter lesson
12:30 pm - Panel: Shenlong, Derek & @ev4n3sce (mod: Lana Lazebnik)

Mahtab Bigverdi @MahtabBg

Jun 3

If you're at #CVPR, stop by the exhibit hall and check out Poster #150 at the MUSI Workshop! We'll be presenting our paper on Imaginative Perception Tokens. Project website: mahtabbigverdi.github.io/Ima…

Mahtab Bigverdi @MahtabBg

May 24

After our Perception Tokens paper, we asked: are models truly reasoning over perception tokens or do they just benefit from extra reasoning budget? And why discrete tokens instead of richer continuous ones? With @JackZhang970191 we answer both in Ablate-to-Validate

Jack Zhang @JackZhang970191

May 24

Mahtab Bigverdi retweeted

Jack Zhang @JackZhang970191

May 24

Mahtab Bigverdi retweeted

Zixian Ma@CVPR

@zixianma02

May 22

This is such an amazing use case of Molmo2 and MolmoWeb! 💕 It showcases again that vision is crucial to various applications, and it’s rewarding to see the community build on top of their strong perceptual capabilities for such real-world applications: “Works chose MolmoWeb because it's purpose-built for visual pointing on web pages…PointCheck then sends the same region to Molmo with a direct query about what's on the screen…The result is that PointCheck can confirm focus indicators are visually present on screen–not just defined somewhere in a stylesheet.” I also remember @rockpang6 mentioned seeing the failure case of MolmoWeb on some “difficult” websites, and upon a closer look it turned out that some websites are inaccessible to MolmoWeb not just because they’re OOD but because they actually appear inaccessible to us human users too (e.g., having to click on a very tiny part of a UI element), so it’s a very cool full-circle moment to see Works actually built out this accessibility checker with MolmoWeb! 🫡

Ai2

@allen_ai

May 21

Brendan Works is a product manager focused on paratransit services in Seattle. See how he built PointCheck, a website accessibility checker powered by our open Molmo, MolmoWeb, & Olmo 3 models. 👇

Mahtab Bigverdi retweeted

Jaemin Cho

@jmin__cho

May 5

Excited to release MolmoAct 2, a fully open robot foundation model for real-world deployment! 🤖 We're shipping the full stack, including:
- Training data (MolmoAct2 Datasets)
- Action tokenizer (MolmoAct2-FAST)
- Architecture (action expert with per-layer KV conditioning)
- Embodied reasoning pretraining checkpoint (MolmoAct2-ER)
- Adaptive-depth reasoning checkpoint (MolmoAct2-Think)
- Multiple embodiment-specific checkpoints (DROID, Bimanual YAM, SO100/101)
It was great to be part of the incredible team at @allen_ai that made this happen! Check out the full thread for more 👇

Ai2

@allen_ai

May 5

Robotics models often struggle outside controlled environments. Ours is built to work in real ones. Today we're launching MolmoAct 2, which can assist with a host of chores & lab tasks, plus the MolmoAct 2-Bimanual YAM dataset—the largest open robotics dataset of its kind. 🧵

Mahtab Bigverdi retweeted

Tanush

@tanushyy

May 6

Remember action recognition? The days of trying to climb on Kinetics? 👻 Announcing VideoNet, a CVPR 2026 Highlight 🎉 which revitalizes action recognition in the VLM era. Explore our data with this fun, interactive demo: tanu.sh/videonet/data (1/8) 🧵
