Joined July 2022
39 Photos and videos
Pinned Tweet
One memory can’t rule them all. We present LoGeR, a new hybrid memory architecture for long-context geometric reconstruction. LoGeR enables stable reconstruction over up to 10k frames / kilometer scale, with linear-time scaling in sequence length, fully feedforward inference, and no post-optimization. Yet it matches or surpasses strong optimization-based pipelines. (1/5) @GoogleDeepMind @Berkeley_AI
64
446
3,399
560,327
Junyi Zhang retweeted
New paper: AsymFlow🔥
JiT x0-prediction is not enough for pixel generation. Better keep velocity in a low-rank subspace:
- 1.57 FID on ImageNet (best pixel flow model)
- Finetunes FLUX.2 klein into pixel space, beats the original on HPSv3/DPG/GenEval (#1 overall on HPSv3)
1/7
20
55
282
54,298
Junyi Zhang retweeted
👀 Humans compare images by looking back and forth. Many open-weight VLMs encode each image independently, and defer comparison to the LM. We introduce SVE: Stateful Visual Encoders for Vision-Language Models, where the visual encoder itself becomes change-aware.
🌐 Project: statefulvisualencoders.githu…
📰 Paper: arxiv.org/abs/2606.04433
💻 Code: github.com/StatefulVisualEnc…
1/n
4
38
248
50,809
Just arrived in Denver for CVPR! I will be presenting LoGeR at e2e3d.github.io/poster.html at 5pm today and at 4dvisionworkshop.github.io at 4:30pm tomorrow (oral talk). Stop by if you are interested!
1
3
42
2,329
Pose is so essential for grounding humans in the 3D world -- and the same applies to MLLMs. Happy to share Cambrian-P, a year-long collaboration with NYU/FAIR. We introduced a simple pose token to MLLMs, and it just works!
Camera pose matters for video understanding! Today's MLLMs excel at recognizing activities, but still struggle with the underlying space and ego/object dynamics in video. We trace this gap to a missing piece: camera pose. Introducing Cambrian-P: a multimodal LLM natively grounded in camera pose. (1/n)
7
55
6,306
Junyi Zhang retweeted
Introducing VGGT-Ω: scaling feed-forward reconstruction across static and dynamic scenes, and studying whether the learned geometric representations transfer beyond reconstruction.
14
142
851
776,952
Junyi Zhang retweeted
Two months ago, I vaguely posted a number: 0.9 FID, one-step, pixel space. Now it is 0.75, and can be even lower. Many wonder how. I thought it might end as a small FID prank: simple and deliberate. It started with one question: can FID be optimized directly, and what does it reveal? Introducing FD-loss.
56
157
954
229,251
Junyi Zhang retweeted
🏆 Our VisGym just got the ✨best paper award✨ at the multimodal intelligence workshop in ICLR :)
4
7
66
4,543
Context distillation for geometric models, very cool idea
4D vision is particularly challenging for 3D foundation models due to the scarcity of 4D data. In SelfEvo, we ask: can a model learn purely from itself? It works remarkably well, even in scenarios where annotations are nearly impossible to obtain.
3
21
4,149
Junyi Zhang retweeted
Our paper was selected as an oral presentation in #CVPR2026
We present the SOTA feed-forward 3DGS pipeline Selfi, which was accepted to #CVPR2026.
Project Page: denghilbert.github.io/selfi
5
14
139
10,330
Junyi Zhang retweeted
What’s the right representation for a world model? 3D, pixels, or something else? Excited to release our new paper “Forecasting Motion in the Wild”, where we propose point tracks as tokens for generating complex non-rigid motion and behavior.
From @GoogleDeepmind @Berkeley_AI @TTIC_Connect
7
74
470
80,682
Junyi Zhang retweeted
Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability.
From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab
capgym.github.io 🧵
20
126
632
168,788
Junyi Zhang retweeted
Humans can see in high-res, high-FPS in real-time. Why can't VLMs? Introducing AutoGaze: ViTs/VLMs "gaze" only at key video regions! Up to 4-100x token savings, 19x speedup, and enables scaling to 4K-res 1K-frame videos.
📄 arxiv.org/abs/2603.12254
🌐 autogaze.github.io
🤗 huggingface.co/collections/b…
(1/n)🧵
47
203
1,577
158,575
Junyi Zhang retweeted
K-means is simple. Making it fast on GPUs isn’t. That’s why we built Flash-KMeans, an IO-aware implementation of exact k-means that rethinks the algorithm around modern GPU bottlenecks. By attacking the memory bottlenecks directly, Flash-KMeans achieves 30x speedup over cuML and 200x speedup over FAISS, with the same exact algorithm, just engineered for today’s hardware. At the million-scale, Flash-KMeans can complete a k-means iteration in milliseconds. A classic algorithm, redesigned for modern GPUs.
Paper: arxiv.org/abs/2603.09229
Code: github.com/svg-project/flash…
36
200
1,747
307,355
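For readers unfamiliar with what "a k-means iteration" involves, here is a minimal pure-Python sketch of one exact iteration in the standard Lloyd style (assign each point to its nearest centroid, then recompute the means). It is not the Flash-KMeans code and contains none of its GPU or IO-aware engineering; the function name `kmeans_step` and the toy data are hypothetical, chosen only for illustration.

```python
def kmeans_step(points, centroids):
    """One exact k-means iteration: nearest-centroid assignment, then mean update.

    Hypothetical helper for illustration only; not taken from Flash-KMeans.
    """
    k = len(centroids)
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    labels = []
    for p in points:
        # Squared Euclidean distance from this point to every centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        j = dists.index(min(dists))  # index of the nearest centroid
        labels.append(j)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # New centroid = mean of assigned points; an empty cluster keeps its old centroid.
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return labels, new_centroids


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans_step(points, [(0.0, 0.0), (10.0, 10.0)])
```

The assignment loop touches every point-centroid pair, which is why the distance computation and its memory traffic dominate the cost at scale.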
Junyi Zhang retweeted
We're very excited to present a new hybrid memory version of feed-forward geometric reconstruction! The core intuition is that our architectures should be designed with the type of training data we have available in mind. The result is very long (kilometer-scale) reconstruction!!
One memory can’t rule them all. We present LoGeR, a new hybrid memory architecture for long-context geometric reconstruction. LoGeR enables stable reconstruction over up to 10k frames / kilometer scale, with linear-time scaling in sequence length, fully feedforward inference, and no post-optimization. Yet it matches or surpasses strong optimization-based pipelines. (1/5) @GoogleDeepMind @Berkeley_AI
1
4
109
16,331
LoGeR breaks both walls with chunk-wise processing hybrid memory:
🔹 Local Memory (SWA): non-parametric, lossless sliding-window attention preserves high-fidelity adjacent alignment.
🔹 Global Memory (TTT): compressed fast weights propagate long-range structure and stabilize scale over kilometer-scale trajectories.
1
54
11,044
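As a rough illustration of the "Local Memory (SWA)" idea in the post above, here is a generic single-head sliding-window attention sketch in pure Python. It is not LoGeR's implementation: the causal window, the single head, the per-token (rather than chunk-wise) layout, and the name `sliding_window_attention` are all assumptions made for the example. The TTT global memory is not sketched here.

```python
import math


def sliding_window_attention(queries, keys, values, window):
    """Each position i attends only to positions max(0, i - window + 1) .. i.

    Hypothetical sketch: nothing is learned or compressed inside the window,
    the raw keys and values of recent positions are used directly.
    """
    outputs = []
    scale = 1.0 / math.sqrt(len(queries[0]))
    for i, q in enumerate(queries):
        start = max(0, i - window + 1)
        idx = range(start, i + 1)  # positions visible to query i
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in idx]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, idx)) / total
            for d in range(len(values[0]))
        ])
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
local = sliding_window_attention(tokens, tokens, tokens, window=2)
```

Because each query looks at no more than `window` positions, the cost grows linearly with sequence length, but anything outside the window is invisible, which is the gap a separate global memory would have to cover.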
Check out the project page for more details!
🌐 Webpage: loger-project.github.io/
📄 Paper: arxiv.org/abs/2603.03269
Yet another wonderful collaboration with this amazing team: @CharlesHerrman8* @JunhwaHur* @jesu9 @MingHsuanYang @forrestercole2 @trevordarrell @DeqingSun

4
8
131
10,876