We all knew LLM agents struggle to explore, but we had to eyeball it. We couldn't measure exploration errors. Until now. 🗺️
We built a policy-agnostic metric to quantify exploration and exploitation errors in LLM agents.
Spoiler: Exploration error is what kills agent performance in our setting 🧵 (1/8)
The reversal curse. Edits that don't suppress negations. Multi-hop updates that don't propagate. These look like separate bugs.
Our ICML 2026 spotlight argues they may share a common geometric origin, visible only when you study how representations move under updates 🧵
(1/11)
Almost all "flagship" models are now MoEs.
But smaller models still prefer to be dense as they target memory-constrained scenarios where total params matter.
So we ask: Can we leverage an MoE to produce dense models without having to train them from scratch?
🧵
Lots of good news this week!
1. My internship project from @AdobeResearch has been accepted to #SIGGRAPH2026!
("MAOAM: Unified Object & Material Selection with Vision-Language Models")
Special thanks to my wonderful mentor @michi_fischer who has made this project possible!
2. Paper accepted to #ICML2026!
("DocHop: Benchmarking Out-of-domain Multi-hop Reasoning in Information-Dense Documents")
3. Paper accepted (with minor revisions) at #DMLR!
("Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models")
In both papers, we generate carefully designed benchmarks to tackle compositional/multi-hop reasoning in VLMs. Proud to have contributed to these projects.
More detailed posts soon :) Stay tuned!
I will be at #ICLR2026 to present my work on data contamination in VLMs! (Fri, Apr 24, 2026 • 8:30 AM – 11:00 AM, Pavilion 3 P3-917)
I am currently interested in VLA/physical AI, agents and robustness/generalization.
Would love to chat and connect with anyone with similar interests :)
Me: memorize past exams
Also me: fail on a slight tweak 🤦‍♂️🤦‍♂️
Turns out, we can use the same method to detect contaminated VLMs! 🧵 (1/10)
- Project Page: mm-semantic-perturbation.git…
Can we fix exploration failures in LM agents?
🗺️ Exploration Prompts: Explicitly injecting exploration strategies increases success rate by 17%.
Explicit Harness: Providing the agent with structured summaries of its past observations boosts success rate by 29.4%! 🧵 (7/8)
Excited to be back at @AdobeResearch this summer where I will be working with @Shramanpramani2 :)
Would love to connect with anyone who will be around!
Upgrade your frozen vision encoders with <10 lines of code!
Single-scale inference throws away vital details. Enter MuRF: a simple, training-free plug-in for instant, massive gains in MLLMs, Seg & Depth. 🤯 1/6
New work with @Meta @RealityLabs
We introduce EGAgent, an agentic reasoning framework for very long video understanding powered by entity scene graphs
Why? With long multimodal data streams, agents must search and reason across multiple modalities!
🧵 (1/n)
New paper out! Introducing STTS: Unified Spatio-Temporal Token Scoring for Efficient Video VLMs. We tackle the massive token bottleneck in video models by jointly identifying the tokens that actually matter. The overall figure below breaks down the core problem! 🧵
Hi ML Twitter!
My Summer 2026 internship unfortunately fell through last minute 😵‍💫
If your team is looking for interns, I'd love to connect - RTs appreciated
My website: aniketrege.github.io/
There should be a meta-conference where reviewers are Claude Code.
(1) Claude Code figures out how to run your code like TerminalBench.
(2) Claude Code tries to run your code for 48 hours.
If Claude Code can't beat your Table 1 in 2 days of vibe-research, it gets accepted.