🚨 Check out MINTEval, a new *memory interference* benchmark to stress-test agentic memory systems on:
• frequent & interfering context changes (avg. 86 updates)
• long horizons (avg. 138.8k-token contexts, up to 1.8M)
• 5 challenging question types (incl. long-range recovery, multi-target reasoning)
• 4 realistic domains (state tracking, multi-turn dialogue, Wikipedia revisions, code commits)
Across 7 representative systems (Full Context, RAG-based, and Memory-Augmented Agents), the best performance is only 33.4%!
Other interesting findings:
• Memory construction failures are a major bottleneck
• Memory agents are highly sensitive to design choices
• Systems strongly favor insertion over deletion/update operations
🧵
LLM agents & memory systems operate in continuously updated environments (Git repos, evolving docs). They must process long contexts, recover earlier information, and reason over many updates that create interference between old and new information. How well do they handle this?
We introduce MINTEval:
✅ Frequent context changes & interference (avg. 86 updates)
✅ 5 challenging question types, including long-range lookback & reasoning over multiple targets distributed across context
✅ 4 realistic domains: state tracking, multi-turn dialogue, Wikipedia revisions, GitHub commits
✅ Avg. 138.8k tokens per instance (up to 1.8M)
✅ Human verification on generated QAs = 95.6%
Across 7 representative systems, MINTEval remains difficult: average accuracy is 27.9%, and the best system reaches only 33.4%.
Our analysis shows:
• Memory construction failures cause a 41.7% drop
• Memory agents are highly sensitive to design choices
• Memory systems have a strong bias toward insertion operations (76.8%) over deletion/update