𝐓𝐡𝐞 𝐁𝐢𝐭𝐭𝐞𝐫 𝐋𝐞𝐬𝐬𝐨𝐧 𝐨𝐟 𝐀𝐠𝐞𝐧𝐭𝐢𝐜 𝐌𝐞𝐦𝐨𝐫𝐲: memory should be a derived capability that exists because it makes an agent better at acting over time.
𝐖𝐨𝐫𝐥𝐝𝐌𝐞𝐦𝐀𝐫𝐞𝐧𝐚 is designed around this principle. Rather than evaluating memory as a storage problem, WorldMemArena evaluates memory through 𝐚𝐜𝐭𝐢𝐨𝐧–𝐰𝐨𝐫𝐥𝐝 𝐢𝐧𝐭𝐞𝐫𝐚𝐜𝐭𝐢𝐨𝐧, instrumenting the full write → maintain → retrieve → use lifecycle across 400 multimodal, multi-session tasks.
And it exposes the findings that should mark the end of the storage-centric era:
→ Storage ≠ use. Better memory storage and retrieval do not necessarily produce better task performance. Optimizing the component we designed does not optimize the capability we actually care about.
→ Harness-based memory performs best where memory is hardest. Agents that can write files, reorganize context, create artifacts, and interact with persistent environments adapt most effectively in long-horizon settings. They are costly and unstable today, which is exactly what many Bitter Lesson transitions look like before scaling and learning take over.
The deeper move is in what gets measured. Memory shouldn't get a score; it should be inferred from capability: how much does remembering improve performance over time.
WorldMemArena drags evaluation off the static object and into the action–world loop, the only place you can tell whether an agent has developed memory or is just simulating it convincingly.
🤔It is time to rethink how we evaluate agent memory
🌍 As agents become longer horizon and more autonomous, memory is no longer just a module for storing past chats.
🛠️ It determines how agents track changing worlds, learn from past actions, revise outdated information, and reuse experience for future decisions.
🔍 This raises three key questions:
Are human designed write store retrieve memory pipelines still the best choice?
If harnesses such as Codex, Claude Code, and OpenClaw already let agents observe, act, call tools, write files, and reorganize context, can memory also be managed by the harness itself?
Do current evaluations really cover how agent memory is used in realistic settings? Many benchmarks are still text centric or single modal, with limited pressure from screenshots, GUIs, tool feedback, and environment changes.
❓ Is final QA accuracy enough?
🔥 We present WorldMemArena, a multimodal benchmark for evaluating agent memory through action world interaction.
📌 Key insights:
🧩 Memory is a lifecycle, not a static cache.
📉 Better memory storage does not necessarily lead to better final performance.
🖼️ Multimodal memory remains a major bottleneck for current systems.
🌍 Real agentic trajectories expose the fragility of memory systems.
⚙️ Harness-based memory is more flexible, but still costly and unstable.