🚀 How well can LLMs get to know you and personalize their responses? Turns out, not so much!
Introducing the PersonaMem Benchmark --
👩🏻‍💻Evaluate LLMs' ability to understand evolving personas from 180 multi-session user-chatbot conversation histories
🎯Latest models (GPT-4.1, GPT-4.5, o4-mini, Llama-4, Gemini 2.0, DeepSeek-R1, Claude-3.7) all struggle with personalization!
🎨7 personalization skills tested in 15 scenarios
🌟Realistic long-context evaluation up to 1M tokens
👇 Check out what we discovered… (1/6)
Fig 1: Overview of PersonaMem benchmark. Each benchmark sample is a user persona with static (e.g., demographic info.) and dynamic attributes (e.g., evolving preferences). Users engage with a chatbot in multi-session interactions across a variety of topics such as food recommendation, travel planning, and therapy consultation. As the user’s preferences evolve over time, the benchmark offers annotated questions assessing whether models can track and incorporate the changes into their responses.
Fig 2: Model performances by number of sessions elapsed since most recent preferences were mentioned in long context. Top: up to 20 sessions/128k tokens; Bottom: up to 60 sessions/1M tokens. Long-context retrieval is important for personalization in practice.