Research Scientist working on RL environments and evals @PatronusAI | ex-Research @USC_ISI

Joined July 2020
Pinned Tweet
RL coding agents increasingly game rewards by exploiting their semantic and syntactic weaknesses. Can LLMs detect such behaviors from live training rollouts? We find contrastive cluster analysis is key! 🚀 GPT-5.2 jumps from 45% to 63%. Humans reach 90%. Paper and data 🧵
Regardless of how much I support this, I don't believe that the @arxiv moderation team has enough capacity or incentive to evaluate these AI-flagged papers quickly. My recent non-AI-generated submission has been "on hold" and awaiting a moderator response for over 2 weeks now!
Attention @arxiv authors: Our Code of Conduct states that by signing your name as an author of a paper, each author takes full responsibility for all its contents, irrespective of how the contents were generated. 1/
Multi-turn research benchmarks are hard to build because trajectories and tool usage are non-deterministic. We make this process more deterministic by introducing a memory-grounded benchmark of tip-of-tongue, multi-turn, multimodal queries for evaluating VLMs. Check out DETOUR!
Spotlighting our newest benchmark for agentic search: DETOUR. When people try to recall something in conversation, they rarely give a perfect query upfront. They say things like “that movie with the scene where…” or “the paper about…” and the assistant has to ask the right follow-up questions to get there. Existing search and agent benchmarks often miss this multi-turn, tip-of-the-tongue behavior. To more realistically evaluate it, we introduce DETOUR: Dual-agent based Evaluation Through Obscure Under-specified Retrieval, an interactive benchmark for dual-agent search and reasoning. DETOUR contains 1,011 prompts across text, image, audio, and video. In the benchmark, a Primary Agent is evaluated on its ability to identify a target entity by querying a consistent Memory Agent, testing whether models can resolve ambiguity through useful follow-up questions. Current state-of-the-art models still struggle: performance reaches only 36% accuracy across all modalities, showing that today’s agents remain weak at clarification-seeking in underspecified, real-world search settings. We hope DETOUR helps push the next generation of search agents toward better reasoning, better questions, and more robust multi-turn retrieval. arXiv Paper: arxiv.org/abs/2602.00352
TRACE made it into ICML 2026 Main Track! Hopefully inspiring more research in this crucial space 🥳
RL coding agents increasingly game rewards by exploiting their semantic and syntactic weaknesses. Can LLMs detect such behaviors from live training rollouts? We find contrastive cluster analysis is key! 🚀 GPT-5.2 jumps from 45% to 63%. Humans reach 90%. Paper and data 🧵
Where do models fail? 🤔 - Semantic reward hacks are harder to detect than syntactic hacks! - Models consistently show similar failures. QA reveals: ✅ Grounding and exploring consequences helps ❌ Over-reliance on user acceptance or self-awareness patterns hurts performance
Darshan Deshpande retweeted
👋 Folks at #NEURIPS2025, come check out & stop by the poster of our Memtrack env at the SEA workshop happening at Upper Level 23ABC, 3:50pm onwards. Our env studies how well an agent dropped into a workplace can context engineer by composing tool calls to access intertwined slack, linear & git timelines in pursuit of answering a battery of related questions. Full paper arxiv: arxiv.org/abs/2510.01353
🚨 We will be presenting Memtrack today at the SEA workshop from 3:50pm onwards at #NeurIPS2025. Memtrack is a SoTA eval env to study an agent's ability to memorize and retrieve facts using exploration over interleaved enterprise slack, linear and git threads in a multi-QA setting
This bounty program takes benchmark datasets that restrict training and turns them into RL environments that can be trained on using Prime's "open source" training services. This is a scammy practice under the name of open science!
27 Oct 2025
if you or a loved one is looking to learn about building environments and get a bag in the process, inquire within. our bounty list is bigger and better than ever
Excited to have contributed to OpenEnv before its release today! Thanks to @Meta and @huggingface for working towards standardizing RL environment creation!
23 Oct 2025
We’re excited to support @Meta and @huggingface's OpenEnv launch today! OpenEnv provides an open-source framework for building and interacting with agentic execution environments. This allows researchers and developers to create isolated, secure, deployable, and usable environments. Lately, at Patronus, we’ve been working on RL environments for coding agents, and we were excited to contribute to OpenEnv with real-world-inspired tools and tasks to train and steer AGI. We began with a Gitea-based git server environment. Git server environments are foundational and enable effective collaboration and version control for software workflows, and we thought it would be a perfect way to get started with OpenEnv. With our git server environment, we support: * Fast iteration across runs with sub-second resets for RL training loops * Shared server isolated workspaces * Environment variables setting custom configs for Gitea. We look forward to seeing what everyone builds with OpenEnv! GitHub: github.com/meta-pytorch/Open… HuggingFace: huggingface.co/openenv
Darshan Deshpande retweeted
6 Aug 2025
Thank you, @BerkeleyRDI, for hosting the Agentic AI Summit and having us! @getdarshan, one of our research scientists, who leads agent evaluation here at Patronus, presented at the summit! Here are a few takeaways: * Given context explosion and increasing domain depth and specificity, we are approaching a new age of contextual benchmarks. * As the AI we work with becomes exponentially better, nuances become more important, as does the creation of grounded feedback-driven environments. * Agent evaluation and explainable AI go hand-in-hand. Explainable agents are optimal for understanding agent workflows, fixing errors, and improving trajectories. * Our team has seen success in developing hard, domain-specific, and novel benchmarks to rigorously evaluate AI performance. You can read more about Darshan’s recent work here: * TRAIL: A Benchmark for Agentic Evaluation patronus.ai/blog/introducing… * BLUR: A Benchmark for Tip-of-the-Tongue Search and Reasoning patronus.ai/blog/the-blur-be… * GLIDER: SoTA SLM Judge patronus.ai/blog/glider-stat… Reach out if you’re interested in chatting more about agent evals and how we can collaborate! #BerkeleyRDI #AgenticAISummit
Non-deterministic trajectories need autonomous supervision. Introducing Percival, a SoTA system to detect issues with long context agentic problems and suggest fixes to systems. The time to make a move towards autonomous evaluations is now! 🔥
14 May 2025
1/ 🔥🔥 Big news: We’re launching Percival, the first AI agent that can evaluate and fix other AI agents! 🤖 Percival is an evaluation agent that doesn’t just detect failures in agent traces; it can fix them. Percival outperformed SOTA LLMs by 2.9x on the TRAIL dataset, containing human annotated errors from GAIA and SWE-Bench. 🦾 Here’s what Percival can do for you: - Automatically suggest prompt fixes for your agent - Catch 20 types of agent failures spanning tool use, planning and coordination, domain specific errors - Reduce manual debugging time from hours to < 1 minute
Darshan Deshpande retweeted
2 Apr 2025
We're excited to introduce the BLUR Leaderboard on @huggingface 🔥 Earlier today, we open sourced BLUR: the first agent benchmark for tip-of-the-tongue search and reasoning. It measures how effectively agents can help you identify something you vaguely remember, but can’t quite name. Check out the leaderboard on @huggingface to see how SOTA systems perform on BLUR! Most systems score below 50% 😲 huggingface.co/spaces/Patron…
Darshan Deshpande retweeted
2 Apr 2025
1/ Ever tried to remember the name of a movie you’ve seen – you can picture the scenes clearly, but the movie name won’t come to you? Introducing BLUR: the first agent benchmark for tip-of-the-tongue search and reasoning 🔥 We benchmarked SOTA agents and found that the best-performing agent only scored 56% on BLUR, while humans scored nearly perfectly at 98%! 🤯 - OpenAI Operator: 54% - Perplexity Pro: 27% - ChatGPT-4o: 49% - DeepSeek-R1: 41% ArXiv: arxiv.org/pdf/2503.19193 HuggingFace data sample: huggingface.co/datasets/Patr… Patronus dataset: app.patronus.ai/datasets Learn about our approach below 👇
While experimenting with alignment methods, we observed that APO was more robust to noise in synthetic training data than DPO or KTO. Thanks for the excellent contribution to the community @KarelDoostrlnck and team 🚀
23 Dec 2024
Happy to see @PatronusAI use our Anchored Preference Optimization (APO) objective in their study!
I'm calling it right now - distilling reasoning chains is going to be the next big thing! ⛏️ @OpenAI #OpenAi #o3