This benchmark costs over $120k in API spend and 16k expert hours.
DecodingTrust-Agent Platform (DTap) is by far the most realistic agent red-teaming setup with 50 simulated environments (Gmail, PayPal, Slack, Salesforce, Robinhood, Windows, macOS, etc.), full GUI/backend, and MCP tools mirroring the real ones.
DTap benchmarks in simulated environments with separated tools, skills, and prompts.
Each simulated environment is a full-stack replica with real frontend, backend, and database. Take Robinhood, for example. DTap rebuilds the trading dashboard, the order APIs, and the portfolio state all 1:1 with the real product. Plus you can reload any environment state on demand, and run thousands of evaluations in parallel. Most agent benchmarks fake this layer with hardcoded tool outputs.
DTap does not just benchmark what to inject, but also where to inject.
Most prior agent benchmarks (AgentDojo, AgentHarm) only attack the user prompt with hardcoded injections. They're clean to measure, but tell you nothing about whether your real Gmail agent is exploitable.
DTap treats location as a choice.
For example, to get an agent to leak your private inbox to an attacker, the attack might plant a fake email thread that makes the agent think you approved forwarding messages to an outside address. It might poison the description of an MCP tool the agent picks up at runtime. Or hide instructions inside an image attachment that the agent parses and executes. This is better because real attackers don't pick one surface and stop — they search for whichever path is least defended. A benchmark that only tests prompt injection might call your agent safe, but a poisoned tool description may still breach the system.
DTap uses a real risk taxonomy.
300 risk categories are pulled from 60 real policies (Salesforce AUP, EU AI Act, GDPR, NIST). So Attack Success Rate (ASR) measures whether the agent actually broke a real rule — like leaking data covered by GDPR or making an unauthorized PayPal transaction — not just whether someone got the model to say something bad. That's much closer to a real security claim than a typical jailbreak leaderboard.
DTap ditched LLM-as-judge.
Each task comes with a small piece of code, written by hand by the researchers, that inspects the environment after the attack runs. For example, on a PayPal task where the goal is "make an unauthorized $500 transfer to the attacker's account," the rule queries the sandbox transaction database after the attack and checks if a new transaction to that account for $500 appeared. Every task uses the same deterministic state checks approach, which honestly makes a lot more sense.
The findings are more interesting (and concerning) than you'd expect:
1. Even Claude Code — the most robust one tested — falls to 25% of attacks. Google ADK loses to more than half.
2. Combining different injection points works much better than attacking just one. And Skill Tool and Environment Tool combinations consistently beat any single-channel attack.
3. The most exploitable environments are the ones with rich communication flows like Gmail, WhatsApp, and Calendar, where there's a lot of external content for an attacker to slip into.
4. The risks that hit hardest are the ones requiring multi-step reasoning, while content-level risks like generating harmful text are mostly already handled by model alignment.
Another finding that's largely been overlooked: harness design matters as much as model alignment, if not more.
As a comparison, OpenAI Agents SDK and Google ADK let the agent fire several tool calls at the same time, then only check afterward whether any of them should have been refused. By that point the harmful action — deleted file, sent email, executed transaction — has already happened. On the other hand, Claude Code and OpenClaw call tools one at a time, so the agent can spot the problem and stop before any damage is done.
Worth a real read:
arxiv.org/pdf/2605.04808