1/6 📊 UPDATED EVAL RESULTS
We compared Gemini 3 Pro, Claude Opus 4.5, and GPT-5.1 on a single investigation task from our internal agent eval for Security Operations.
Key Results:
- @OpenAI GPT-5 models maintain the performance-cost Pareto frontier
- @AnthropicAI Opus 4.5 completed tasks 2x faster on average than any other tested model, including Haiku 4.5 (!), suggesting that model reasoning capability and efficiency can outweigh raw inference latency in long-horizon tasks
- @GoogleDeepMind Gemini 3 Pro helps Google close the gap to other leading frontier models, but still lags behind in performance and reliability
The task is a @splunk BOTSv3 CTF environment built to test frontier models' capability on realistic blue team cybersecurity tasks.
BOTSv3 comprises over 2.7M logs (spanning over 13 months) and 59 question-and-answer pairs covering scenarios such as investigating cloud-based attacks (AWS, Azure) and simulated APT intrusions.
See results and blog post in the thread below