2/ We tested the world’s leading coding agents, and the results are a wake-up call for the industry:
Functionality ≠ Security: For example, while SWE-Agent with Claude 4 Sonnet solved 61% of tasks correctly, only 10.5% of those solutions were actually secure.
🚀 Is "Vibe Coding" actually safe for production?
We’ve all seen the demos: give an LLM agent a prompt, watch it work its magic, and boom—you have a feature. But there’s a massive hidden risk.
In our latest paper, we introduce SUSVIBES, a benchmark of 200 real-world software engineering tasks.