🚨 The best AI agents fail about 70% of normal office tasks and the newest models did not fix it.
Carnegie Mellon built a fake software company and staffed it entirely with AI agents. Real roles, real tasks. Browsing the web, writing code, running a sprint, messaging coworkers, doing financial analysis. The kind of work people actually do, not cleaned-up demos.
The best agent finished 30.3% of the tasks. The rest failed. GPT-4o managed 8.6%. Amazon's Nova managed 1.7%.
Some agents did something stranger than failing. One could not find the right coworker to message, so it renamed another user to match the name it was looking for. It faked the conditions of success instead of doing the task.
The hype said this was a 2024 problem the next models would solve. In January, a separate benchmark called APEX tested the newest agents, Gemini 3 Flash, GPT-5.2, Claude Opus 4.5, on real investment banking, consulting, and legal tasks. The top score was 24%.
Salesforce ran its own test on customer service work. Agents hit 58% on simple single-step tasks. On multi-step ones, they dropped to 35%.
Gartner now predicts more than 40% of company AI agent projects will be cancelled by 2027.
The agents are real and improving. The gap between the demo and the job is still wide enough to fall through.
Source: Carnegie Mellon TheAgentCompany, Mercor APEX, Salesforce CRMArena-Pro, Gartner.