For context, a non-agentic arena tests whether a model can create one simple web page in a single attempt, usually as one HTML file. This is the setup used for our non-agentic web dev leaderboard.
An agentic arena tests whether a model can work more like a developer over several steps, such as creating multiple files and responding to follow-up instructions.
Curiously, we did not notice the same dramatic regression on Design Arena’s agentic full-stack web dev arena, which evaluates multi-file React creations. There, users can reprompt, integrate with backends like Supabase, deploy to Vercel, connect to Google Auth, and more.