ALT Bar chart titled ‘WebVoyager model leaderboard – ranked by LLM evaluator performance, Sep 20, 2025.’ Y-axis: LLM evaluator score. Four bars: GPT-5 78, Claude 4 Sonnet 77, Gemini 2.5 Pro 73, Gemini 2.5 Flash 67. GPT-5 leads by 1 point over Claude; Flash trails. Purple bars with model logos above each.
ALT Table comparing the Browser Use agent across models by rank, performance (self-reported and LLM evaluator), steps, time, input and output tokens, and price per task.
1. GPT-5: 82% self-reported, 78% LLM evaluator, 9.4 steps, 4.7 min, 55,226 input tokens, 9,233 output tokens, $0.16.
2. Claude 4 Sonnet: 80%, 77%, 11.2 steps, 3.5 min, 57,143 input, 5,923 output, $0.26.
3. Gemini 2.5 Pro: 82%, 73%, 13 steps, 3.0 min, 82,578 input, 7,491 output, $0.18.
4. Gemini 2.5 Flash: 70%, 67%, 16.3 steps, 2.7 min, 116,551 input, 9,059 output, $0.06.
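Takeaway: GPT-5 has the top LLM evaluator score (78%); Claude 4 Sonnet is 1 point behind but the most expensive ($0.26 per task); Gemini 2.5 Flash is the fastest (2.7 min) and cheapest ($0.06 per task) but has the lowest score (67%).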
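
The takeaway can be checked directly against the table figures. The following is a minimal Python sketch, not part of the original benchmark: the rows are transcribed from the table above, and the field names (`evaluator`, `minutes`, `usd`) are illustrative assumptions rather than names used by the leaderboard.

```python
# Rows transcribed from the table above. Field names are hypothetical,
# chosen only for this sketch.
rows = [
    {"model": "GPT-5", "self_reported": 82, "evaluator": 78,
     "steps": 9.4, "minutes": 4.7, "usd": 0.16},
    {"model": "Claude 4 Sonnet", "self_reported": 80, "evaluator": 77,
     "steps": 11.2, "minutes": 3.5, "usd": 0.26},
    {"model": "Gemini 2.5 Pro", "self_reported": 82, "evaluator": 73,
     "steps": 13.0, "minutes": 3.0, "usd": 0.18},
    {"model": "Gemini 2.5 Flash", "self_reported": 70, "evaluator": 67,
     "steps": 16.3, "minutes": 2.7, "usd": 0.06},
]

# Pick out the extremes that the takeaway refers to.
top_score = max(rows, key=lambda r: r["evaluator"])
most_expensive = max(rows, key=lambda r: r["usd"])
cheapest = min(rows, key=lambda r: r["usd"])
fastest = min(rows, key=lambda r: r["minutes"])

print("Top LLM evaluator score:", top_score["model"])
print("Most expensive per task:", most_expensive["model"])
print("Cheapest per task:", cheapest["model"])
print("Fastest per task:", fastest["model"])
```

Run as written, this reports GPT-5 for the top score, Claude 4 Sonnet as the most expensive, and Gemini 2.5 Flash as both the cheapest and the fastest, which matches the takeaway.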