VLMs still struggle across diverse visual inputs. Introducing WorldBench, our new VLM benchmark of visually diverse, realistic, and challenging VQA questions! All questions are annotated *by hand* and meticulously checked for quality. Try it out for yourself!
Today’s vision benchmarks suggest VLMs are nearing saturation, but real-world visual understanding is far from solved.
Introducing WorldBench: 2,000 hand-written, human-verified VQA questions focused on visual diversity and designed to be challenging for frontier models. Gemini-3.1-Pro leads with just 64.0% accuracy. (1/10)