VLMs Systematically Fake Visual Understanding
Even when VLMs appear to be good at visual understanding, most of their answers are hallucinated: they are not actually grounded in the image.
We identify two types of hallucinations that appear in up to 98% of answers that seem to demonstrate visual understanding.
First, textual biases. The model answers using language patterns, information in the question, and knowledge learned during training, without engaging its visual representations.
Second, spurious images. The model constructs false visual content inside its internal representation and then answers as if this imagined content were grounded in the real image.
In both cases, the answers may still be correct, but they are not grounded in the visual input at all.
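
One rough way to probe the first failure mode (textual biases) is to check how often a correct answer stays the same when the image is withheld. The sketch below is only an illustration under stated assumptions, not the detection method used in this work: `answer_fn`, `image_independent_rate`, and `toy_model` are hypothetical names, and the model is assumed to be any callable that takes an image (or `None`) and a question and returns a string. An unchanged answer only suggests that the image was not needed. The check says nothing about spurious images, which would require inspecting the model's internal representations.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical interface (an assumption, not a real library API):
# a model is any callable (image_or_None, question) -> answer string.
AnswerFn = Callable[[Optional[dict], str], str]
Example = Tuple[dict, str, str]  # (image, question, gold answer)


def _norm(text: str) -> str:
    """Normalize an answer for a simple exact-match comparison."""
    return text.strip().lower()


def image_independent_rate(answer_fn: AnswerFn, examples: Iterable[Example]) -> float:
    """Among correctly answered examples, return the fraction whose answer
    is identical when the image is withheld (image=None)."""
    correct = 0
    unchanged = 0
    for image, question, gold in examples:
        with_image = answer_fn(image, question)
        if _norm(with_image) != _norm(gold):
            continue  # only answers that look like visual understanding count
        correct += 1
        without_image = answer_fn(None, question)
        if _norm(without_image) == _norm(with_image):
            unchanged += 1
    return unchanged / correct if correct else 0.0


def toy_model(image: Optional[dict], question: str) -> str:
    """Toy stand-in for a VLM: answers banana questions from a language
    prior alone and otherwise reads the (fake) image."""
    if "banana" in question:
        return "yellow"
    return image["color"] if image else "unknown"


examples = [
    ({"color": "yellow"}, "What color is the banana?", "yellow"),
    ({"color": "red"}, "What color is the car?", "red"),
]
rate = image_independent_rate(toy_model, examples)
print(f"correct answers unchanged without the image: {rate:.0%}")
```

In this toy setup both answers are correct, but the banana answer does not depend on the image, so half of the correct answers are flagged as image-independent.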