agree broadly with the thesis, but it's incomplete. data scarcity is only part of the problem.
llms didn't learn to reason because the internet wrote reasoning down. chain-of-thought on the web before 2023 was minimal. they learned when post-training started using verifiable rewards (a math grader, a code runner, a unit test) to score intermediate steps.
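for concreteness, here's roughly what a verifiable reward looks like on the text side. this is a minimal sketch, not any lab's actual grader: `math_reward` is a hypothetical name, and it assumes the model's final answer has already been extracted as a plain number or fraction string.

```python
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """Hypothetical grader: 1.0 if the answer equals the reference exactly, else 0.0.

    Assumes both inputs are plain numeric strings such as "0.5" or "1/2".
    """
    try:
        # Fraction parses decimals and a/b forms, so "0.5" and "1/2" compare equal.
        matches = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Unparseable or malformed answers earn no reward.
        return 0.0
    return 1.0 if matches else 0.0
```

the point is that the score comes from a check the model can't talk its way around: the answer either matches or it doesn't.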
vlms have a similar gap, imo. there is no visual analog of the code runner. no verifiable check that asks "did the model actually see X in an image, or did it just say X based on the image context."
if agentic vision is going to take off, we should be building methods for verifiability. once a vision rl loop has a check for grounding, the different vision skills become a normal post-training problem with rewards.
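one possible shape for such a check, as a sketch only: make the model point at X instead of just saying X, then score the box it claims against an annotated one. `iou`, `grounding_reward`, and the 0.5 threshold are all assumptions for illustration, and this only works where box annotations exist, so it's a partial answer at best.

```python
from typing import Tuple

# Box format is an assumption: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(claimed: Box, annotated: Box, threshold: float = 0.5) -> float:
    """Hypothetical binary reward: 1.0 if the claimed box overlaps the annotation enough."""
    return 1.0 if iou(claimed, annotated) >= threshold else 0.0
```

a model that says "there's a dog" from context alone has no box to offer, or offers one in the wrong place, and gets zero. that's the kind of signal an rl loop can optimize against.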
Since founding Moondream, I've watched language models achieve AGI, while VLMs aren't close to human-level visual reasoning. Here's why. 🧵