Saying LLMs are "just next-token predictors" dramatically understates what they actually do.
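Yes, they predict the next token, but that doesn't mean they work one word at a time with no regard for what comes later. During training, the next-token loss is computed at every position of a sequence at once, with each prediction conditioned on all the preceding context, so the model's internal representations are pushed to carry information that helps with later tokens too. In practice, this means they often behave as if they are planning several steps ahead.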
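As a rough illustration of how the training objective covers a whole sequence rather than a single word, the sketch below averages the next-token loss over every position. The names `sequence_loss`, `predict_next`, and `uniform_model` are made up for this example, and the "model" is just a uniform guesser standing in for a real network.

```python
import math

def sequence_loss(tokens, predict_next):
    """Average next-token cross-entropy over every position in `tokens`.

    `predict_next(prefix)` is a hypothetical stand-in for a model: it takes
    the tokens seen so far and returns a dict mapping each candidate next
    token to its probability.
    """
    losses = []
    for t in range(1, len(tokens)):
        probs = predict_next(tokens[:t])   # condition on the full prefix
        p = probs.get(tokens[t], 1e-12)    # probability assigned to the true next token
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

def uniform_model(prefix):
    """Toy model (assumption): ignores the prefix and guesses uniformly."""
    vocab = ["the", "cat", "sat", "down"]
    return {tok: 1.0 / len(vocab) for tok in vocab}

loss = sequence_loss(["the", "cat", "sat", "down"], uniform_model)
print(round(loss, 4))  # 1.3863, i.e. ln(4) for a uniform guess over 4 tokens
```

A single four-token sequence produces three prediction targets here. A real model shares its parameters across all of those positions, which is why one pass over a long document yields a training signal at every token.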
Also, they're trained on an enormous variety of data: not just everyday prose, but code, financial data, weather reports, scientific papers, logs, and much more. To make accurate predictions across all these domains, they end up learning the underlying patterns and structures that generate that data.
For example, predicting weather-related text requires understanding concepts like geography, seasons, sunlight, and climate cycles. Predicting code requires understanding programming logic. Predicting financial discussions requires understanding economic behavior.
A better way to think about LLMs is not as "next-token predictors" but as general-purpose simulators. Give them a prompt, and they simulate the world, system, or domain implied by that prompt.
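To make the "give them a prompt" idea concrete, here is a deliberately tiny sketch of the generation loop: predict a token, append it, repeat. Everything in it is an assumption for illustration. `train_bigram` and `generate` are hypothetical helpers, and a bigram counter over two sentences is a drastic simplification that only memorizes; it is nothing like what happens inside a real LLM.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which token follows which across a list of token lists."""
    counts = defaultdict(Counter)
    for tokens in corpus:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, max_new_tokens=5):
    """Greedy autoregressive decoding: predict, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # nothing ever followed this token in the corpus
        tokens.append(followers.most_common(1)[0][0])
    return tokens

# Toy corpus (assumption): one weather-like line and one code-like line.
corpus = [
    ["heavy", "rain", "expected", "tonight"],
    ["return", "a", "+", "b"],
]
counts = train_bigram(corpus)

print(generate(counts, ["heavy"]))   # ['heavy', 'rain', 'expected', 'tonight']
print(generate(counts, ["return"]))  # ['return', 'a', '+', 'b']
```

The same loop continues a weather-style prompt in one direction and a code-style prompt in another, because the prompt selects which learned patterns apply. An LLM does this with far richer context than the single previous token used here.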
It's kind of unbelievable that this works as well as it does, but somehow it does.