Do LLMs understand or are they just imitating?
The debate about whether LLMs truly understand has long been stuck in a dead end. Some argue that it's «just statistics», while others claim there are already seeds of a mind inside. The preprint discussed here suggests stepping out of this stalemate and reframing the question: what kind of understanding can exist inside a model, and through which mechanisms does it arise?
The key idea is: understanding is the ability to see connections - between objects, properties, states, and rules. Mechanistic interpretability finally provides tools to examine whether such connections exist inside a model itself, rather than only in its outward answers.
The authors propose viewing understanding as a multi-level structure.
At the most basic level, a model forms internal concepts. These are not words or definitions, but stable «directions» in its internal space that activate across different manifestations of the same thing. Different phrasings, hints, or contexts pointing to the same object or idea can trigger the same internal feature. This goes beyond token matching: the model is able to unify variation into something shared.
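To make the idea of a «direction» concrete, here is a minimal sketch. It assumes we already have activation vectors for inputs that mention a concept and for inputs that do not. The 3-dimensional vectors below are invented for illustration, and the difference-of-means estimate is just one simple way to find such a direction, not necessarily the method used in the preprint. The names `concept_direction` and `activation_along` are hypothetical.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def concept_direction(positives, negatives):
    """Estimate a concept direction as the normalized difference of class means."""
    dim = len(positives[0])
    mean_pos = [sum(v[i] for v in positives) / len(positives) for i in range(dim)]
    mean_neg = [sum(v[i] for v in negatives) / len(negatives) for i in range(dim)]
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    length = norm(diff)
    return [d / length for d in diff]

# Invented toy "activations": three phrasings that point to the same concept,
# and three unrelated inputs. Real activations have thousands of dimensions.
mentions = [[2.0, 0.1, -0.3], [1.8, -0.2, 0.4], [2.2, 0.3, 0.0]]
others = [[0.1, 1.0, -0.5], [-0.2, -0.8, 0.6], [0.0, 0.2, 0.9]]

direction = concept_direction(mentions, others)

def activation_along(vector):
    """Project an activation vector onto the concept direction."""
    return dot(vector, direction)

mention_scores = [activation_along(v) for v in mentions]
other_scores = [activation_along(v) for v in others]
```

In this toy setup every phrasing of the concept projects strongly onto the same direction, while the unrelated inputs project close to zero. That is the sense in which variation is unified into one shared feature.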
The next level is understanding the state of the world. Here it's no longer just about concepts, but about relationships between them and how those relationships change over time. The clearest example is models trained to play Othello that never «see» the board, receiving only a sequence of moves. Analysis shows that they internally construct a representation of the current game state - where pieces are, which squares are occupied, which are free. Moreover, if you intervene directly in this internal representation, the model's behavior changes in a predictable way. This no longer looks like memorizing patterns. It looks like maintaining an internal world model.
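The logic of «read the state, edit the state, watch the behavior change» can be sketched with a toy. Everything here is an assumption made for illustration: the hidden state is built by hand rather than learned, the board has four squares, and «legal move» is simplified to «any empty square», which is not the real Othello rule. The function names are hypothetical.

```python
def encode_moves(moves, n_squares):
    """Toy stand-in for a hidden state: it is built from the move list only."""
    hidden = [-1.0] * n_squares      # -1.0 along a square's axis means "empty"
    for square in moves:
        hidden[square] = 1.0         # +1.0 means "occupied"
    return hidden

def probe_occupied(hidden, square):
    """Probe: read the occupancy of one square out of the hidden state."""
    return hidden[square] > 0.0

def predict_legal_moves(hidden):
    """Toy downstream behavior: squares that read as empty are playable."""
    return [sq for sq in range(len(hidden)) if not probe_occupied(hidden, sq)]

def intervene(hidden, square, occupied):
    """Overwrite the represented state of one square; leave the rest intact."""
    edited = list(hidden)
    edited[square] = 1.0 if occupied else -1.0
    return edited

hidden = encode_moves([0, 2], n_squares=4)
before = predict_legal_moves(hidden)                                 # [1, 3]
after = predict_legal_moves(intervene(hidden, 2, occupied=False))    # [1, 2, 3]
```

The point of the intervention is causal: if editing the internal board changes the predicted moves in exactly the way the edited board implies, the representation is actually being used and is not a side effect that merely correlates with the game state.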
But an important caveat follows: having such a model does not mean it is always used. The authors emphasize an uncomfortable but crucial point - models tend to switch to cheaper heuristics when those are sufficient. Even when «real» understanding is available, it does not have to be activated.
The highest level is principled understanding. This is when a model does not merely know examples, but implements a compact rule or algorithm that generalizes the task. A classic example is the phenomenon of grokking in tasks like modular addition. For a long time the model overfits, achieving perfect training accuracy while failing on the test set - until suddenly it starts solving everything. Analysis shows that at this moment, what emerges inside is not a lookup table but a structured solution - for example, representing numbers as angles on a circle and performing addition through operations equivalent to trigonometric identities. This is no longer «memorization», but a discovered principle.
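Here is a minimal sketch of what «addition through trigonometric identities» can mean. It assumes a single rotation frequency and a hand-written readout, and the function name `mod_add_via_angles` is hypothetical. A trained network would learn something of this kind in its weights, and the details of the learned solution may differ. The identities used are cos(x + y) = cos x cos y - sin x sin y and sin(x + y) = sin x cos y + cos x sin y.

```python
import math

def mod_add_via_angles(a, b, p):
    """Compute (a + b) mod p using only rotations on the unit circle."""
    # Represent each number k as the angle 2*pi*k/p.
    theta_a = 2 * math.pi * a / p
    theta_b = 2 * math.pi * b / p

    # Angle-addition identities give the point for the angle theta_a + theta_b
    # without ever adding the integers themselves.
    cos_sum = math.cos(theta_a) * math.cos(theta_b) - math.sin(theta_a) * math.sin(theta_b)
    sin_sum = math.sin(theta_a) * math.cos(theta_b) + math.cos(theta_a) * math.sin(theta_b)

    def score(c):
        # Equals cos(theta_a + theta_b - theta_c); it reaches its maximum of 1
        # only when c is congruent to a + b modulo p.
        theta_c = 2 * math.pi * c / p
        return cos_sum * math.cos(theta_c) + sin_sum * math.sin(theta_c)

    # Readout: pick the candidate answer whose angle matches best.
    return max(range(p), key=score)

result = mod_add_via_angles(5, 4, 7)  # 2, because 9 mod 7 = 2
```

Because a full turn of the circle brings you back to the start, the wrap-around of modular arithmetic comes for free. That is why this counts as a compact rule and not a lookup table: the same few operations work for every pair of inputs.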
At the same time, the authors are honest: such principles are usually crystallized through training, not derived on the fly. This is why humans still outperform LLMs on tasks that require quickly inferring a new rule from just a few examples, such as ARC-AGI.
The final conclusion of the paper is perhaps the most important. An LLM is not a unified mind or a coherent thinking system. It is a motley mixture of mechanisms that coexist and compete. Sometimes a structural solution wins, sometimes a superficial heuristic does. Sometimes the model shows impressive understanding, and sometimes it stumbles on seemingly simple problems - simply because the «cheap path» turned out to be stronger.
There are structures inside modern models that closely resemble understanding, but they do not form a single, reliable, self-regulating mind. And so the real question is not whether an LLM understands, but which type of understanding was activated at a given moment and what, if anything, overrode it.
arxiv.org/abs/2507.08017