Predicting "only" the next word is sufficient for language models to learn a large body of knowledge that enables them to code, answer questions, understand many topics, chat, and so on.
This is clear to many researchers now, and there are nice tutorials on why this works by @ilyasut, who appeals to compression ( youtube.com/watch?v=AKMuA_TV… ), and by @geoffreyhinton ( youtube.com/watch?v=iHCeAotH… ).
However, the emergence of types of understanding is not unique to language models. In arxiv.org/pdf/1804.06318.pdf by @notmisha and @brandondamos the authors trained models to predict the next few time steps of over a hundred robot hand sensors (touch, gyro, accelerometer, joint info, actuator info, etc.). They then found that they could regress the shape of the object the hand was touching from the activations of the neural networks using probes. That is, the model developed an internal representation of shapes even though it was trained "only" to predict the next few sensor readings. Awareness follows from simple predictions and interaction with the world.
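To make the probing idea concrete, here is a toy sketch in NumPy. It is not the paper's setup: the 8 synthetic sensors, the scalar "shape" `s`, the mixing matrix `mix`, and the tiny tanh network are all assumptions made up for illustration. The network is trained only to predict the next sensor reading; afterwards a ridge-regularized linear probe tries to read `s` out of the hidden activations, with a linear probe on the raw sensors as a control.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy world: 8 sensors observe a 2-D contact point z on an object
# whose size s (the latent "shape") is never used as a training target.
n, n_sensors, n_hidden = 2000, 8, 64
s = rng.uniform(0.5, 1.5, size=n)               # latent shape per sample
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)   # contact angle per sample
mix = rng.normal(size=(2, n_sensors))           # made-up sensor mixing matrix

def sensors(angle):
    z = s[:, None] * np.stack([np.cos(angle), np.sin(angle)], axis=1)
    return z @ mix + 0.01 * rng.normal(size=(n, n_sensors))

x_now = sensors(theta)
x_next = sensors(theta + s)   # the contact moves by an angle that depends on s

# One-hidden-layer tanh network, trained only on next-reading prediction.
W1 = rng.normal(size=(n_sensors, n_hidden)) / np.sqrt(n_sensors)
b1 = rng.uniform(-1.0, 1.0, size=n_hidden)
W2 = rng.normal(size=(n_hidden, n_sensors)) / np.sqrt(n_hidden)
b2 = np.zeros(n_sensors)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2

def mse(x, y):
    return float(np.mean((forward(x)[1] - y) ** 2))

train, test = slice(0, 1500), slice(1500, n)
loss_before = mse(x_now[train], x_next[train])
lr = 0.01
for _ in range(1000):                            # plain full-batch gradient descent
    h, pred = forward(x_now[train])
    err = (pred - x_next[train]) / h.shape[0]    # grad of 0.5 * mean squared error
    grad_h = (err @ W2.T) * (1.0 - h ** 2)       # backprop through tanh
    W2 -= lr * h.T @ err
    b2 -= lr * err.sum(axis=0)
    W1 -= lr * x_now[train].T @ grad_h
    b1 -= lr * grad_h.sum(axis=0)
loss_after = mse(x_now[train], x_next[train])

def probe_r2(f_train, f_test, y_train, y_test, lam=1e-3):
    # Ridge-regularized linear probe with an intercept, scored on held-out data.
    add1 = lambda f: np.hstack([f, np.ones((f.shape[0], 1))])
    a = add1(f_train)
    w = np.linalg.solve(a.T @ a + lam * np.eye(a.shape[1]), a.T @ y_train)
    resid = y_test - add1(f_test) @ w
    return 1.0 - np.sum(resid ** 2) / np.sum((y_test - y_test.mean()) ** 2)

h_all, _ = forward(x_now)
r2_hidden = probe_r2(h_all[train], h_all[test], s[train], s[test])
r2_raw = probe_r2(x_now[train], x_now[test], s[train], s[test])
print(f"prediction loss: {loss_before:.3f} -> {loss_after:.3f}")
print(f"probe R^2 from hidden activations: {r2_hidden:.3f}")
print(f"probe R^2 from raw sensors:        {r2_raw:.3f}")
```

In this toy the shape is not linearly readable from the raw sensors (by construction, `s` is uncorrelated with them), yet it should be readable from the hidden layer. One caveat: the sketch does not separate what next-step training contributes from what the nonlinearity alone provides, so treat it as an illustration of the probing recipe rather than evidence for the paper's claim.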