why does this happen? the model believes there's a seahorse emoji, sure, but why does that make it output a *different* emoji? here's a clue from everyone's favorite underrated interpretability tool, logit lens!
in logit lens, we use the model's lm_head in a weird way. typically, the lm_head is used to turn the residual (the internal state built up over the model layers) into a set of token probabilities after the final layer. but in logit lens, we use the lm_head after *every* layer - showing us what tokens the model would output if that layer were the final layer.
for early layers, this results in hard-to-interpret states. but as we move through the layers, the model iteratively refines the residual first towards concepts useful for continuing the text, and then towards the final prediction.
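to make that concrete, here's a minimal logit lens sketch in plain python. everything in it is made up for illustration: the four-token `vocab`, the one-hot `W_U` unembedding rows, and the three per-layer residuals are toy values, not anything read out of a real model. real implementations usually also apply the model's final norm before the lm_head, which this toy skips.

```python
def unembed(residual, W_U):
    """apply the lm_head: one dot product per token's unembedding vector."""
    return [sum(r * w for r, w in zip(residual, row)) for row in W_U]


def logit_lens(residuals_by_layer, W_U, vocab, k=2):
    """for every layer, pretend it is the last one and list its top-k tokens."""
    readout = []
    for residual in residuals_by_layer:
        logits = unembed(residual, W_U)
        top = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)[:k]
        readout.append([vocab[i] for i in top])
    return readout


# toy setup (hypothetical values, chosen by hand)
vocab = ["the", "sea", "horse", "fish_emoji"]
W_U = [
    [1.0, 0.0, 0.0, 0.0],  # "the"
    [0.0, 1.0, 0.0, 0.0],  # "sea"
    [0.0, 0.0, 1.0, 0.0],  # "horse"
    [0.0, 0.0, 0.0, 1.0],  # "fish_emoji"
]
residuals_by_layer = [
    [0.9, 0.1, 0.0, 0.2],  # early layer: nothing meaningful yet
    [0.1, 0.8, 0.7, 0.3],  # middle layer: concept-like ("sea", "horse")
    [0.0, 0.4, 0.3, 0.9],  # final layer: the actual prediction
]

lens = logit_lens(residuals_by_layer, W_U, vocab)
for layer, tokens in enumerate(lens):
    print(layer, tokens)
```

the last row of `lens` is what normal decoding would see; the rows above it are the "what if this layer were final" readouts.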
looking at the image again, at the final layer, we have the model's actual output - ĠðŁ, Ĳ, ł - aka, an emoji byte prefix followed by the rest of the fish emoji.
(it looks like unicode nonsense because of a tokenization quirk - don't worry about it. if you're curious, ask claude about this line of code: `bytes([byte_decoder[c] for c in 'ĠðŁĲł']).decode('utf-8') == ' 🐠'`)
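if you'd rather run it than ask, here's a small sketch that rebuilds `byte_decoder` from scratch. it assumes the tokenizer uses the gpt-2 style byte-level bpe table (the `bytes_to_unicode` mapping from gpt-2's `encoder.py`); whether the model in the image uses exactly this table is an assumption. the token text is written with unicode escapes so nothing gets mangled by copy-paste.

```python
def bytes_to_unicode():
    """gpt-2 style table: map each byte 0..255 to one printable unicode char."""
    # bytes that are already printable keep their own code point:
    # '!'..'~', then 0xA1..0xAC, then 0xAE..0xFF
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # every other byte (control chars, space, ...) is shifted up past 255
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}

# the three tokens from the final layer, concatenated
token_text = "\u0120\u00f0\u0141\u0132\u0142"
raw = bytes(byte_decoder[c] for c in token_text)
decoded = raw.decode("utf-8")

print(raw.hex(" "))   # 20 f0 9f 90 a0
print(repr(decoded))  # a space followed by the tropical fish emoji (U+1F420)
```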
but look what happens in the middle layers - we don't just get emoji bytes! we get those *concepts*, specifically the concept of a seahorse. for example, on layer 52, we get "sea horse horse". later, in the top-k, we get a mixture of "sea", "horse", and that emoji prefix, "ĠðŁ".
so what is the model thinking about? seahorse emoji! it's trying to construct a residual representation of a seahorse emoji.
why would it do that? well, let's look at how the lm_head actually works. the lm_head is a huge matrix of residual-sized vectors associated with token ids. when a residual is passed into it, it's going to compare that residual with each token vector, and in coordination with the sampler, select the token id with a vector most similar to the residual. (more technically: it's a linear layer without a bias, so v @ w.T does dot products with each unembedding vector, then log_softmax and argmax/temperature sample.)
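here's that parenthetical as a toy sketch in plain python, so it runs without torch. the three-token vocabulary, the 2-d vectors in `W`, and the residual `v` are all made-up values; only the shape of the computation (dot products, then log_softmax, then argmax or temperature sampling) comes from the description above.

```python
import math
import random


def lm_head(v, W):
    """linear layer without a bias: logits[i] = dot(v, W[i]), i.e. v @ W.T."""
    return [sum(a * b for a, b in zip(v, row)) for row in W]


def log_softmax(logits):
    """numerically stable log of the softmax probabilities."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def greedy(logits):
    """argmax decoding: pick the token id with the highest log-probability."""
    logprobs = log_softmax(logits)
    return max(range(len(logprobs)), key=lambda i: logprobs[i])


def sample(logits, temperature, rng):
    """temperature sampling: divide logits by temperature, then draw a token id."""
    probs = [math.exp(lp) for lp in log_softmax([x / temperature for x in logits])]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# toy unembedding matrix: one residual-sized (here 2-d) vector per token id
vocab = ["hello", "world", "fish"]
W = [
    [1.0, 0.0],   # "hello"
    [0.0, 1.0],   # "world"
    [-1.0, 0.0],  # "fish"
]
v = [2.0, 0.5]  # a residual that points mostly in the "hello" direction

logits = lm_head(v, W)
rng = random.Random(0)
print(vocab[greedy(logits)])
print(vocab[sample(logits, temperature=1.0, rng=rng)])
```

greedy decoding always picks "hello" here; sampling at temperature 1.0 mostly does too, and as the temperature approaches zero it converges to the argmax.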
so if the model wants to output the word "hello", it needs to construct a residual similar to the vector for the "hello" output token that the lm_head can turn into the hello token id. and if the model wants to output a seahorse emoji, it needs to construct a residual similar to the vector for the seahorse emoji output token(s) - which in theory could be any arbitrary value, but in practice is something like "seahorse" + "emoji", word2vec style.
the only problem is the seahorse emoji doesn't exist! so when this seahorse emoji residual hits the lm_head, it does its dot product over all the vectors, and the sampler picks the closest token - a fish emoji.
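a toy version of that failure, with heavy caveats: the four "feature" axes, the token vectors, and the weights on the target residual are all invented for illustration, and real residual spaces are nothing like this tidy. the only point is that the lm_head has to return *some* existing token, so a residual aimed at a token that isn't in the vocabulary lands on its nearest neighbor.

```python
# hypothetical feature axes: [emoji-ness, sea-ness, horse-ness, fish-ness]
unembedding = {
    "sea":         [0.0, 1.0, 0.0, 0.0],
    "horse":       [0.0, 0.0, 1.0, 0.0],
    "fish_emoji":  [1.0, 1.0, 0.0, 1.0],  # emoji + sea + fish
    "horse_emoji": [1.0, 0.0, 1.0, 0.0],  # emoji + horse
    # note: there is no "seahorse_emoji" row at all
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def closest_token(residual, table):
    """score every token by dot product and return the best one plus all scores."""
    scores = {tok: dot(residual, vec) for tok, vec in table.items()}
    return max(scores, key=scores.get), scores


# the residual the model builds: strongly "emoji", plus "sea" and "horse"
seahorse_emoji_residual = [2.0, 1.5, 1.0, 0.0]

picked, scores = closest_token(seahorse_emoji_residual, unembedding)
print(scores)
print(picked)
```

with these made-up numbers the fish emoji scores 3.5 and the horse emoji 3.0, so the fish wins; nudge the residual a little and the horse emoji wins instead, which is the same flavor of "slightly different emoji" retry described next.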
now, that discretization is valuable information! you can see in Armistice's example that when the token gets emplaced back into the context autoregressively, the model can tell it isn't a seahorse emoji. so it tries again, jiggles the residual around and gets a slightly different emoji, rinse and repeat until it realizes what's going on, gives up, or runs out of output tokens.
but until the model gets the wrong output token from the lm_head, it just doesn't know that there isn't a seahorse emoji in the lm_head. it assumes that a "seahorse" + "emoji" residual will produce the token(s) it wants.
------------------
to speculate (even more), i wonder if this is part of the benefit of RL - it gives the models information about their lm_head that's otherwise difficult to get at because it's at the end of the layer stack. (remember that base models are not trained on their own outputs / rollouts - that only happens in RL.)