I ran a small experiment to make information theory feel less abstract. I trained a tiny character-level Transformer on TinyShakespeare, then corrupted the text by randomly shuffling a given percentage of character positions:
0%, 1%, 2%, 5%, 10%, 20%, 40%
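Here is a minimal sketch of one way the corruption step could be done. It is an assumption about the procedure, not the exact code from the experiment: it picks the given fraction of positions uniformly at random and permutes the characters among only those positions. The function name `shuffle_fraction` and the `seed` parameter are made up for illustration.

```python
import random

def shuffle_fraction(text: str, fraction: float, seed: int = 0) -> str:
    """Permute the characters sitting at a random `fraction` of positions.

    Hypothetical helper: the character counts stay the same, only their
    order changes, and at most round(fraction * len(text)) positions differ.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)
    chars = list(text)
    k = round(fraction * len(chars))
    # Choose which positions get disturbed.
    positions = rng.sample(range(len(chars)), k)
    # Decide where the character from each chosen position ends up.
    destinations = positions[:]
    rng.shuffle(destinations)
    for src, dst in zip(positions, destinations):
        chars[dst] = text[src]
    return "".join(chars)

if __name__ == "__main__":
    sample = "To be, or not to be, that is the question."
    for pct in (0, 10, 40):
        print(f"{pct:>2}%: {shuffle_fraction(sample, pct / 100)}")
```

Because the characters are only moved, not replaced, the single-character frequencies are unchanged. What gets damaged is the ordering, which is the structure a next-character model relies on.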
Each run tracks cross-entropy loss, perplexity, and bits per character (BPC). Since the model is char-level and the loss is measured in nats per character, BPC is just:
loss / ln(2)
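As a small sketch of that conversion, assuming the loss is the mean per-character cross-entropy in nats (natural log). The helper names `nats_to_bpc` and `perplexity` are hypothetical, not from the experiment's code.

```python
import math

def nats_to_bpc(loss_nats: float) -> float:
    """Convert a per-character cross-entropy from nats to bits."""
    return loss_nats / math.log(2)

def perplexity(loss_nats: float) -> float:
    """Per-character perplexity: exp of the cross-entropy in nats."""
    return math.exp(loss_nats)

if __name__ == "__main__":
    loss = math.log(2)            # a loss of ln(2) nats is exactly one bit
    print(nats_to_bpc(loss))      # bits per character
    print(perplexity(loss))       # like a fair choice between 2 characters
```

Perplexity and BPC carry the same information: perplexity equals 2 raised to the BPC.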
The vocabulary has 65 characters, so at initialization a clueless model should assign about 1/65 probability to each next character. That means the initial loss should be about:
-ln(1/65) = ln(65) ≈ 4.17 nats
Observed initial loss was ~4.21 nats, so the model was behaving like a near-uniform predictor before training.
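This sanity check can be reproduced without a model. The sketch below assumes an idealized untrained network whose logits are all equal, which makes the softmax exactly uniform. The function `cross_entropy_nats` is a hand-written illustration, not a library call.

```python
import math

def cross_entropy_nats(logits, target):
    """Cross-entropy of one prediction: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

vocab_size = 65
# All-equal logits give every character probability 1/65.
uniform_loss = cross_entropy_nats([0.0] * vocab_size, target=0)

if __name__ == "__main__":
    print(f"{uniform_loss:.3f} nats")                # ln(65), about 4.174
    print(f"{uniform_loss / math.log(2):.3f} bits")  # log2(65), about 6.022
```

A real randomly initialized model has logits that are small but not exactly equal, which is one plausible reason the observed value sits slightly above ln(65).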
After training, final BPC rose with corruption:
0%: 2.58
1%: 2.70
2%: 2.81
5%: 3.07
10%: 3.45
20%: 3.97
40%: 4.55
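To put these numbers in context, here is a short sketch that converts each BPC back to nats and perplexity and compares it with the uniform ceiling of log2(65) bits. The BPC values are copied from the list above. The `summarize` helper is hypothetical.

```python
import math

# Final BPC per corruption level (percent), copied from the results above.
bpc_by_corruption = {0: 2.58, 1: 2.70, 2: 2.81, 5: 3.07, 10: 3.45, 20: 3.97, 40: 4.55}

def summarize(results, vocab_size):
    """Return (percent, bpc, loss in nats, perplexity, share of uniform ceiling)."""
    ceiling = math.log2(vocab_size)  # BPC of a uniform predictor
    rows = []
    for pct, bpc in results.items():
        loss_nats = bpc * math.log(2)
        rows.append((pct, bpc, loss_nats, 2 ** bpc, bpc / ceiling))
    return rows

if __name__ == "__main__":
    for pct, bpc, loss, ppl, share in summarize(bpc_by_corruption, 65):
        print(f"{pct:>3}%  bpc={bpc:.2f}  loss={loss:.2f} nats  "
              f"ppl={ppl:.1f}  {share:.0%} of uniform")
```

Even at 40% corruption the BPC stays below the uniform ceiling of about 6.02 bits, so the model is still extracting some structure from the text.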
More corruption destroys predictable structure, so the model needs more bits per character to model the text.
That last part is what made the Shannon idea click for me. Information content is not the same thing as human meaning. By shuffling characters, I destroyed meaning in the normal sense. The text became worse, less readable, less Shakespeare.
But to the model, it also became more surprising. The local patterns were damaged, so it needed more bits per character to encode/predict the sequence.
Meaning went down.
Uncertainty went up.
Bit cost went up.
That distinction finally felt concrete.