A nod to "Attention Is All You Need" (Vaswani et al., 2017)
Before Transformers, our models read text like humans do: word by word, left to right, holding onto context through a kind of sequential memory. It was slow and limiting.
The breakthrough was realizing we could let the model look at everything simultaneously. Imagine reading a sentence where every word can instantly "talk" to every other word to figure out what's important. That's attention. When you read "The animal didn't cross the street because it was too tired," the model can instantly connect "it" to "animal" rather than "street" by having all words attend to each other in parallel.
This architecture unlocked more efficient training at scale. Suddenly we could train on massive datasets because we weren't bottlenecked by sequential processing. More importantly, these models could discover patterns and relationships across much longer contexts, enabling emergent capabilities we see today.
This made it possible to go from narrow, task-specific models to the general-purpose AI assistants we're building and using today (eg ChatGPT, Claude).