I'm calling it the Myth of Context Length:
Don't get too excited by claims of 1M or even 1B context tokens. You know what, LSTMs already achieved infinite context length 25 yrs ago!
What truly matters is how well the model actually uses the context. It's easy to make seemingly wild claims, but much harder to solve real problems better.
I highly recommend "Lost in the Middle: How Language Models Use Long Contexts" from Stanford. It's jam-packed with rigorous experiments that put popular long-context models to the test. Key findings:
▸ Figure 1 (left): models are good at using information located at the very beginning or end of their context, but significantly worse in the middle. This isn't unique to the GPT architecture: encoder-decoder models like Flan-T5 also suffer in the middle.
▸ Figure 2 (right): models with natively longer context windows do NOT actually use the context better. You can see that the curves of GPT-3.5 (4k vs 16k) almost completely overlap.
▸ Model performance substantially decreases as input contexts grow longer, regardless of the model's native context length.
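If you want to poke at the first finding yourself, here's a minimal sketch of a position sweep: slide the one relevant document through a pile of distractors and score the answer at each slot. This is my own illustration, not the paper's code. The names `build_context`, `accuracy_by_position`, and `edge_only_reader` are hypothetical, and `edge_only_reader` is a toy stand-in that only "reads" the first and last documents, not a real model.

```python
from typing import Callable, List


def build_context(gold: str, distractors: List[str], position: int) -> List[str]:
    """Return the documents with the gold one inserted at `position`."""
    docs = list(distractors)  # copy so the caller's list is untouched
    docs.insert(position, gold)
    return docs


def accuracy_by_position(
    answer_fn: Callable[[List[str], str], str],
    question: str,
    expected: str,
    gold: str,
    distractors: List[str],
) -> List[float]:
    """Score answer_fn once per gold-document position (1.0 = correct)."""
    scores = []
    for position in range(len(distractors) + 1):
        docs = build_context(gold, distractors, position)
        prediction = answer_fn(docs, question)
        # Substring match on the expected answer, case-insensitive.
        scores.append(1.0 if expected.lower() in prediction.lower() else 0.0)
    return scores


def edge_only_reader(docs: List[str], question: str) -> str:
    """Toy stand-in for a model (assumption): sees only the first and last docs."""
    return docs[0] + " " + docs[-1]


distractors = [f"Filler document {i}." for i in range(4)]
scores = accuracy_by_position(
    edge_only_reader,
    question="What is the capital of France?",
    expected="Paris",
    gold="The capital of France is Paris.",
    distractors=distractors,
)
print(scores)  # [1.0, 0.0, 0.0, 0.0, 1.0]
```

The toy reader gives an exaggerated version of the U-shape by construction. To test a real model, swap `edge_only_reader` for a function that calls it, and average over many questions instead of one.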
We don't need more tokens. We need models that actually pay attention to them (pun intended).