"Screening Is Enough"
> [Softmax attention] does not provide an independently interpretable measure of query—key relevance: attn. scores are unbounded, while attn. weights are defined only relative to competing keys. Consequently, irrelevant keys cannot be explicitly rejected...
I just released a revised version of my paper on Multiscreen, an alternative to Transformer for long-context language modeling.
✅ Maintains performance and retrieves information accurately on contexts far longer than those seen during training
✅ Much more stable at large learning rates — it can even train with learning rate 1
✅ Smaller model size & faster inference
✅ More interpretable context selection
I added more figures to the main text and rewrote the paper to make it easier to follow. I’d be very happy if you read it!
Paper →
arxiv.org/abs/2604.01178