This paper shows how to add sliding window attention to full attention models, without retraining them from scratch, while keeping most of the long context quality.
It makes long prompts much cheaper and faster to read, because the model stops comparing every token to every earlier token, and instead mostly looks at a local window.
The practical outcome is a set of simple deployment recipes that let existing full attention models use sliding windows at inference time without full retraining.
Teams can cut prefill cost while keeping most of the long context accuracy, especially by using sliding windows for prefilling and switching back to full attention while generating the answer.
Full attention is expensive because every token compares with all earlier tokens, so the work grows fast with length.
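To make the cost gap concrete, here is a minimal sketch that counts how many query-key comparisons each scheme performs. The function names and the example sizes are illustrative assumptions, not taken from the paper.

```python
def full_attention_pairs(n: int) -> int:
    """Comparisons under causal full attention: token i sees tokens 0..i."""
    return n * (n + 1) // 2


def sliding_window_pairs(n: int, window: int) -> int:
    """Comparisons when each token sees itself and at most window - 1 earlier tokens."""
    return sum(min(i + 1, window) for i in range(n))


if __name__ == "__main__":
    n, window = 32_000, 4_000  # hypothetical prompt length and window size
    full = full_attention_pairs(n)
    local = sliding_window_pairs(n, window)
    print(f"full: {full:,} pairs, windowed: {local:,} pairs, ratio: {full / local:.1f}x")
```

Full attention grows roughly with the square of the prompt length, while the windowed count grows roughly linearly once the prompt is longer than the window.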
Sliding window attention makes each token look only at a recent chunk, but that surprises models trained to see all history.
Their fix, Sliding Window Attention Adaptation, combines three ideas: reading the prompt with a window, keeping the beginning tokens visible, and leaving some layers as full attention.
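The three ingredients above can be sketched as a rule for which earlier positions a token may attend to. This is a toy illustration under stated assumptions: `allowed_keys`, `num_sink`, and `full_layer` are hypothetical names, and the paper's actual window sizes and layer choices are not reproduced here.

```python
def allowed_keys(query_pos: int, window: int, num_sink: int,
                 full_layer: bool = False) -> list[int]:
    """Return the key positions a query at query_pos may attend to.

    full_layer=True models a layer left as ordinary causal full attention.
    Otherwise the query sees the first num_sink tokens plus a local window
    that ends at its own position.
    """
    if full_layer:
        return list(range(query_pos + 1))
    start = max(0, query_pos - window + 1)
    # Beginning tokens stay visible; skip any that already fall inside the window.
    sinks = range(min(num_sink, start))
    return list(sinks) + list(range(start, query_pos + 1))


if __name__ == "__main__":
    print(allowed_keys(9, window=3, num_sink=2))                   # windowed layer
    print(allowed_keys(9, window=3, num_sink=2, full_layer=True))  # full layer
```

In a windowed layer the token at position 9 sees positions 0, 1, 7, 8, and 9, while a layer kept as full attention sees all ten positions.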
Full attention decode is key: the model reads the prompt with a window, then generates with full attention so the answer can use the whole context.
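A minimal sketch of that split, assuming the cache keeps keys and values for every prompt token so that generated tokens can attend to all of them. The name `visible_keys` and the sizes are hypothetical, chosen only to illustrate the idea.

```python
def visible_keys(pos: int, prompt_len: int, window: int) -> list[int]:
    """Key positions visible to the token at pos.

    Positions below prompt_len are prompt tokens (prefill) and use a window.
    Positions at or beyond prompt_len are generated tokens (decode) and use
    full attention over everything cached so far.
    """
    if pos < prompt_len:
        return list(range(max(0, pos - window + 1), pos + 1))
    return list(range(pos + 1))


if __name__ == "__main__":
    prompt_len, window = 6, 2  # toy sizes for illustration
    for pos in range(prompt_len + 2):
        phase = "prefill" if pos < prompt_len else "decode"
        print(pos, phase, visible_keys(pos, prompt_len, window))
```

The expensive part, reading a long prompt, stays cheap, and each generated token pays only one full pass over the cached context.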
They also use chain of thought reasoning and Low Rank Adaptation (LoRA) fine-tuning, meaning only a small add-on is trained, to make sliding windows behave.
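To show why the LoRA add-on is small, here is a rough parameter count for one weight matrix. Standard LoRA freezes the original matrix and trains two thin matrices of rank r. The dimensions and rank below are illustrative assumptions, not the paper's settings.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in a LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in = d_out = 4096  # hypothetical projection size
    rank = 8             # hypothetical LoRA rank
    dense = full_params(d_in, d_out)
    adapter = lora_trainable_params(d_in, d_out, rank)
    print(f"dense: {dense:,}  adapter: {adapter:,}  fraction: {adapter / dense:.4%}")
```

With these example sizes the adapter trains well under 1% of the parameters in the matrix it adapts.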
No single trick works on its own, but combining methods recovers most long context accuracy and cuts the cost of reading prompts.
----
Paper Link – arxiv.org/abs/2512.10411
Paper Title: "Sliding Window Attention Adaptation"