RoPE vs. NoPE in hybrid linear attention models like Kimi Linear / Qwen3Next is tricky.
For example, when using NoPE, we found that slowly expanding the window size of the attention layers during training (64 -> 4k) greatly helps convergence.
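A rough sketch of what such a window schedule could look like. The post only says the window grows slowly from 64 to 4k. The geometric (doubling) shape, reading "4k" as 4096, the `growth_steps` parameter, and the function name `window_size` are all assumptions for illustration, not the actual training recipe.

```python
import math


def window_size(step: int, growth_steps: int, start: int = 64, end: int = 4096) -> int:
    """Attention window size at a given training step (hypothetical schedule).

    Assumption: the window grows geometrically from `start` to `end` over the
    first `growth_steps` steps, snapped down to a power of two, then stays at `end`.
    """
    if growth_steps <= 0:
        raise ValueError("growth_steps must be positive")
    # Fraction of the growth phase completed, clamped to [0, 1].
    frac = min(max(step / growth_steps, 0.0), 1.0)
    # Interpolate in log2 space so the window doubles at evenly spaced steps.
    log_w = math.log2(start) + frac * (math.log2(end) - math.log2(start))
    # Small epsilon guards against floating-point error just below an integer.
    return 2 ** math.floor(log_w + 1e-9)


if __name__ == "__main__":
    for s in (0, 250, 500, 750, 1000, 2000):
        print(s, window_size(s, growth_steps=1000))
```

The value returned at each step would then be used as the window size of the attention layers. A linear ramp or a different step count might work just as well; that is not something the post specifies.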
You see:
- A new arch that is better and faster than full attention, verified with Kimi-style solidness.
I see:
- Starting with inferior performance even on short contexts. Nothing works and nobody knows why.
- Tweaking every possible hyper-parameter to grasp what is wrong.
- Trying to find an efficient chunkwise-parallelizable form to squeeze every bit of juice out of the GPU.
- RoPE or NoPE, a question that haunts us for nights.
- Fighting a buggy implementation that makes one of the long-context benchmarks drop ~20 pts.
- RL diverging. Aligning training-inference numerics.
- Dedicated efforts to make sure comparisons are solid and fair.
- Going back and forth in a pool of adversarial gate-keeping tests until it finally survives.
Great teamwork!