Accelerate your transformer model with the new Block-Sparse-Flash-Attention! github.com/Danielohayon/Bloc…
This training-free, drop-in replacement extends FlashAttention-2 with minimal code changes (CUDA kernels included). Paper: arxiv.org/abs/2512.07011
[1/6] Tomorrow (Thursday) at #NeurIPS:
Are Greedy Task Orderings Better Than Random in Continual Linear Regression?
Q: Do models learn better when consecutive tasks are similar or dissimilar?
A: Our analysis suggests that they should be dissimilar!
openreview.net/forum?id=8JdP…
[1/5] Next week at #NeurIPS:
*Optimal Rates in Continual Linear Regression via Increasing Regularization*
In the brain, ageing naturally reduces synaptic plasticity.
Our theory suggests continual learning models may benefit from a similar mechanism!
openreview.net/forum?id=lDh7…