Want to know why Mamba beats other state-space models—and where it falls short?
Then check out our #NeurIPS 2024 paper: "Theoretical Foundations of Deep Selective State-Space Models."
🔗 Read the paper: arxiv.org/abs/2402.19047 💻 Access the code: github.com/Benjamin-Walker/s…
🧵1/6
Adam depends on the gradient distribution during training, which, as far as I know, we don't understand well?
Here, adapted from the Adam paper, v_t is the variance estimate, G_t is the gradient r.v., and X_t is an error r.v. for distribution shift.
Should we be trying to detect distribution shift and correct it (e.g. by taking more samples at the same set of parameters)? Does this matter in some models and not others?
I am genuinely interested! Empirical research on how well our ad-hoc estimates (of gradients, variances, and Fisher information, for example) perform is surprisingly limited, since it needs to be constantly reevaluated as SOTA changes.
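To make the question concrete, here is a minimal sketch of the bias-corrected second-moment estimate from the Adam paper, run on a toy gradient stream. Everything about the stream is an assumption for illustration (scalar Gaussian gradients whose standard deviation jumps from 1 to 3 halfway through), and `adam_second_moment` is a hypothetical helper name, not a library function.

```python
import numpy as np


def adam_second_moment(grads, beta2=0.999):
    """Bias-corrected exponential moving average of squared gradients (Adam's v_t)."""
    v = 0.0
    v_hat = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g**2
        v_hat.append(v / (1.0 - beta2**t))  # bias correction from the Adam paper
    return np.array(v_hat)


rng = np.random.default_rng(0)

# Hypothetical gradient stream: std 1.0 for 2000 steps, then std 3.0 (a distribution shift).
grads = np.concatenate([rng.normal(0.0, 1.0, 2000), rng.normal(0.0, 3.0, 2000)])
true_second_moment = np.concatenate([np.full(2000, 1.0), np.full(2000, 9.0)])

v_hat = adam_second_moment(grads)

# How far the estimate lags the true E[g^2] at each step.
lag_error = np.abs(v_hat - true_second_moment)
print("estimate just before the shift:", v_hat[1999])
print("estimate 100 steps after the shift:", v_hat[2099])
```

In this toy setting the estimate is still far from the new second moment a hundred steps after the shift, because with beta2 = 0.999 the average has a memory of roughly a thousand steps. Whether real training runs shift enough for this to matter is exactly the open question.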
My first PhD paper! 🎉 We learn *diffusion* models for code generation that learn to directly *edit* syntax trees of programs. The result is a system that can incrementally write code, see the execution output, and debug it. 🧵1/n
⭐ Lineax is now on arXiv! ⭐
If you’re doing linear solves or linear least-squares in JAX, give it a shot today!
Lineax
is fast ⚡️,
has new solvers (e.g. QR, tridiagonal),
supports general linear operators.
github: github.com/google/lineax
arXiv: arxiv.org/abs/2311.17283
1/n
The paper describes how we achieved many of these things (such as differentiation through all our solvers), and outlines some of the design choices we made when creating Lineax.
arxiv.org/abs/2311.17283
Finally, a million thanks to @PatrickKidger, who supervised this whole project.
If you’re following me, chances are good you already follow him. If not, go give him a follow! (right after installing Lineax of course 😉)
4/4
Tikhonov regularised trust-region methods (*cough* Levenberg-Marquardt) oddly use two different approximations to the objective function at each step.
One regularised, one not.
What if we just regularised both?
1/
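A small sketch of the two model functions, under my own assumptions: a least-squares objective 0.5*||r||^2 with residual `r` and Jacobian `J`, a fixed damping `lam`, and random toy data. The reading that the step minimises the regularised model while the predicted reduction is usually measured with the unregularised (Gauss-Newton) model is my interpretation of the usual textbook setup, and all function names here are hypothetical.

```python
import numpy as np


def lm_step(J, r, lam):
    """Levenberg-Marquardt step: solves (J^T J + lam * I) p = -J^T r."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + lam * np.eye(n), -J.T @ r)


def gauss_newton_model(J, r, p):
    """Unregularised quadratic model: 0.5 * ||r + J p||^2."""
    return 0.5 * np.sum((r + J @ p) ** 2)


def regularised_model(J, r, p, lam):
    """Tikhonov regularised model: 0.5 * ||r + J p||^2 + 0.5 * lam * ||p||^2."""
    return gauss_newton_model(J, r, p) + 0.5 * lam * np.sum(p**2)


rng = np.random.default_rng(0)
J = rng.normal(size=(5, 3))  # toy Jacobian (assumption)
r = rng.normal(size=5)       # toy residual (assumption)
lam = 0.1

p = lm_step(J, r, lam)
f0 = 0.5 * np.sum(r**2)  # both models agree with the objective at p = 0

# Predicted reduction under each model; a trust-region ratio would divide
# the actual reduction in the objective by one of these.
pred_unreg = f0 - gauss_newton_model(J, r, p)
pred_reg = f0 - regularised_model(J, r, p, lam)
print("predicted reduction, unregularised model:", pred_unreg)
print("predicted reduction, regularised model:  ", pred_reg)
```

The step `p` is the exact minimiser of the regularised model, so using the unregularised model for the predicted reduction always reports a reduction at least as large. "Regularising both" would amount to using `pred_reg` in the acceptance ratio instead.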
They mentioned the choice between these two model functions made little difference in practice.
While this is believable, and indeed proved to be true, it struck me as an example of a claim which is very difficult to verify using existing optimisation software.
12/12