So excited to finally share this!
Linear probes often outperform SAEs, especially out-of-distribution (OOD).
@thesubhashk @JoshAEngels et al showed this convincingly (
arxiv.org/abs/2502.16681). This prompted
@NeelNanda5 and others to de-emphasize SAE research. Empirically, fair enough. But we think the theoretical case for dictionary learning was dismissed too quickly.
@oneill_c previously showed SAEs can't do proper sparse coding (
arxiv.org/abs/2411.13117).
@shruti_joshi @vpacela and
@isacama_phys took this further and showed how this leads to problems particularly in OOD settings. So the issue may not be with dictionary learning itself, but with the current tools.
Here's the core argument: if neural representations are in superposition, i.e. more features than dimensions encoded linearly (
arxiv.org/abs/2503.01824), then linear probes fundamentally cannot be the answer.
This is a compressed sensing problem. There's a linear measurement (the representation) and a nonlinear inference procedure (like an SAE encoder) that recovers the higher-dimensional sparse signal. Linear algebra tells us error-free recovery is impossible if decoding is restricted to be linear. (but see this cool work if errors are acceptable
arxiv.org/abs/2602.11246)
Check out our video: We have some neat demonstrations here. A linear decision boundary in 3D becomes nonlinear in 2D, even though all sparse combinations of latents remain distinguishable. Compressed sensing works: we can, in principle, recover the high-dimensional latent space where linear probes work and generalize OOD.
Where does this leave us? With finite data and millions of concepts, simpler methods may perform better for a while. But if we want interpretability and safety methods that work OOD, especially compositional generalization covering all possible jailbreaks and real-world failures, we'll have to build bottom up from the right theory.
@kennylpeng @thebasepoint @tegmark @yash_j_sharma @woog09 @livgorton @EkdeepL @thomas_fel_ @nsaphra
SAEs fail at OOD tasks. Why?
Features in superposition are linearly representable but not linearly accessible. Instead of discarding sparse coding, we embrace the geometry of superposition and use methods equipped to handle the nonlinearity it induces.