Most unsupervised "feature discovery" in LLMs uses sparse autoencoders (SAEs). They work, and they have been scaled to millions of features on frontier-scale models, but they bundle two distinct commitments – a reconstruction loss and a sparsity loss over a fixed-size dictionary – into a single training objective.
Those commitments make sense if your goal is reconstructive decomposition. They make less obvious sense if your aim is to find interpretable structure (directions? features?) in activation space, to retrieve representative examples, identify causal interventions, or measure how representations change across layers and inputs. It turns out a lot of that doesn't need the full SAE machinery.
Exemplar Partitioning (EP) uses leader-clustering (Hartigan, 1975!) to cover the activation manifold with observed exemplars at a calibrated resolution, resulting in a Voronoi partition of activation space that you can read like a feature dictionary.
EP makes one streaming pass over the data until saturation (when no new exemplars form), and uses no backward passes or gradient descent. The animation above shows the algorithm – each new activation either joins an existing cell (close enough to an exemplar) or seeds a new one. It's extraordinarily simple and cheap.
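A minimal sketch of that loop on toy data. The Euclidean metric, the fixed `radius` threshold, and the name `leader_cluster` are illustrative assumptions rather than details taken from the paper or repo, and assigning each point to its nearest in-range exemplar is one common variant of leader clustering.

```python
import numpy as np

def leader_cluster(activations, radius):
    """One streaming pass: each point joins the nearest exemplar if it is
    within `radius`, otherwise it seeds a new cell as an exemplar itself."""
    exemplars = []    # observed activations that seeded a cell
    assignments = []  # cell index for each input row
    for x in activations:
        if exemplars:
            dists = np.linalg.norm(np.stack(exemplars) - x, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] <= radius:
                assignments.append(nearest)
                continue
        # No exemplar is close enough, so this point becomes a new one.
        exemplars.append(x)
        assignments.append(len(exemplars) - 1)
    return np.stack(exemplars), np.array(assignments)

# Toy stand-in for activations: two tight, well-separated blobs in 2-D.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.05, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.05, size=(50, 2))
data = np.concatenate([blob_a, blob_b])

exemplars, assignments = leader_cluster(data, radius=1.0)
print(len(exemplars), "exemplars")
```

In this sketch the threshold sets the resolution: a smaller `radius` gives more, finer cells, and the dictionary stops growing once the stream stops producing points outside every existing cell.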
On AxBench latent concept detection (Gemma-2-2B-it, layer 20), EP reaches 0.881 mean AUROC across 500 concepts. That's within 0.03 of SAE-A (AxBench's strongest dictionary-based baseline), and 0.126 above the canonical GemmaScope 16k SAE leaderboard entry – with about 1,000× less build compute.
And you can do a lot of interesting stuff with the resulting dictionary!
If you build it on a mix of harmful and benign prompts, one region absorbs most of the refusing prompts. Projecting held-out harmful prompts off that exemplar's direction collapses refusal from around 0.98 to around 0.02 – the same ballpark as dedicated refusal-direction work (Arditi et al., 2024).
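"Projecting off a direction" can be sketched as removing each activation's component along a unit vector. This is a generic illustration on random data: how the exemplar's direction is actually defined (raw exemplar vector, mean-centred, etc.) and where in the model the projection is applied are not specified here, so `direction` and `project_off` are hypothetical.

```python
import numpy as np

def project_off(activations, direction):
    """Remove the component of each activation along `direction`."""
    unit = direction / np.linalg.norm(direction)
    # (activations @ unit) is each row's coefficient along the unit vector.
    return activations - np.outer(activations @ unit, unit)

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))   # stand-in for held-out prompt activations
direction = rng.normal(size=8)   # stand-in for the refusal exemplar's direction

ablated = project_off(acts, direction)
# After projection, every row is orthogonal to the direction.
print(np.allclose(ablated @ direction, 0.0))
```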
If you build the EP dictionary to saturation on a corpus (e.g. the Pile), distance-to-nearest-exemplar becomes a graded measure of distribution shift, for free. Random-token-sequence activations sit measurably further out than Pile activations, and Bulgarian Wikipedia (under-represented in the Pile but not really OOD) sits between the two.
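The score itself is just a nearest-neighbour distance against the saturated dictionary. A small sketch, assuming Euclidean distance and using made-up 2-D points in place of real activations (the function name is hypothetical):

```python
import numpy as np

def nearest_exemplar_distance(activations, exemplars):
    """Distance from each activation to its closest exemplar."""
    # Broadcast to shape (n_activations, n_exemplars, dim), then reduce.
    diffs = activations[:, None, :] - exemplars[None, :, :]
    return np.linalg.norm(diffs, axis=-1).min(axis=1)

exemplars = np.array([[0.0, 0.0], [1.0, 0.0]])  # toy dictionary
in_dist = np.array([[0.1, 0.0], [0.9, 0.0]])    # points near the dictionary
shifted = np.array([[5.0, 5.0]])                # a point far from it

in_scores = nearest_exemplar_distance(in_dist, exemplars)
shifted_scores = nearest_exemplar_distance(shifted, exemplars)
print(in_scores, shifted_scores)
```

Larger scores mean the input lands further from anything seen while building the dictionary, which is what makes the measure graded rather than a binary in/out flag.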
Because exemplars are real activations rather than learned decoder columns, you can match dictionaries across different models by their exemplars. If you match EP dictionaries from base vs instruction-tuned Gemma-2-2B, only a handful of regions survive as common, mostly general-purpose syntactic patterns. You can also see how the base model already represents "harmful" as a direction at earlier prompt positions, and instruction tuning pulls it forward to the final-token activation where the refusal decision is made.
The saturated size of a dictionary on a given input stream is itself a measurement of that stream's activation geometry at each layer. On the same model, the proportion of activation space dedicated to chat grows monotonically with depth, code is essentially flat across the network (and lives in a smaller area of activation space than chat does, at every layer), and math is non-monotonic, peaking in the middle.
EP and SAEs don't converge on the same features, aside from a shared core of about 20%. The two methods make different geometric commitments – SAEs to linear separability, EP to density.
The experiments I've done so far are small-scale and exploratory, and I have only tested on Gemma-2-2B. There's a huge amount of further work to be done (both in terms of improving the method and applying it to more tasks), some of which is discussed in the post and paper.
If you are an interpretability researcher interested in developing this method, please check out the GitHub repo and get stuck in!
Post:
lesswrong.com/posts/RroeHBSk…
Paper:
arxiv.org/abs/2605.14347
Code:
github.com/jessicarumbelow/e…