Interpretability is built on a few core assumptions.
Two of our ICLR 2026 @iclr_conf papers suggest some of those assumptions are wrong (or at least highly incomplete).
1. Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning
arxiv.org/abs/2601.20075
much of the field has internalized an interpretability–accuracy trade-off: if you want cleaner, more human-understandable features, you sacrifice performance.
however, we find that this trade-off is not fundamental.
instead of relying on post-hoc methods (e.g. sparse autoencoders trained on frozen representations), we incorporate sparsity directly into CLIP training.
surprisingly, this produces features that are significantly more interpretable while preserving downstream performance.
this result made me more optimistic about intrinsically interpretable models, a direction that was imo written off too early.
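a rough sketch of what "sparsity inside the training objective" can look like, as opposed to fitting a sparse autoencoder afterwards. this is not the method from the paper: the ReLU + L1 penalty, the function name `sparse_contrastive_loss`, and the `temperature` / `l1_weight` defaults are all assumptions picked for illustration, wrapped around a standard CLIP-style symmetric contrastive loss.

```python
import numpy as np

def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def sparse_contrastive_loss(img_feats, txt_feats, temperature=0.07, l1_weight=1e-3):
    """CLIP-style symmetric contrastive loss plus an L1 sparsity term.

    Illustrative only: the ReLU + L1 recipe is an assumed stand-in,
    not the paper's actual sparsity mechanism.
    """
    # Assumption: non-negative codes, so the L1 term can push entries to exactly zero.
    img = np.maximum(img_feats, 0.0)
    txt = np.maximum(txt_feats, 0.0)

    # Sparsity penalty: mean L1 norm per sample, summed over both modalities.
    l1 = np.abs(img).sum(axis=1).mean() + np.abs(txt).sum(axis=1).mean()

    # L2-normalize so the logits are scaled cosine similarities.
    img_n = img / (np.linalg.norm(img, axis=1, keepdims=True) + 1e-8)
    txt_n = txt / (np.linalg.norm(txt, axis=1, keepdims=True) + 1e-8)
    logits = img_n @ txt_n.T / temperature

    # Matched image/text pairs sit on the diagonal.
    idx = np.arange(logits.shape[0])
    loss_i2t = -_log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -_log_softmax(logits, axis=0)[idx, idx].mean()
    contrastive = 0.5 * (loss_i2t + loss_t2i)

    return contrastive + l1_weight * l1, contrastive, l1
```

the point of the sketch: the sparsity term sits in the same objective as the contrastive term, so the encoder is shaped by both at once instead of being frozen first and decomposed later.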
-
2. Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry
arxiv.org/abs/2510.08638
a lot of interpretability work implicitly assumes that vision representations behave like language representations: sparse, linear, and decomposable into independent features.
we find that this assumption is often misleading.
instead, vision representations appear partially dense and geometrically structured.
we propose the Minkowski Representation Hypothesis: tokens live in sums of convex regions formed from a small set of “archetypes,” rather than isolated features along linear directions.
this reframes how different tasks (classification, segmentation, depth) recruit and organize concepts. it also suggests that many current interpretability tools are mismatched to the actual structure of vision data.
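a toy picture of "sums of convex regions formed from archetypes", under my own reading of that sentence rather than the paper's formal definition. each group of archetypes spans a convex hull, and a token is one point from each hull added together (a Minkowski sum). the names `minkowski_token` and `archetype_groups` are made up for this sketch.

```python
import numpy as np

def minkowski_token(archetype_groups, rng):
    """Draw one point from the Minkowski sum of several convex hulls.

    archetype_groups: list of arrays, each of shape (k_i, d), one row per archetype.
    Returns the token (shape (d,)) and the convex weights used for each group.
    """
    dim = archetype_groups[0].shape[1]
    token = np.zeros(dim)
    weights = []
    for archetypes in archetype_groups:
        # Convex weights: non-negative and summing to 1 (uniform over the simplex).
        w = rng.dirichlet(np.ones(archetypes.shape[0]))
        # A convex combination stays inside this group's convex hull.
        token += w @ archetypes
        weights.append(w)
    return token, weights
```

the contrast with the usual "independent linear directions" picture: within a group the weights are coupled (they must sum to 1), so turning one archetype up means turning the others down, and a token is located inside a region rather than along a single direction.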
--
tl;dr: interpretability can be built into training with surprisingly simple tweaks, and different modalities have different sparsities/geometries. Tailoring the interp method to the modality is super impt!