Today I learnt that in 2009, neuroscientists placed a dead Atlantic salmon into an fMRI scanner and scanned it, and that this apparently has implications for AI interpretability.
They showed the dead fish pictures of humans in social situations and "asked" the fish to determine the emotions of the people. When they ran their standard statistical software, the results showed "brain activity" in the fish that appeared to respond to the task. Obviously, the fish was not thinking; the "activity" was just random noise. The point of the study was to show that if you don't correct for multiple comparisons and use rigorous controls, your tools will find patterns where none exist.
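To make the statistical point concrete, here is a minimal simulation sketch. It is not the original study's pipeline: the voxel count, scan count, threshold, and the function name `count_false_positives` are all assumptions chosen for illustration, and the p-values use a Fisher z approximation. Every "voxel" is pure noise, yet an uncorrected test still flags hundreds of them, while a Bonferroni correction flags almost none.

```python
import math
import numpy as np

def count_false_positives(n_voxels=10_000, n_scans=100, alpha=0.05, seed=0):
    """Correlate pure-noise 'voxels' with a random task signal.

    All sizes and names here are illustrative assumptions.
    Returns (uncorrected_hits, bonferroni_hits).
    """
    rng = np.random.default_rng(seed)
    task = rng.standard_normal(n_scans)                # hypothetical task regressor
    voxels = rng.standard_normal((n_voxels, n_scans))  # noise only, no real signal

    # Pearson correlation of every voxel time series with the task signal.
    task_z = (task - task.mean()) / task.std()
    vox_z = (voxels - voxels.mean(axis=1, keepdims=True)) / voxels.std(axis=1, keepdims=True)
    r = vox_z @ task_z / n_scans

    # Approximate two-sided p-values via the Fisher z transform.
    z = np.arctanh(r) * math.sqrt(n_scans - 3)
    p = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

    uncorrected = int((p < alpha).sum())            # no correction
    bonferroni = int((p < alpha / n_voxels).sum())  # Bonferroni-corrected threshold
    return uncorrected, bonferroni

if __name__ == "__main__":
    print(count_false_positives())
```

With these assumed settings, roughly 5% of the noise voxels should pass the uncorrected threshold simply because 10,000 tests were run.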
This paper claims that the same lesson should be applied in interpretability work: many researchers use various tools to explain what is happening inside a neural network (e.g. probes, SAEs, etc.). But some of these convincing-looking explanations can also be obtained when the same tools are applied to randomly initialized, untrained AI models (the dead salmon equivalent): saliency maps remain plausible after weight randomization, sparse autoencoders find interpretable components in random transformers, etc.
The authors propose that we stop treating interpretability as "storytelling" and start treating it as statistical inference: doing null hypothesis testing, quantifying uncertainty more systematically, interpreting explanations as simplified surrogate models, etc. They do acknowledge, though, that finding some signal in random networks doesn't automatically invalidate finding stronger signals in trained ones.
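One way to picture the null-hypothesis idea is a sketch like the one below. This is my own illustration, not the paper's procedure: the function name `empirical_p_value` and the idea of using scores from randomly initialized models as the null distribution are assumptions. You compute some interpretability score (say, probe accuracy) on the trained model, compute the same score on many untrained copies, and ask how often the untrained ones do at least as well.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score.

    `observed` is a score from the trained model; `null_scores` are the same
    score computed on randomly initialized models (hypothetical setup).
    The +1 terms keep the estimate from ever being exactly zero.
    """
    null_scores = list(null_scores)
    at_least_as_large = sum(1 for s in null_scores if s >= observed)
    return (1 + at_least_as_large) / (1 + len(null_scores))
```

A small p-value here would suggest the trained model's score is unlikely under the random-network baseline, which fits the authors' caveat that a stronger signal in a trained network can still be meaningful.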
I'm not an interpretability researcher myself but would be curious to hear takes!
arxiv.org/abs/2512.18792