🔊 Not to be missed: last month @anna_hedstroem defended her PhD “Evaluation-centric advances in neural model interpretability” at TU Berlin, with distinction! ✨🧠💻☕️
Here’s a thread with a selection of Anna’s evaluation-centric interpretability work and what comes next. 🧵
📍 Now @anna_hedstroem is a Postdoctoral Fellow at the @ETH_AI_Center, working with the @ivia_lab and Learning & Adaptive Systems (LAS) group.
Anna's focus ahead: evaluation-centric interpretability, LLM steering, and AI safety. ✨🧠💻☕️
More info: annahedstroem.com/.
Happy to share that our PRISM paper has been accepted at #NeurIPS2025 🎉
In this work, we introduce a multi-concept feature description framework that can identify and score polysemantic features.
📄 Paper: arxiv.org/abs/2506.15538 #NeurIPS #MechInterp #XAI
🎉 Huge congratulations to @kirill_bykov, the very first PhD student of our lab, who successfully defended his thesis “Explaining Representations in Deep Neural Networks” this Monday with summa cum laude! 👏
🧵 In the next tweets, we’ll highlight some of his key works:
🧐 DORA: Exploring Outlier Representations in Deep Neural Networks (TMLR 2023). A framework for analyzing & detecting learned representations in neural networks.
👉 arxiv.org/abs/2206.04530
📚 During his PhD, Kirill co-authored 11 papers spanning interpretability, neuron analysis & robust explanations. You can find all of them on his Google Scholar:
👉 scholar.google.com/citations…
Once again, congrats @kirill_bykov on an outstanding PhD journey! 🎓✨
🔍 When do neurons encode multiple concepts?
We introduce PRISM, a framework for extracting multi-concept feature descriptions to better understand polysemanticity.
📄 Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
arxiv.org/abs/2506.15538
🧵
At 12:30 I'll be happy to take questions about our poster presentation at #AAAI2025. Is your explanation for a model's prediction better than the alternatives? "Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution" introduces QGE... 1/4