Aditya Iyer

Many mechanistic interpretability techniques work with the residual stream in some way. I wrote a short post unpacking one important property: additivity. The key idea is that once an attention head or MLP neuron computes its output, it writes that output into the residual stream by addition. Using simple block matrix multiplication, you can decompose the stream into additive contributions from individual attention heads, MLP neurons, and bias terms. This makes the residual stream a natural object for circuit analysis: every component leaves a traceable, additive footprint. Full derivation in the post below. adityaiyer7.github.io/blogs/… #MechanisticInterpretability #AIInterpretability #AIAlignment #TransformerCircuits #Transformers
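The additivity claim above can be checked numerically. The following is a minimal sketch, not the derivation from the linked post: it assumes a single toy layer with no layer norm, a ReLU MLP, and random stand-in values for the per-head outputs. All names and sizes (`decompose_layer`, `W_O`, `d_mlp`, and so on) are hypothetical and chosen only for illustration.

```python
import numpy as np


def decompose_layer(seed=0, n_tokens=4, d_model=8, n_heads=2, d_head=4, d_mlp=16):
    """Toy layer: compare the usual forward pass with a per-component sum."""
    rng = np.random.default_rng(seed)

    x = rng.normal(size=(n_tokens, d_model))          # residual stream before the layer
    z = rng.normal(size=(n_tokens, n_heads, d_head))  # stand-in per-head outputs
    W_O = rng.normal(size=(n_heads * d_head, d_model))
    b_O = rng.normal(size=d_model)

    # Usual form: concatenate heads, then one matmul with W_O.
    attn_out = z.reshape(n_tokens, n_heads * d_head) @ W_O + b_O

    # Block view: split W_O into one (d_head, d_model) block per head.
    W_O_blocks = W_O.reshape(n_heads, d_head, d_model)
    head_contribs = np.einsum("thd,hdm->htm", z, W_O_blocks)  # (n_heads, n_tokens, d_model)

    x_mid = x + attn_out  # attention writes into the stream by addition

    W_in = rng.normal(size=(d_model, d_mlp))
    b_in = rng.normal(size=d_mlp)
    W_out = rng.normal(size=(d_mlp, d_model))
    b_out = rng.normal(size=d_model)

    acts = np.maximum(x_mid @ W_in + b_in, 0.0)  # ReLU activations, (n_tokens, d_mlp)
    mlp_out = acts @ W_out + b_out

    # Each neuron writes its activation times its own row of W_out.
    neuron_contribs = acts.T[:, :, None] * W_out[:, None, :]  # (d_mlp, n_tokens, d_model)

    x_final = x_mid + mlp_out  # MLP also writes by addition

    # Residual stream rebuilt as: input + heads + neurons + bias terms.
    reconstructed = x + head_contribs.sum(axis=0) + b_O + neuron_contribs.sum(axis=0) + b_out
    return x_final, reconstructed, head_contribs, neuron_contribs


x_final, reconstructed, head_contribs, neuron_contribs = decompose_layer()
print(np.allclose(x_final, reconstructed))
```

Each slice `head_contribs[h]` or `neuron_contribs[i]` is that component's additive footprint in the stream. Real models usually apply layer norm before each sublayer, which this sketch leaves out.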

Mayank Bhaskar @cataluna84

19 Sep 2024

Continuing with @AnthropicAI's Transformer Circuits series, and as part of the daily paper discussions on the @ykilcher Discord server, I will be volunteering to lead the analysis of the following mechanistic interpretability work 🧮 🔍 📜 Toy Models of Superposition, authored by Nelson Elhage, @trishume, @catherineols, @nschiefer, et al. 🌐 transformer-circuits.pub/202… 🕰 Friday, Sep 20, 2024 12:30 AM UTC // Friday, Sep 20, 2024 6:00 AM IST // Thursday, Sep 19, 2024 5:30 PM PT Previous mechanistic interpretability papers we discussed in this series: 🔬 Softmax Linear Units @ transformer-circuits.pub/202… 🔬 In-context Learning and Induction Heads @ transformer-circuits.pub/202… 🔬 A Mathematical Framework for Transformer Circuits @ transformer-circuits.pub/202… Join in for the fun ~ ykilcher.com/discord #DailyPaperDiscussions #TransformerCircuits #MechanisticInterpretability

Mayank Bhaskar @cataluna84

18 Sep 2024

Continuing with @AnthropicAI's Transformer Circuits series, and as part of the daily paper discussions on the @ykilcher Discord server, I will be volunteering to lead the analysis of the following mechanistic interpretability work 🔍 📜 Softmax Linear Units, authored by Nelson Elhage, @trishume, @catherineols, @NeelNanda5, et al. 🌐 transformer-circuits.pub/202… 🕰 Thursday, Sep 19, 2024 12:30 AM UTC // Thursday, Sep 19, 2024 6:00 AM IST // Wednesday, Sep 18, 2024 5:30 PM PT Previous mechanistic interpretability papers in this series: 🔬 In-context Learning and Induction Heads @ transformer-circuits.pub/202… 🔬 A Mathematical Framework for Transformer Circuits @ transformer-circuits.pub/202… Join in for the fun ~ ykilcher.com/discord #DailyPaperDiscussions #TransformerCircuits #MechanisticInterpretability
