Using interpretability to understand, learn from, and design AI.

Joined August 2024
Pinned Tweet
Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵
Goodfire retweeted
Happy to see our work cited in the Claude Fable & Mythos system card! Steering against eval awareness can carry confounds (e.g. making the model more friendly). Interpretability can help us understand these, and is a promising source of new methods to deal with eval awareness.
Have you debugged your training data? You might not like what you find. Introducing predictive data debugging: reveal and shape what your model will learn before training. In DPO datasets, we found broken guardrails, hallucinations, and fish fart fan fiction (seriously). (1/9)
If you train models on preference data, you have a curriculum you've never read. Predictive data debugging lets you read it, understand it, and rewrite it. We've built it into Silico, our platform for model design. Request access to Silico here: goodfire.ai/silico (9/9)
Cool work applying the idea from our work on RLFR to RL task generation!
Training a model to generate RL tasks that are neither too hard nor too easy costs many solver runs per task. PROPEL instead predicts difficulty via a probe on its activations, amortizing the cost and speeding up generator optimization. New open-ended RL research from @Vmax @GoodfireAI.
New Goodfire research: using logits to monitor for eval awareness!
Would an LLM tell you if it’s gaming your eval? Often, no. But we can still catch the model thinking about it. New research: we measure how close a model comes to saying it’s being tested. This detects eval awareness with 10× to 100× fewer samples than monitoring model outputs.🧵
Goodfire retweeted
The idea that launched @GoodfireAI 🔥

When ChatGPT launched, most people were blinded by the possibilities. Eric Ho saw the risk it posed.

"I kind of saw the next few years unfold before me, where we were about to get increasingly powerful models ... massive amounts of compute, massive amounts of intelligence, but we wouldn't understand at all how this intelligence would actually work."

Most models operate as a black box. Users can see the output, but they can't reliably see how the model reached it, why it behaved in a certain way, or whether it will behave the same way again next time.

With how quickly AI is being deployed into mission-critical environments, Eric wanted to do something to ensure AI models are functioning as intended.

In a new video interview with Emily Zhao, Eric explained his aim behind founding Goodfire: to build the science and technology needed to understand AI from the inside out.
Goodfire retweeted
Very excited to have this paper out! We show that, by having more parameters, larger models see reduced interference between updates. This allows them to retain memories of rarely observed samples of a task, eventually allowing them to learn even the tail end of the distribution. (1/3)
We take for granted that larger models are better than smaller ones, but why is this so? Our new paper, led by Jing Huang and @EkdeepL, traces this to a data-induced competition for resources (neurons), using formal analysis, idealized tasks, and real pretraining.
New research from Goodfire and collaborators: why do larger models learn more tasks? (spoiler: it’s bottlenecked by data)
Goodfire retweeted
The "tiling" perspective explains a lot of the common problems with SAEs
The most popular way to interpret AI is missing the bigger picture. Models think in curved shapes. But sparse autoencoders (SAEs) work with straight lines. Can they still capture models’ curved neural geometry? Yes, but not how you might think! (1/7)
So instead of interpreting features in isolation, what if we searched for features that act together? We turned this idea into an unsupervised pipeline to cluster SAE features based on firing patterns. Together, a cluster of features reveals the overall geometry. (6/7)
SAEs remain useful, as long as we’re aware of their limitations. And we have new techniques in the works that recover manifolds more directly, allowing us to understand models better and control them more effectively! Read the full post here: goodfire.ai/research/can-sae…