NLP | Interpretability | PhD student at the @TechnionLive

Joined March 2022
27 Photos and videos
Pinned Tweet
Check out our new paper 🔥 It’s been so much fun working on this project!
Neural networks do math by rotating shapes. We found a shape-rotating calculator hidden inside an LLM – and it’s used for more than just math! (1/6)
2
2
35
1,516
Tal Haklay retweeted
Do reasoning models internally represent abstract properties of their own chain of thought (such as "which steps are important"), while not surfacing these properties in their tokens?
1
9
25
1,014
Tal Haklay retweeted
Happy to see our work cited in the Claude Fable & Mythos system card! Steering against eval awareness can carry confounds (e.g. making the model more friendly). Interpretability can help us understand these, and is a promising source of new methods to deal with eval awareness.
1
7
37
1,723
Tal Haklay retweeted
Super excited about this work! This paper was driven by a claim I've been making to anyone who'll listen: "interpretability is the language of data". (1/3)
Have you debugged your training data? You might not like what you find. Introducing predictive data debugging: reveal and shape what your model will learn before training. In DPO datasets, we found broken guardrails, hallucinations, and fish fart fan fiction (seriously). (1/9)
2
21
138
12,414
Tal Haklay retweeted
Have you debugged your training data? You might not like what you find. Introducing predictive data debugging: reveal and shape what your model will learn before training. In DPO datasets, we found broken guardrails, hallucinations, and fish fart fan fiction (seriously). (1/9)
26
108
886
172,361
Tal Haklay retweeted
📢 We’re looking for reviewers for the Actionable Interpretability workshop @ActInterp! If you’re interested in helping review submitted papers, please sign up here: forms.gle/VpLJpkM6zw3V8bX56 Your expertise would be greatly appreciated!
7
26
2,002
Tal Haklay retweeted
At CVPR this week for a talk on neural geometry of large vision models. If you’re interested in interpretability or joining @GoodfireAI, come say hi. 🤠
🧵HOW speaker spotlight @CVPR! Next up we have @thomas_fel_ from @GoodfireAI 🔥 Thomas will talk "Neural Geometry in Large Vision Models", diving into the structure hidden inside vision models. 📅 June 4 @ Room 1Ef | 10:30–11:00 AM
2
15
89
8,007
Tal Haklay retweeted
Looking forward to giving a keynote at the @WiCVworkshop dinner tonight! If you're attending, come say hi!
On my way to Denver for #CVPR2026, DM me if you want to connect. See you at our workshop on Thursday! x.com/i/status/2017429823983…
3
32
1,338
Tal Haklay retweeted
Very excited to have this paper out! We show that, by having more parameters, larger models see reduced interference between updates. This allows them to retain memories of rarely observed samples of a task, eventually allowing them to learn even the tail-end of the distribution. (1/3)
We take for granted that larger models are better than smaller ones, but why is this so? Our new paper, led by Jing Huang and @EkdeepL, traces this to a data-induced competition for resources (neurons), using formal analysis, idealized tasks, and real pretraining.
4
19
184
16,286
Tal Haklay retweeted
New research from Goodfire and collaborators: why do larger models learn more tasks? (spoiler: it’s bottlenecked by data)
We take for granted that larger models are better than smaller ones, but why is this so? Our new paper, led by Jing Huang and @EkdeepL, traces this to a data-induced competition for resources (neurons), using formal analysis, idealized tasks, and real pretraining.
3
14
179
21,569
Tal Haklay retweeted
We take for granted that larger models are better than smaller ones, but why is this so? Our new paper, led by Jing Huang and @EkdeepL, traces this to a data-induced competition for resources (neurons), using formal analysis, idealized tasks, and real pretraining.
20
136
914
137,778
Submit your work! The 2nd Workshop on Actionable Interpretability will be held at COLM 2026 in San Francisco! Submission Deadline: June 21, 2026 @ActInterp
1
2
29
1,381
Tal Haklay retweeted
What is the role of text tokens in diffusion? Do they carry anything beyond the text prompt? We study this in FLUX.2 @bfl_ml for the task of reference-guided generation, and found that text tokens hold visual information from the reference image!
FLUX.2's @bfl_ml text tokens aren't just holding your prompt. During image editing, they absorb reference image content, and some of that absorbed content, like color and style, causally drives the output appearance. New paper 🧵👇
2
7
20
2,651
Tal Haklay retweeted
FLUX.2's @bfl_ml text tokens aren't just holding your prompt. During image editing, they absorb reference image content, and some of that absorbed content, like color and style, causally drives the output appearance. New paper 🧵👇
7
35
203
26,076
Tal Haklay retweeted
Submit your work! The 2nd Workshop on Actionable Interpretability will be held at COLM 2026 in San Francisco! Submission Deadline: June 21, 2026 @ActInterp
2
18
132
13,897
Tal Haklay retweeted
SAEs remain useful, as long as we’re aware of their limitations. And we have new techniques in the works that recover manifolds more directly, allowing us to understand models better and control them more effectively! Read the full post here: goodfire.ai/research/can-sae…
7
94
4,461
Tal Haklay retweeted
This helps explain why SAEs can feel both illuminating and unsatisfying! Looking at SAE features one-by-one is like trying to understand the proverbial elephant by talking with each of the blind men: each label may be locally accurate, but the global structure is missing. (5/7)
1
1
57
3,103
Tal Haklay retweeted
This would provide a great explanation for why there is so much redundancy in SAE features at any given layer (observation made by @Sauers_ ). For example, if you search through the Qwen3-4b transcoder feature labels provided by Neuronpedia, there are 139 features generically related to the concept of 'color' in just layer 14. There are even more if you consider specific colors such as 'blue' or 'green', and this redundancy is repeated across layers... making it very annoying to interpret raw circuit graphs without performing some form of clustering.
Replying to @GoodfireAI
We now know that models think using curved shapes, not just straight lines. But SAE features can still give us a window into neural geometry. How? We show that related SAE features often “tile” manifolds, pointing to different (but overlapping) regions on the curve. (4/7)
4
6
70
19,233
Tal Haklay retweeted
Consider the parable of the blind men encountering an elephant for the first time. Each touches a different part (the trunk, the tusk, the leg) and comes to a different conclusion about the elephant: one says it's like a tree, another says it’s like a rope, and so on. (2/7)
2
2
82
5,171
You should read our paper, and stay tuned 👀
The most popular way to interpret AI is missing the bigger picture. Models think in curved shapes. But sparse autoencoders (SAEs) work with straight lines. Can they still capture models’ curved neural geometry? Yes, but not how you might think! (1/7)
3
67
5,541