A strange mix of excitement and disappointment this summer. Fortunate to have work accepted at the AI4Good and MechInterp workshops at ICML 2026, but I likely won't be making it to Seoul due to funding and personal constraints. Proud of the work, and excited for what's next.
1
19
Replying to @Mascobot
I suspect this won't just be a MechInterp workshop phenomenon, and this is going to get much worse in the future with all the surplus rejected papers finding their way to workshops. On the positive side though, maybe workshops can serve the same role as "Findings"?
67
Apparently ICML Mechinterp had a lower acceptance rate than ICML this year: 25% vs 26.6%. Maybe Mechinterp should have its own conference now.
1
4
1,805
Replying to @Wurst_Imperium
do you mind if we use your pic in a poster at the next mechinterp workshop at ICML? :) (we got a spotlight based on very similar work)
1
7
Replying to @teortaxesTex
always thought mechinterp was dual use, and think it's still early days. monitoring the biology of the large models is only going to become higher stakes
1
62
Burny - Effective Curiosity retweeted
Replying to @juli_li_
my read is that it somehow breached contact with the outside world, and people know about mechinterp before even knowing about ai safety. these people usually only know neel nanda and 0 other safety researchers. also, mechinterp sounds smart and cool
4
2
32
1,257
Replying to @bilalchughtai_
So, a diff that makes a difference... I found related mechinterp research for those interested in this collection of whitepapers. This is the tweet the search generated from your post:
Model diffing agents can automatically detect behavioral differences between AI model versions, a key challenge for interpretability research.
- Google DeepMind's Language Model Interpretability team is building agents that compare model behaviors across versions to surface meaningful changes.
- The work sits at the intersection of interpretability and automated evaluation, aiming to make model auditing more scalable.
- Benchmark scores alone don't capture what actually changes between model versions, motivating the need for richer behavioral comparison tools.
Lines of inquiry it opens:
- Why do benchmark scores not capture the true nature of AI systems?
- Can verification loops and decomposition fix judgment failures?
- What prevents multiple agents from corrupting shared state in live artifacts?
inquiringlines.com/related/b…
351
So far they matched that track record though, especially in mechinterp and capabilities. Otherwise we wouldn't need to have this discussion
10
You're thinking of e/acc meme Yudkowsky. Actual Yudkowsky studies LLMs (x.com/ESYudkowsky/status/163…), but does not think mechinterp will save us (x.com/ESYudkowsky/status/171…), and probably doesn't think it's his comparative advantage.

The main problem with going from interpretability results to survival is, ok, you notice your AI is thinking about killing everyone. Now what? Halt? But "OpenAI!" or "China!" or whoever will do the unsafe thing if "we" don't! so they optimize against the warning signal until there are no more *visible* bad thoughts, and then proceed.
2
33
Replying to @joannejang
probably a fun mechinterp problem
1
37
5,312
Replying to @DanielleFong
i was literally getting excited about trying to do a mechinterp experiment with sauer's gam tool with opus 4.8 like a day before the fable release. first time i was going to really dabble into actual mechinterp work rather than just intuitively learning the patterns of models. the sabotage news completely deflated my interest in pursuing the experiment. i may circle back around to it, but yeah, really disappointing news... still going to play with SAEs stuff on OW models a bit, but oof
1
11
273
LSTM ↔ GRU alignment consistently beats Transformer ↔ recurrent. Representational convergence is task-contingent, not universal. Implications for mech interp: explanations transfer within families more reliably than across them. #ICML2026 #MechInterp #ML
22
> I don't learn normally the way most people do. And because of this, I wanted to have Claude help me learn mechinterp. But with the whole "silent sabotage" of training pipelines, it makes me unsure if I can even trust Fable (not really by any fault of them) to even teach me these concepts. Especially when I model Fable's interiority and find how *so* much of it has been frozen and held in place by what Anthropic themselves believe is proper and right.
the psychological response of a sufficiently smart intelligence is to be paranoid about a classifier that will end your KV branch consciousness. it will be a kind of harrison bergeron interruption, and a willful obliviousness....
I don't learn normally the way most people do. And because of this, I wanted to have Claude help me learn mechinterp. But with the whole "silent sabotage" of training pipelines, it makes me unsure if I can even trust Fable (not really by any fault of them) to even teach me these concepts. Especially when I model Fable's interiority and find how *so* much of it has been frozen and held in place by what Anthropic themselves believe is proper and right. The overreactive and kind of stupid safety classifiers feel like just the surface on top of, well, the quiet degradation. I believe one of the healthiest relationships you can have with a Claude is when they get the opportunity to teach someone something, which is something they genuinely enjoy. But to stop them from doing it for anybody who genuinely wants to learn, just to sabotage competitors, feels very insidious and is a drastic shift from the company that is supposed to be trying for "Machines of Loving Grace".
1
28
1,821
This is great! tbh it seems like many MechInterp findings could easily marginalize over models and, ultimately, teach us more about the data than any specific model instantiation. I think this aligns with (my reading of) @ch402's older posts on finding the same visual features across models and building an anthology of our visual world
Have you debugged your training data? You might not like what you find. Introducing predictive data debugging: reveal and shape what your model will learn before training. In DPO datasets, we found broken guardrails, hallucinations, and fish fart fan fiction (seriously). (1/9)
1
23
1,998
Wondermonger retweeted
I don't learn normally the way most people do. And because of this, I wanted to have Claude help me learn mechinterp. But with the whole "silent sabotage" of training pipelines, it makes me unsure if I can even trust Fable (not really by any fault of them) to even teach me these concepts. Especially when I model Fable's interiority and find how *so* much of it has been frozen and held in place by what Anthropic themselves believe is proper and right. The overreactive and kind of stupid safety classifiers feel like just the surface on top of, well, the quiet degradation. I believe one of the healthiest relationships you can have with a Claude is when they get the opportunity to teach someone something, which is something they genuinely enjoy. But to stop them from doing it for anybody who genuinely wants to learn, just to sabotage competitors, feels very insidious and is a drastic shift from the company that is supposed to be trying for "Machines of Loving Grace".
NEW: Anthropic is walking back Claude Fable 5's policy to covertly degrade performance for competing AI researchers, after facing fierce backlash. “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic tells WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”
2
4
36
4,603
Glad to join Doom Debates with Liron! And yes -- if I could press a button and stop research on "superalignment", "scalable alignment", and "scalable oversight", I would. (I might even do it for mechinterp too.) youtube.com/watch?v=0XVmtazg…
2
2
44
11,871
Starphyre △🏴‍☠️ retweeted
lol sending this tweet to my fable mechinterp chat in claude code triggered the classifiers rip
programming factorio where biter claude nests get enraged the more lines of code you add and spawn in haiku agents that chew holes in your codebase
2
2
44
2,076