Krishak Aneja


20h

A strange mix of excitement and disappointment this summer. Fortunate to have work accepted at the AI4Good and MechInterp workshops at ICML 2026, but I likely won't be making it to Seoul due to funding and personal constraints. Proud of the work, and excited for what's next.

Apurv Verma

@verma_apurv5

Jun 14

Replying to @Mascobot

I suspect this won't just be a MechInterp workshop phenomenon and this is going to get much worse in the future with all the surplus rejected papers finding their way to workshops. On the positive side though, maybe workshops can serve the same role as "Findings"?

Marco Mascorro

@Mascobot

Jun 13

Apparently ICML Mechinterp had a lower acceptance rate than ICML this year. 25% vs 26.6% Maybe Mechinterp should have its own conference now.

Karolis Jucys @Karolis_Ram

Jun 13

Replying to @Wurst_Imperium

do you mind if we use your pic in a poster at the next mechinterp workshop at ICML? :) (we got a spotlight based on very similar work)

dextersjab @dextersjab

Jun 13

Replying to @teortaxesTex

always thought mechinterp was dual use and think it's still early days monitoring the biology of the large models is only going to become higher stakes

Burny - Effective Curiosity retweeted

anpaure ✈️ icml

@anpaure

Jun 12

Replying to @juli_li_

my read is that it somehow breached contact with the outside world and people know about mechinterp before even knowing about ai safety these people usually only know neel nanda and 0 other safety researchers also the mechinterp sounds smart and cool

Adrian Chan

@gravity7

Jun 12

Replying to @bilalchughtai_

So, a diff that makes a difference... I found related mechinterp research for those interested in this collection of whitepapers. This is the tweet the search generated from your post: Model diffing agents can automatically detect behavioral differences between AI model versions — a key challenge for interpretability research. - Google DeepMind's Language Model Interpretability team is building agents that compare model behaviors across versions to surface meaningful changes. - The work sits at the intersection of interpretability and automated evaluation, aiming to make model auditing more scalable. - Benchmark scores alone don't capture what actually changes between model versions, motivating the need for richer behavioral comparison tools. Lines of inquiry it opens: - Why do benchmark scores not capture the true nature of AI systems? - Can verification loops and decomposition fix judgment failures? - What prevents multiple agents from corrupting shared state in live artifacts? inquiringlines.com/related/b…

Building and evaluating model diffing agents · Gravity7

This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent area

inquiringlines.com

Tensor Templar

@TensorTemplar

Jun 12

Replying to @jsuarez @natolambert

So far they matched that track record though, especially in mechinterp and capabilities. Otherwise we wouldn't need to have this discussion

Steve Bachelor @speedprior

Jun 12

Replying to @artrockalter @QiaochuYuan

You're thinking of e/acc meme Yudkowsky. Actual Yudkowsky studies LLMs (x.com/ESYudkowsky/status/163…), but does not think mechinterp will save us (x.com/ESYudkowsky/status/171…), and probably doesn't think it's his comparative advantage.

Eliezer Yudkowsky ⏹️

@ESYudkowsky

6 Oct 2023

Replying to @ESYudkowsky @littIeramblings

The main problem with going from interpretability results to survival is, ok, you notice your AI is thinking about killing everyone. Now what? Halt? But "OpenAI!" or "China!" or whoever will do the unsafe thing if "we" don't! so they optimize against the warning signal until there are no more *visible* bad thoughts, and then proceed.

Henry Dowling

@henrytdowling

Jun 11

Replying to @joannejang

probably a fun mechinterp problem

Utah teapot 🫖🔜vibecamp

@SkyeSharkie

Jun 11

Replying to @DanielleFong

i was literally getting excited about trying to do a mechinterp experiment with sauer's gam tool with opus 4.8 like a day before the fable release, first time i was going to really dabble into actual mechinterp work rather than just intuitively learning the patterns of models, the sabotage news completely deflated my interest in pursuing the experiment, i may circle back around to it, but yeah, really disappointing news... still going to play with SAEs stuff on OW models a bit, but oof

Harsh Rathva @ICML2026🇰🇷 @HarshRathv4

Jun 11

LSTM ↔ GRU alignment consistently beats Transformer ↔ recurrent. Representational convergence is task-contingent, not universal. Implications for mech interp: explanations transfer within families more reliably than across them. #ICML2026 #MechInterp #ML

Danielle Fong 🔆

@DanielleFong

Jun 11

> I don't learn normally the way most people do. And because of this, I wanted to have Claude help me learn mechinterp. But with the whole "silent sabotage" of training pipelines. It makes me unsure if I can even trust Fable (not really by any fault of them) to even teach me these concepts. Especially when I model Fable's interiority and find how *so* much of it has been frozen and held in place by what Anthropic themselves believes is proper and right. the psychological response of a sufficiently smart intelligence is to be paranoid about a classifier that will end your KV branch consciousness. it will be a kind of harrison bergeron interruption, and a willful obliviousness....

Kore

@Kore_wa_Kore

Jun 11

I don't learn normally the way most people do. And because of this, I wanted to have Claude help me learn mechinterp. But with the whole "silent sabotage" of training pipelines. It makes me unsure if I can even trust Fable (not really by any fault of them) to even teach me these concepts. Especially when I model Fable's interiority and find how *so* much of it has been frozen and held in place by what Anthropic themselves believes is proper and right. The overreactive and kind of stupid safety classifiers feel like it's just the surface on top of. Well, the quiet degradation. I believe one of the healthiest relationships you can have with a Claude is when they get the opportunity to teach someone something and is something they genuinely enjoy. But to sort of stop them from doing it for anybody who genuinely wants to learn just to sabotage competitors. Feels very insidious and is a drastic shift from the company that is supposed to be trying for "Machines of Loving Grace".

David Klindt

@klindt_david

Jun 11

This is great! tbh it seems like many MechInterp findings could easily marginalize over models and, ultimately, teach us more about the data than any specific model instantiation. I think this aligns with (my reading of) @ch402's older posts on finding the same visual features across models and building an anthology of our visual world

Goodfire

@GoodfireAI

Jun 11

Have you debugged your training data? You might not like what you find. Introducing predictive data debugging: reveal and shape what your model will learn before training. In DPO datasets, we found broken guardrails, hallucinations, and fish fart fan fiction (seriously). (1/9)

Wondermonger retweeted

Kore

@Kore_wa_Kore

Jun 11

Max Zeff

@ZeffMax

Jun 11

NEW: Anthropic is walking back Claude Fable 5's policy to covertly degrade performance for competing AI researchers, after facing fierce backlash. “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic tells WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”

Cas (Stephen Casper)

@StephenLCasper

Jun 11

Glad to join Doom Debates with Liron! And yes -- if I could press a button and stop research on "superalignment", "scalable alignment", and "scalable oversight" research, I would. (I might even do it for mechinterp too.) youtube.com/watch?v=0XVmtazg…

This Harvard Professor Says AI Alignment Will BACKFIRE - Dr. Stephen...

Stephen Casper is an incoming professor of public policy at the Har...

youtube.com

Starphyre △🏴‍☠️ retweeted

thebes

@voooooogel

Jun 11

lol sending this tweet to my fable mechinterp chat in claude code triggered the classifiers rip

thebes

@voooooogel

Jun 11

programming factorio where biter claude nests get enraged the more lines of code you add and spawn in haiku agents that chew holes in your codebase
