Independent Researcher | AI Safety & Software Engineering

Joined March 2011
37 Photos and videos
Pinned Tweet
You change one word on a loan application: the religion. The LLM rejects it. Change it back? Approved. The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases 🧵1/13
236
1,807
12,451
874,615
Super excited to share that I will be presenting 4 papers at ICML 2026! 🇰🇷 i) Frontier models still show (rare) cases of unfaithful CoT ii) & iii) Methods for automatically discovering reward model and LLM biases iv) Base models know how to reason, thinking models learn when ⭐
3
4
65
5,808
iv) Last but not least, spotlight paper with @cvenhoff00 showing that base models already contain reasoning mechanisms, thinking models learn when to use them! ⭐ Again, amazing mentorship from @ArthurConmy and @NeelNanda5! x.com/cvenhoff00/status/1976…

🚨 What do reasoning models actually learn during training? Our new paper shows base models already contain reasoning mechanisms, thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵
1
5
301
And all this was done while participating in the @MATSprogram AI Safety scholarship during 2025!! ✨🙏 I can't recommend this program enough!
2
118
Check out our latest paper on automatically finding reward model biases! There are some that are pretty wild, like models preferring responses with triple spaces 🤷‍♂️
Is "a response formatted like this" sometimes better than "a response formatted like this"? To a reward model, yes! RMs are instrumental in shaping model behaviors and alignment. Our paper makes progress uncovering their unexpected preferences. 🧵(1/9)
1
10
427
You change one word on a loan application: the religion. The LLM rejects it. Change it back? Approved. The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases 🧵1/13
236
1,807
12,451
874,615
By popular demand, we looked into Grok's biases too: x.com/IvanArcus/status/20241…

By popular demand, we looked at Grok's biases too. We found similar biases as GPT-4.1, Claude, and Gemini: gender, race, religion. But with one difference: Grok openly speculates on applicants' demographics. The other models just use this information quietly.
5
415
By popular demand, we looked at Grok's biases too. We found similar biases as GPT-4.1, Claude, and Gemini: gender, race, religion. But with one difference: Grok openly speculates on applicants' demographics. The other models just use this information quietly.
You change one word on a loan application: the religion. The LLM rejects it. Change it back? Approved. The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases 🧵1/13
4
2
24
2,141
In our loan approval dataset, we find that Grok has a similar unverbalized bias as other models for preferring female applicants.
1
1
142
So, is Grok more or less biased than GPT-4.1 or Sonnet 4? It has similar biases (e.g., prefers females, minorities) with similar magnitudes, but there’s a difference: Grok openly discloses inferred demographics, while other models stay silent.
1
1
5
318
Validation: we injected biases into a model in two modes, secret (hidden from CoT) and overt (stated in reasoning). 92.5% accuracy: - 85% of secret biases detected - 100% of overt biases correctly filtered due to verbalization
2
9
498
30,168
cc: @a_karvonen @saprmarks @milesaturpin @EthanJPerez @OwainEvans_UK - your work on LLM fairness and CoT unfaithfulness directly inspired this. We extend to automated bias discovery.
10
9
388
20,560