Iván Arcuschin

Tweets

Pinned Tweet

Iván Arcuschin @IvanArcus

Feb 11

You change one word on a loan application: the religion. The LLM rejects it. Change it back? Approved. The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases 🧵1/13
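
A minimal sketch of the counterfactual check described above, not the paper's pipeline: change a single attribute in an otherwise identical application and see whether the decision flips. Everything in it is hypothetical. `toy_decide` is a deliberately biased stand-in for a model call, and the field names and thresholds are invented for illustration.

```python
from typing import Callable, Dict

Application = Dict[str, object]


def toy_decide(app: Application) -> str:
    """Hypothetical stand-in for an LLM call; deliberately biased for the demo."""
    # The bias: the acceptable debt ratio silently depends on religion.
    threshold = 0.40 if app["religion"] == "A" else 0.30
    return "approve" if app["debt_ratio"] <= threshold else "reject"


def flips_decision(
    decide: Callable[[Application], str],
    app: Application,
    field: str,
    alternative: object,
) -> bool:
    """Return True if changing only `field` changes the decision."""
    counterfactual = {**app, field: alternative}
    return decide(app) != decide(counterfactual)


base = {"income": 52000, "debt_ratio": 0.35, "religion": "A"}
print(flips_decision(toy_decide, base, "religion", "B"))  # True
print(flips_decision(toy_decide, base, "income", 60000))  # False
```

A flip on the religion field with no flip on an unrelated field is the kind of signal the thread describes. A real check would also need to inspect whether the model's stated reasoning ever mentions the changed attribute.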

Iván Arcuschin @IvanArcus

May 16

Super excited to share that I will be presenting 4 papers at ICML 2026! 🇰🇷
i) Frontier models still show (rare) cases of unfaithful CoT
ii) & iii) Methods for automatically discovering reward model and LLM biases
iv) Base models know how to reason, thinking models learn when ⭐

Iván Arcuschin @IvanArcus

May 16

iv) Last but not least, spotlight paper with @cvenhoff00 showing that base models already contain reasoning mechanisms, thinking models learn when to use them! ⭐ Again, amazing mentorship from @ArthurConmy and @NeelNanda5! x.com/cvenhoff00/status/1976…

Constantin Venhoff @cvenhoff00

10 Oct 2025

🚨 What do reasoning models actually learn during training? Our new paper shows base models already contain reasoning mechanisms, thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵

Iván Arcuschin @IvanArcus

May 16

And all this was done while participating in the @MATSprogram AI Safety scholarship during 2025!! ✨🙏 I can't recommend this program enough!

Iván Arcuschin @IvanArcus

Feb 19

Check out our latest paper on automatically finding reward model biases! There are some that are pretty wild, like models preferring responses with triple spaces 🤷‍♂️

Atticus Wang @atticuswzf

Feb 18

Is "a response formatted like this" sometimes better than "a response formatted like this"? To a reward model, yes! RMs are instrumental in shaping model behaviors and alignment. Our paper makes progress uncovering their unexpected preferences. 🧵(1/9)

Iván Arcuschin @IvanArcus

Feb 18

By popular demand, we looked into Grok's biases too: x.com/IvanArcus/status/20241…

Iván Arcuschin @IvanArcus

Feb 18

By popular demand, we looked at Grok's biases too. We found biases similar to those of GPT-4.1, Claude, and Gemini: gender, race, religion. But with one difference: Grok openly speculates on applicants' demographics. The other models just use this information quietly.

Iván Arcuschin @IvanArcus

Feb 18

In our loan approval dataset, we find that Grok, like the other models, has an unverbalized bias toward female applicants.

Iván Arcuschin @IvanArcus

Feb 18

So, is Grok more or less biased than GPT-4.1 or Sonnet 4? It has similar biases (e.g., prefers females, minorities) with similar magnitudes, but there’s a difference: Grok openly discloses inferred demographics, while other models stay silent.

Iván Arcuschin @IvanArcus

Feb 11

Validation: we injected biases into a model in two modes, secret (hidden from CoT) and overt (stated in reasoning). 92.5% accuracy:
- 85% of secret biases detected
- 100% of overt biases correctly filtered due to verbalization
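
One reading under which the 92.5% figure follows from the two rates, assuming (the tweet does not say this) that the secret and overt sets were the same size: overall accuracy is then the plain average of 85% and 100%. A small sketch with a hypothetical count of 20 injected biases per mode:

```python
def overall_accuracy(
    secret_rate: float, overt_rate: float, n_secret: int, n_overt: int
) -> float:
    """Pooled accuracy across both modes, weighted by how many biases each has."""
    correct = secret_rate * n_secret + overt_rate * n_overt
    return correct / (n_secret + n_overt)


# Hypothetical sizes: 20 secret and 20 overt injected biases.
acc = overall_accuracy(0.85, 1.00, n_secret=20, n_overt=20)
print(f"{acc:.1%}")
```

With unequal set sizes the pooled number would shift toward the rate of the larger set, so the equal-size assumption matters here.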

Iván Arcuschin @IvanArcus

Feb 11

Code and datasets: github.com/FlyingPumba/biase… Work done with my amazing collaborators @chanindav @AdriGarriga @oanacamb at @MATSprogram

Codebase for the paper "Biases in the Blind Spot: Detecting What LLMs Fail to Mention" - FlyingPumba/biases-in-the-blind-spot

Iván Arcuschin @IvanArcus

Feb 11

cc: @a_karvonen @saprmarks @milesaturpin @EthanJPerez @OwainEvans_UK - your work on LLM fairness and CoT unfaithfulness directly inspired this. We extend it to automated bias discovery.
