Sonia Murthy

Tweets

Sonia Murthy @soniakmurthy

7 Dec 2025

Excited to be presenting our work on using cognitive models to interpret pluralistic values in LLMs once again as a spotlight talk 🌟 at the NeurIPS CogInterp workshop! Come by upper level room 5AB today and check out the paper here: arxiv.org/abs/2506.20666

Cognitive models can reveal interpretable value trade-offs in...

Value trade-offs are an integral part of human decision-making and language use, however, current tools for interpreting such dynamic and multi-faceted notions of values in language models are...

arxiv.org

CogInterp Workshop @ NeurIPS 2025 @CogInterp

7 Dec 2025

Replying to @CogInterp

The spotlight talks will cover all aspects of interpreting cognition in deep learning models: from behavior to algorithms to representations! Also check out the list of poster presentations at coginterp.github.io/neurips2… (3/3)

Sonia Murthy @soniakmurthy

1 Dec 2025

bruce is great at making research resources and this one has been a huge help for my human studies in the stream! check it out ✨

Bruce W. Lee

@BruceWLee2

30 Nov 2025

New AI Control Toolkit - Mini Control Arena

For the past few months, we have been doing research with our custom AI Control evaluation library, Mini Control Arena. Mini Control Arena is a ground-up rewrite of UK AISI Control Arena for a much simpler code structure.

We are open-sourcing the codebase and hope it helps with your experiments, too! github.com/brucewlee/mini-co…

Sonia Murthy retweeted

Tomek Korbak

@tomekkorbak

30 Nov 2025

My rockstar MATS mentee @BruceWLee2 has just open-sourced his sleek and elegant codebase for AI control research, ppl should give it a try!

Bruce W. Lee

@BruceWLee2

30 Nov 2025

Sonia Murthy retweeted

Eric Bigelow @EricBigelow

11 Nov 2025

📝 New paper! Two strategies have emerged for controlling LLM behavior at inference time: in-context learning (ICL; i.e. prompting) and activation steering. We propose that both can be understood as altering model beliefs, formally in the sense of Bayesian belief updating. 1/9

Sonia Murthy retweeted

Kushin Mukherjee @kushin_m

21 Oct 2025

Zach did a stellar job on our new paper looking at what recipes make for language models that are representationally aligned with humans! Read his tweetprint and recruit him for grad school!

Zach Studdiford @ZachStuddiford

21 Oct 2025

We’re drowning in language models — there are over 2 mil. of them on Huggingface! Can we use some of them to understand which computational ingredients — architecture, scale, post-training, etc. – help us build models that align with human representations? Read on to find out 🧵

Sonia Murthy @soniakmurthy

9 Oct 2025

Excited to present our new paper as a spotlight talk 🌟 at the Pragmatic Reasoning in LMs workshop at #COLM2025 this Friday! 🍁 Come by room 520B @ 11:30am tomorrow to learn more about how LLMs' pluralistic values evolve over reasoning budgets and alignment 🧵

Sonia Murthy @soniakmurthy

9 Oct 2025

We also trace the evolution of value trade-offs during alignment by evaluating model checkpoints for 8 unique combinations of base model x feedback dataset x alignment algorithm. We see the largest shifts in values early in training, with the strongest effects coming from base model choice.

Sonia Murthy @soniakmurthy

9 Oct 2025

Thanks to my lovely collaborators @rosieyzh, @_jennhu, @ShamKakade6, @m_wulfmeier, Peng Qian, and @TomerUllman and the Kempner Institute! 🧠 [end]

Sonia Murthy retweeted

Apoorv Khandelwal @apoorvkh

6 Oct 2025

In our new paper, we ask whether language models solve compositional tasks using compositional mechanisms. 🧵

Sonia Murthy @soniakmurthy

1 May 2025

Presenting this today (5/1) at the 4pm poster session (Hall 3) at #NAACL2025! Come chat about alignment, personalization, and all things cognitive science 🐟

Sonia Murthy @soniakmurthy

10 Feb 2025

(1/9) Excited to share my recent work on "Alignment reduces LM's conceptual diversity" with @TomerUllman and @jennhu, to appear at #NAACL2025! 🐟 We want models that match our values...but could this hurt their diversity of thought? Preprint: arxiv.org/abs/2411.04427

Sonia Murthy retweeted

Kempner Institute at Harvard University @KempnerInst

10 Feb 2025

NEW blog post: Do modern #LLMs capture the conceptual diversity of human populations? #KempnerInstitute researchers find #alignment reduces conceptual diversity of language models. Read more: bit.ly/4hNjtiI @soniakmurthy @tomerullman @_jennhu

Sonia Murthy @soniakmurthy

10 Feb 2025

Sonia Murthy @soniakmurthy

10 Feb 2025

(9/9) Code and data for our experiments can be found at: github.com/skmur/onefish-two… Preprint: arxiv.org/abs/2411.04427 Also, check out our feature in the @KempnerInst Deeper Learning Blog! bit.ly/417WVDL

GitHub - skmur/onefish-twofish: One fish, two fish, but not the whole sea: Alignment reduces...

One fish, two fish, but not the whole sea: Alignment reduces language models' conceptual diversity (NAACL 2025) - skmur/onefish-twofish

github.com

Sonia Murthy @soniakmurthy

10 Feb 2025

Many thanks to my collaborators and @KempnerInst for helping make this idea come to life!🌱
