Todd Nief

Tweets

Todd Nief @toddknife

Jun 5

An LLM can learn an *obsession* (cats, oak trees, Metallica) through finetuning only on sequences of numbers. This phenomenon is called subliminal learning. Why does this happen? Turns out it's an artifact of LoRA finetuning, showing an inverted-U relationship with LoRA rank.

Todd Nief @toddknife

Jun 5

Takeaways: Models are very weird! Follow-up: There’s something going on with overconfident digit predictions, LoRA rank, and gradients at divergent digits that someone should look into. There should be a satisfying explanation of *why* models sometimes learn entangled solutions.

Todd Nief @toddknife

Jun 5

Joint work with @harveyiyun @Bartleby_Kamoi and @universeinanegg! Check out the full paper here: arxiv.org/abs/2606.00831

Subliminal Learning is a LoRA Artifact

Subliminal learning is a phenomenon where language models can transmit behavioral traits to other models through seemingly innocuous data (Cloud et al., 2025). In subliminal learning, a teacher...

arxiv.org

Todd Nief retweeted

Ari Holtzman @universeinanegg

May 7

We trained an LLM trained on an LLM trained on a…🌀🌀🌀 If the original model is sycophantic or just 'weird', will those traits begin to amplify? Yes! But amplification is rare and typically comes at the cost of coherence—except in the case of DPO where things get dicey 🧵

Todd Nief @toddknife

Feb 19

Softmax models (like LLMs) have curves

Kiho Park @KihoPark_

Feb 19

Interpreting and controlling internal representations should be based on how the model actually uses them! Turns out: information geometry makes this precise. We show how, and use it to derive a (provably & empirically) robust strategy for steering. arxiv.org/abs/2602.15293

Todd Nief retweeted

Xiaoyan Bai @Elenal3ai

Feb 10

📖 ≠ 🧪 The Story is Not the Science. Code is submitted but rarely executed during peer review--an issue likely to worsen with research agents.🧑‍🔬 We introduce MechEvalAgent, an execution-grounded evaluation of narrative execution. Verify the science, not just the story. 1/n

Todd Nief retweeted

Yichen (Zach) Wang @YichenZW

22 Dec 2025

Lack of diversity in your LLM generation? (also noted by Artificial Hivemind, best paper @NeurIPSConf) Time to bring your base model back! An inference-time, token-level collaboration between a base and an aligned model can optimize and control diversity and quality!

Todd Nief @toddknife

7 Dec 2025

Most mech interp work relies on activation patching, but patching activations destroys previous computation. What if we want to use a different mechanism on the same residual stream? We propose dynamic weight grafting to interpret finetuned model weights. 🧵 1/n

Todd Nief @toddknife

7 Dec 2025

To conclude: 1. Dynamic weight grafting is a new technique that allows localization of finetuned model behavior to specific token positions and model components 13/n

Todd Nief @toddknife

7 Dec 2025

Blog post: toddnief.com/articles/dynami… Paper: arxiv.org/abs/2506.20746 Code: github.com/toddnief/dynamic-… 14/14

Localizing Finetuned Information in Transformers with Dynamic Weight Grafting - Todd Nief

This is a write-up of “Multiple Streams of Knowledge Retrieval: Enriching and Recalling in Transformers”, work with David Reber, Sean Richardson & Ari Holtzman. Code is available here. When a new...

toddnief.com
