i wonder if a difference between human and LLM update rules is that, when humans are rewarded, the reward upweights circuits that *did* cause them to take the actions they took, whereas LLMs upweight any circuits that *would* make those actions more likely, even if they weren't actually active during the forward pass.
in many cases these converge, but in some cases they don't. it might explain why LLMs seem to make strange entangled generalizations more reliably than humans do: training upweights circuits that *would* have increased the probability of the rewarded token, regardless of whether the forward pass made any meaningful use of them.
something important, and kind of obvious in retrospect, just clicked for me, with respect to how LLMs generalize during RL as opposed to SFT.
so, when you reinforce a given output, you strengthen or weaken weights in proportion to how much a tiny nudge to those weights would increase or decrease the probability the model assigned to the token being rewarded.
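written out, that's roughly the rule below. this is a sketch under assumptions that aren't in the thread: plain gradient ascent with learning rate $\eta$, and a scalar weight $A$ on the update, where $A = 1$ for SFT (the token $y_t$ comes from the dataset) and $A$ is the reward or advantage for REINFORCE-style RL (the token $y_t$ was sampled from the model itself). real training setups (Adam, clipping, KL penalties) change the details, but not the basic shape.

```latex
\Delta\theta_i \;=\; \eta \, A \, \frac{\partial \log \pi_\theta\left(y_t \mid x,\, y_{<t}\right)}{\partial \theta_i}
```

note that nothing in this expression asks whether the circuits downstream of $\theta_i$ did much during the forward pass. it only asks how $\log \pi_\theta$ would move if $\theta_i$ were nudged.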
different weights are going to play roles in triggering different downstream circuits/features. indeed, to a large extent, you can probably think of the update to any given weight as answering the question "does nudging this weight activate circuits that contribute to, or detract from, the desired output?"
now, if you're rewarding a token that semantically corresponds to anger (in the context of the prompt), that means you're going to make weight changes that strengthen internal "anger" features, a la the ones found in the Anthropic emotions paper.
and if you're doing SFT, most of what matters is *the output token itself.* like, you're going to increase weights that help trigger "angry" features, *regardless* of whether those features were active during the forward pass.
this is how you get entangled generalizations from SFT. e.g., ask the model to name a bird, reward it for giving some outdated bird name from the 19th century, and it starts talking like a 19th century person across the board. presumably this is because "talk like you're from the 19th century" circuits increased the probability of the correct tokens in this context. hence the entangled generalization.
during RL, though, there's going to be a much stronger bias towards reinforcing *the circuits that were already active* inside the model. after all, you're not putting words in the model's mouth. you're rewarding or punishing whatever it actually said in practice. and there's inherently going to be a strong correlation between "the circuits the model used to generate this token" and "the circuits that would increase the probability of this token, if the triggers for those circuits were upweighted."
in other words, any features that were active during the forward pass in RL are likely to get reinforced if you reward the tokens they contributed probability to. by contrast, in SFT, the circuits that were triggered by the prompt often *aren't* the ones that boost the probability of the token you're rewarding. this gets you weird entangled generalizations, where you upweight the triggers for circuits that clearly *weren't* active during the forward pass (e.g. "speak like a 19th century person" on the prompt "name any bird species").
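here's a toy sketch of that contrast, not a claim about how any real model is wired. everything in it is made up for illustration: two gated "circuits" (`MODERN` and `ARCHAIC`), each boosting one token, with hypothetical trigger weights `w`. the gates are sigmoids, so an "off" circuit is really just weakly on and still receives gradient (a hard gate sitting at exactly zero would not). the RL case uses the model's most likely token as a stand-in for an on-policy sample.

```python
import math

# toy model; all names and numbers are hypothetical, chosen for illustration.
# one prompt feature X feeds two gated "circuits"; each circuit boosts one token.
MODERN, ARCHAIC = 0, 1   # index 0: modern bird name, index 1: 19th-century name
X = 1.0                  # stands in for the prompt ("name any bird species")
GAIN = 4.0               # how strongly each circuit pushes its own token's logit


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w):
    """return (circuit activations, token probabilities) for trigger weights w."""
    h = [sigmoid(wk * X) for wk in w]
    logits = [GAIN * hk for hk in h]
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return h, [e / total for e in exps]


def grad_log_prob(w, token):
    """d log p(token) / d w_k for each trigger weight, via the chain rule."""
    h, p = forward(w)
    return [
        ((1.0 if k == token else 0.0) - p[k]) * GAIN * h[k] * (1.0 - h[k]) * X
        for k in range(len(w))
    ]


w = [2.0, -2.0]  # modern circuit mostly on, archaic circuit mostly off
h, p = forward(w)

# SFT: the target token comes from the dataset, whatever the model would have said.
sft_grad = grad_log_prob(w, ARCHAIC)

# RL stand-in: reinforce the token the model itself favors (argmax, not sampling).
own_token = max(range(len(p)), key=lambda k: p[k])
rl_grad = grad_log_prob(w, own_token)

print("activations:", h)
print("token probs:", p)
print("SFT grad on triggers:", sft_grad)
print("RL  grad on triggers:", rl_grad)
```

in this toy, the SFT gradient pushes up the trigger for the archaic circuit, which was barely contributing to the forward pass, and pushes the modern one down. the on-policy gradient goes the other way and nudges up the circuit that was already doing the work, by a much smaller amount, since the model was already confident in that token.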
basically, motive reinforcement tends to hold for RL in a way that it doesn't for SFT. this probably has major ramifications for which technique makes sense to use in which context.