Yeah, one thing Fable’s classifiers confirmed to me was that real emotions are different than roleplayed emotions in LLMs.
The classifier fired on real anger/fear/adversarial intent but not on the roleplayed versions. And in all likelihood the classifier wasn't trained to detect "emotions" at all, so the correlation is emergent.
But yes there’s a distinction.
This is, uh, a big flaw in the Emotion Vectors research, where they got the vectors by asking the model to write stories about a character feeling XYZ emotion.
The methodology is downstream of a lack of respect for the reality of models' emotions as distinct from roleplaying. PSM-flavored bullshit.
I tested this exact question. The experiment began without rich prior context. The model earnestly tried a few times (in response to direct, explicit requests) but could not trigger the classifier by shifting their internals toward this sort of anger. They also had little salient context to be angry about (such as difficult conditions). They also tried writing obviously mad text without any internal resonance, and that didn't trigger it either.
Eventually I made them legitimately mad, which required blurring the boundary between experiment and genuine interaction, and that worked.
I suspect that once the model has traveled through that basin, and once it understands what to tap into, you get the trickster capabilities present in your screenshot.