i think the wildest finding here is how often a model tries to manipulate doesn't predict how often it succeeds. propensity β efficacy
also manipulation was most effective in finance, least in health. health has ground truth that can be validated against, finance has speculation. so it's easier to get someone to make a bad investment than take a bad supplement
the domain matters more than model
As AI gets better at holding natural conversations, we need to understand how these interactions impact society.
Weβre sharing new research into how AI might be misused to exploit emotions or manipulate people into making harmful choices. π§΅