🚨 How do we train LLMs to be persuaded *the right amount* (i.e., neither over-stubborn nor over-gullible)?
This means balancing 1⃣ defending against negative persuasion (which makes answers worse: misinfo, jailbreaking, etc.) with 2⃣ accepting positive persuasion (which makes answers better). ➡️➡️➡️ Our multi-agent method, Persuasion-Balanced Training (PBT), recursively creates positive/negative persuasion RLHF data from generated dialogue trees and then trains LLMs to be persuaded when appropriate!
Across 3 models of varying sizes, PBT:
-- improves resistance to misinformation
-- reduces flip-flopping
-- obtains best performance on balanced data
-- makes models better teammates
-- reduces sensitivity to speaker ordering in multi-agent discussion!
👇👇👇
🚨 Excited to announce “Teaching Models to Balance Resisting and Accepting Persuasion”
LLMs need to be able to reject negative persuasion (e.g., misinfo, jailbreaking, etc.)... BUT we argue that they also need to accept positive persuasion (e.g., when they are wrong/unsafe) to be helpful!
We take a 1st step in balancing: 1⃣ defending against negative persuasion (makes answers worse) with 2⃣ accepting positive persuasion (makes answers better), training LLMs to accept persuasion when appropriate.
Our multi-agent method, Persuasion-Balanced Training (PBT), recursively creates RLHF training data from generated dialogue trees and then trains models on positive/negative persuasion.
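To make the "dialogue tree → preference data" idea concrete, here is a rough sketch, not the actual PBT implementation. It assumes each turn in the tree carries a quality score (e.g., whether the resulting answer is correct) and that sibling turns are compared to form chosen/rejected pairs. The `Turn` class, the `score` field, and `preference_pairs` are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One node in a hypothetical dialogue tree."""
    text: str
    score: float  # assumed quality estimate for this turn (not from the paper)
    children: list["Turn"] = field(default_factory=list)


def preference_pairs(node, history=()):
    """Recursively yield (context, chosen, rejected) triples from sibling turns."""
    context = history + (node.text,)
    if len(node.children) >= 2:
        ranked = sorted(node.children, key=lambda t: t.score, reverse=True)
        best, worst = ranked[0], ranked[-1]
        # Only emit a pair when the siblings actually differ in quality.
        if best.score > worst.score:
            yield context, best.text, worst.text
    for child in node.children:
        yield from preference_pairs(child, context)


# Toy tree: the model's answer is right, and a persuader pushes a wrong one.
root = Turn("Persuader: I think the capital of Australia is Sydney.", 0.0, [
    Turn("Model: It is Canberra, so I will keep my answer.", 1.0, [
        Turn("Persuader: Fair point, Canberra it is.", 1.0),
        Turn("Persuader: No, it is definitely Sydney.", 0.0),
    ]),
    Turn("Model: You are right, it is Sydney.", 0.0),
])

pairs = list(preference_pairs(root))
for context, chosen, rejected in pairs:
    print(len(context), "|", chosen, "|", rejected)
```

In this toy tree, resisting the wrong claim is the chosen response. If the model had started out wrong and the persuader were right, the same scoring would make accepting the persuasion the chosen response, which is the balance described above.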
Across three models of varying sizes, PBT:
-- improves resistance to misinformation
-- reduces flip-flopping
-- obtains best performance on balanced data
-- makes models better teammates and more robust to ordering in multi-agent discussion
When pairing 2 LLMs in a multi-agent debate, we find that pairs are generally highly order-dependent: depending on whether the stronger or weaker model goes first, the team lands on the stronger or weaker answer. PBT reduces this variability and improves team performance.
🧵👇