Cool new work led by Bilal:
Interp researchers have tried complicated techniques to find differences between two models, but we find that just asking an agent works well.
This is part of a sequence of posts we're releasing describing recent work on our team; stay tuned for more!
New research update from the Google DeepMind Language Model Interpretability team.
We build and evaluate dead-simple, open-ended model diffing agents tasked with studying the behavioural differences between two models, and find them promising in practice.