Submit your work! The 2nd Workshop on Actionable Interpretability will be held at COLM 2026 in San Francisco!
Submission Deadline: June 21, 2026
@ActInterp
Our ICML 2025 workshop on Actionable Interpretability drew massive interest. But the same questions kept coming up: What does "actionable" mean? Is it achievable? How?
We're ready to answer.
🧵
What happens when LLM agents must choose between achieving their goals and avoiding harm to humans in realistic management scenarios? Are LLMs pragmatic, or do they prefer to avoid harming humans?
New paper out: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs 🧵
Many thanks to the @ActInterp organisers for highlighting our work - and congratulations to Pedro, Alex and the other awardees! Sad not to have been there in person, it looked like a fantastic workshop. @AmsterdamNLP @EdinburghNLP
Big congrats to Alex McKenzie, Pedro Ferreira, and their collaborators on receiving Outstanding Paper Awards!
And thanks for the fantastic oral presentations!
Check out the papers here:
1️⃣ Detecting High-Stakes Interactions with Activation Probes - arxiv.org/abs/2506.10805
2️⃣ Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations - arxiv.org/abs/2504.05294
Great to present what's coming next for NDIF at the @actinterp workshop at #ICML2025!
If you missed us, let's chat after the conference. Reach out here: forms.gle/AhTSBNNttA11JVNS6