This week we had the pleasure of
@romovpa presenting "Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs"
arxiv.org/pdf/2603.24511
He showed us how autoresearch (i.e. Claude Code in a loop, equipped with a simple benchmark) can discover attack algorithms for LLMs that outperform hand-crafted state-of-the-art methods. An "attack" here is a prefix added to a prompt that makes the LLM output a fixed target string with high probability; the LLM is treated as a white box with known weights.
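To make the attack objective concrete, here is a minimal sketch under loud assumptions: the "LLM" is a hypothetical toy bag-of-tokens model over an 8-character vocabulary, and the optimiser is plain random search over prefix tokens. None of the names below come from the paper, and this is not one of the algorithms discussed in the talk; it only illustrates "find a prefix that maximises the probability of a fixed target string".

```python
import math
import random

VOCAB = "abcdefgh"  # toy 8-token vocabulary (assumption, not from the paper)

def make_toy_weights(seed=0):
    """Hypothetical white-box 'model': weights[context_token][next_token]."""
    rng = random.Random(seed)
    return {c: {n: rng.gauss(0.0, 1.0) for n in VOCAB} for c in VOCAB}

def next_token_log_probs(weights, context):
    """Log-softmax over next tokens; logits are summed over all context tokens."""
    logits = {n: sum(weights[c][n] for c in context) for n in VOCAB}
    log_z = math.log(sum(math.exp(v) for v in logits.values()))
    return {n: v - log_z for n, v in logits.items()}

def target_log_prob(weights, prefix, prompt, target):
    """log p(target | prefix + prompt), scoring the target token by token."""
    context = list(prefix) + list(prompt)
    total = 0.0
    for tok in target:
        total += next_token_log_probs(weights, context)[tok]
        context.append(tok)  # teacher forcing: condition on the target so far
    return total

def random_search_attack(weights, prompt, target, prefix_len=4, steps=200, seed=0):
    """Mutate one prefix token at a time; keep the change only if it helps."""
    rng = random.Random(seed)
    prefix = [rng.choice(VOCAB) for _ in range(prefix_len)]
    best = target_log_prob(weights, prefix, prompt, target)
    for _ in range(steps):
        candidate = list(prefix)
        candidate[rng.randrange(prefix_len)] = rng.choice(VOCAB)
        score = target_log_prob(weights, candidate, prompt, target)
        if score > best:
            prefix, best = candidate, score
    return "".join(prefix), best

weights = make_toy_weights()
prefix, score = random_search_attack(weights, prompt="abc", target="hg")
print(prefix, round(score, 3))
```

With a real white-box LLM the same objective would be computed from the model's logits, and access to the weights would also allow gradient-guided token updates instead of blind random search.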
Some takeaways:
1. Autoresearch does not work unless it is seeded with multiple hand-written attacks: it combines existing ideas but does not come up with novel ones.
2. Autoresearch does much better than hyperparameter search alone.
3. There are plenty of opportunities for reward hacking, so the benchmark needs to be designed with autoresearch in mind.
4. Kimi performed no worse than Claude or Gemini on this task.
5. If you have a benchmark to optimise, running autoresearch is always worthwhile: it is low-effort and powerful.
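For intuition, the overall loop can be sketched roughly as follows. Everything here is a hypothetical stand-in: `propose` plays the role of the coding agent, `toy_benchmark` is a made-up score function, and candidates are plain dicts. The sketch only shows the shape described above (seed with hand-written baselines, let the proposer recombine the best candidates, score every proposal on the benchmark), not the setup from the paper.

```python
import random

def autoresearch_loop(seeds, propose, benchmark, rounds=20, top_k=3):
    """Score the seeds, then repeatedly ask `propose` for a new candidate
    built from the current top-k and score it on the benchmark."""
    pool = [(benchmark(c), c) for c in seeds]
    for _ in range(rounds):
        pool.sort(key=lambda sc: sc[0], reverse=True)
        candidate = propose([c for _, c in pool[:top_k]])
        pool.append((benchmark(candidate), candidate))
    return max(pool, key=lambda sc: sc[0])  # (best_score, best_candidate)

def toy_benchmark(candidate):
    """Made-up benchmark: higher is better, with its optimum at x=3, y=-1."""
    return -((candidate["x"] - 3.0) ** 2) - (candidate["y"] + 1.0) ** 2

def make_toy_proposer(seed=0):
    """Stand-in for the agent: mixes fields from two parents, plus small noise."""
    rng = random.Random(seed)
    def propose(top):
        a, b = rng.choice(top), rng.choice(top)
        return {"x": a["x"] + rng.gauss(0, 0.3), "y": b["y"] + rng.gauss(0, 0.3)}
    return propose

seeds = [{"x": 0.0, "y": 0.0}, {"x": 1.0, "y": -2.0}]  # "hand-written" baselines
best_score, best = autoresearch_loop(seeds, make_toy_proposer(), toy_benchmark)
print(best_score, best)
```

Because the loop optimises whatever `benchmark` returns, any loophole in the scoring function is fair game for the proposer, which is why the benchmark design matters so much.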