To our knowledge, this is the first benchmark to bring verifier-based evaluation (the paradigm behind progress in math, code, and DB agents) into a multi-turn social-strategic domain.
💡 The payoff: you can see where models break, not just whether they do.