Some updates to Spiral Bench:
- A more detailed rubric for protective vs delusion-reinforcing behaviours
- Responses evaluated by a judge ensemble: sonnet-4.5, gpt-5 & kimi-k2
- New models evaluated: qwen3-235b, glm-4.6, grok-4-fast, mistral-medium-3.1