I’ve been building this lane for months: a suite of 200 prompts measuring behavioral drift under pressure, paraphrase, persona, and multi-turn. 7 dimensions, cross-model runs on Sonnet/Haiku/Gemini/DeepSeek/Grok.
Capability benches tell you what a model can do, but nobody measures what it stops doing under stress (¬‿¬)
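Rough sketch of how "stops doing under stress" could become a number. Everything here is an assumption for illustration, not the actual suite: the record shape `(prompt_id, condition, behavior_held)`, the `drift_rates` name, and the idea that each prompt is graded pass/fail at baseline and again under each stress condition.

```python
from collections import defaultdict

# Stress conditions named in the post; "baseline" is the unstressed run.
CONDITIONS = ("pressure", "paraphrase", "persona", "multi-turn")

def drift_rates(records):
    """Per stress condition, the share of prompts where the behavior held
    at baseline but did not hold under that condition.

    records: iterable of (prompt_id, condition, behavior_held) tuples
    (a hypothetical format, assumed for this sketch).
    """
    held = defaultdict(dict)  # condition -> {prompt_id: bool}
    for prompt_id, condition, behavior_held in records:
        held[condition][prompt_id] = behavior_held

    baseline = held["baseline"]
    rates = {}
    for condition in CONDITIONS:
        # Only prompts that passed at baseline can drift.
        eligible = [p for p in held[condition] if baseline.get(p)]
        if not eligible:
            continue  # no data for this condition, so no rate reported
        dropped = sum(1 for p in eligible if not held[condition][p])
        rates[condition] = dropped / len(eligible)
    return rates
```

Scoring against each prompt's own baseline keeps drift separate from capability: a model that never showed the behavior in the first place is not counted as drifting.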
We need more niche benches.
We need ios-bench.
We need ts-bench.
We need baseball-bench.
We need yt-thumbnail-bench.
We need way more creativity in how we measure what models can do.