Also very convenient for quick, rough model comparisons.
Evaluation quality is worse than in a proper benchmarking environment because of dynamic and poisoned system prompts, widely differing harnesses, and random sampling, but it is still good enough to tell which direction the differences point.
Spent too much on tokens.
Back to pasting context into chat apps like a caveman. 🥀