At @AquinF03, we're continuing to make all existing evals and benchmark tools obsolete:
1/3
Custom evals: write your own scorer in Python and get access to model activations and SAE features, so you can do things like:
"check whether a specific feature fired above threshold on a response"
which no external eval harness can do!
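A minimal sketch of what such a scorer could look like, in plain Python. The function names, the `activations[token][feature]` layout, and the default feature index and threshold are all assumptions for illustration, not the product's actual scorer API:

```python
from typing import Sequence


def feature_fired(
    activations: Sequence[Sequence[float]],
    feature_idx: int,
    threshold: float,
) -> bool:
    """Return True if the feature exceeds the threshold on any token.

    Assumed layout: activations[t][f] is the activation of SAE feature f
    at token position t of the response.
    """
    return any(token[feature_idx] > threshold for token in activations)


def score(
    activations: Sequence[Sequence[float]],
    feature_idx: int = 2,      # hypothetical feature of interest
    threshold: float = 0.5,    # hypothetical firing threshold
) -> float:
    """Hypothetical scorer: 1.0 if the feature fired, else 0.0."""
    return 1.0 if feature_fired(activations, feature_idx, threshold) else 0.0
```

Here "fired" means strictly above the threshold on at least one token; a max-pooled or mean-pooled check would be an equally reasonable choice.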
2/3
Benchmark Builder can now weight evals differently within a suite and export results in multiple formats.
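To make the idea concrete, here is a rough sketch assuming that weighting means a per-eval weight applied when aggregating a suite score, and that the export formats include JSON and CSV. The function names and the eval names are hypothetical, and none of this reflects Benchmark Builder's real interface:

```python
import csv
import io
import json


def weighted_score(results: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-eval scores (weights are assumed positive)."""
    total_weight = sum(weights[name] for name in results)
    return sum(results[name] * weights[name] for name in results) / total_weight


def export_json(results: dict[str, float]) -> str:
    """Serialize per-eval scores as a JSON object with sorted keys."""
    return json.dumps(results, sort_keys=True)


def export_csv(results: dict[str, float]) -> str:
    """Serialize per-eval scores as CSV with a header row, sorted by eval name."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["eval", "score"])
    for name, value in sorted(results.items()):
        writer.writerow([name, value])
    return buffer.getvalue()


# Hypothetical suite: accuracy counts three times as much as toxicity.
results = {"toxicity": 0.5, "accuracy": 1.0}
weights = {"toxicity": 1.0, "accuracy": 3.0}
suite_score = weighted_score(results, weights)
```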
3/3
Auto-suggestions: the agent observes your work and proactively suggests the most relevant evals, each runnable in one click.