Run ProgramBench by @jyangballin, @OfirPress, and @KLieret with any agent you want on @benchflow_ai.
SWE-Bench is where I started running and learning about benchmarks. My first principles for a good benchmark are that it should 1) reflect or predict how agents or models are used in real life and 2) be challenging for SOTA agents at the time of release.
SkillsBench was a massive success because it predicted something fundamental: agents will be deployed heavily in other domains. Remember the famous bar charts by Anthropic? We got there earlier than that. Another thing it got right is that people will use skills to enable that deployment. Similarly, SWE-Bench is a good example because it predicted agentic coding, and Terminal-Bench is a good example of showcasing the power of a terminal-based harness. The recently launched ProgramBench is interesting because it aims to predict agents generating whole repos from specs.
For ProgramBench, I heard people wanted to 1) customize the agent harness, 2) customize the initial prompts, and 3) customize the verifiers. All three are now doable in benchflow.