There are few moments in my professional career where I have stopped to take a look around and say "wow, we really did that". Today is such a day. Evaluation Cards, a project that I've been working towards for more than two years now, is out in beta today, and it's yours now.
Back in 2024, in a tiny room in Vancouver at the NeurIPS @evaluatingevals workshop, I remember expressing frustration at the broken eval landscape: people were releasing benchmarks with no reproducibility methods, and they were scattered across leaderboards, 200-page system cards, and other paraphernalia. We realized it was a problem, and we formed this wonderful coalition to try and solve it.
Two years later, we have 500 members, a common schema to report evaluations in (that is still being hotly debated and evolving, but that's what nerds do) and finally, we have a tool that we hope the actors across the evals research and technical governance ecosystem can use to confidently consume and work on evals. 100K eval results and several design iterations later, this feels like a good time to open it up to the world.
This is only the beginning! This is a community effort, and for this to be a long-running, sustainable thing, we need help! If you are a model developer who has eval results in your system card, please send us your eval runs and they will automagically appear on the eval card for your model with interactive, embeddable plots and stats! If you are an evaluator of models and are already releasing detailed eval reports for models, send us your eval runs via your official Hugging Face account and we will list you as a verified evaluator and show your results on the leaderboards and model pages within Eval Cards! We have worked hard on a process that is fairly automated, so once you develop an adapter, we can just auto-pull from you every time you release new evals, and they all appear in one single place.
We have a ways to go! Hopefully this standardization work helps the community. We can't wait to hear what you think :D
A lot of incredible people were involved in this work, and I will run out of space to tag them all here, but I want to specifically call out the co-leads on the work: @AnkaReuel, Jenny Chim, Wm Matthew Kennedy, PhD, and my amazing co-hosts at EvalEval, @IreneSolaiman, @BlancheMinerva, and @ZeerakTalat, who created a home for this work and who make doing eval science in the public interest such a fun job. 💙
evalevalai.com/infrastructur…