Introducing a new side project called Model Regression. It tests daily Claude, GPT, and Grok on various benchmark statistics to determine how well its performing and to identify model degrades over time.
@edskoudis had an idea for model testing before they conducted offensive testing to ensure the model was performing as expected, and
@BlasikRandy pushed me down this road with actually going and doing it.
The main intent here is the frontier models will experience outages, issues, bugs, intentional/unintentional nerfing of the models without notice. You can't typically trust day to day activities in these models for stability, so leveraging this on your daily routine to see how well the model is performing for that day is something I'll be using everyday.
Runs every morning in my DGX sparks environment and automatically updates with how well its performing.
Enjoy!
modelregression.com/
Also open-sourced the project, can run on your own server as well and look at the benchmarks and how they are calculated:
github.com/HackingDave/model…