We’ve wrestled with this a lot at @MerlinAIByFoyer. The “Evals are the CI/CD of AI” analogy doesn’t really hold: CI/CD thrives on stability, while AI shifts week to week. And, as with CI/CD, spinning up heavy evals in prod eats a ton of time. We tried it and gave up past a certain point.
Evaluating agents/ML is essential, but building elaborate scaffolding too early slows you down. So we settled on a compromise: we curate a small, high-signal set of ~10–100 questions/scenarios and test against those. It tells us what is working well, and we can run our pipeline through it quickly, even during the 0-to-1 phase.
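A rough sketch of what that kind of small, high-signal set could look like in Python. This is illustrative only and not our actual pipeline: `EvalCase`, `run_evals`, and `toy_pipeline` are made-up names, and the checks are simple predicates standing in for whatever "acceptable output" means for your product.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    """One curated scenario: an input plus a cheap pass/fail check."""
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def run_evals(
    pipeline: Callable[[str], str], cases: List[EvalCase]
) -> Tuple[float, List[Tuple[str, str]]]:
    """Run every case through the pipeline; return pass rate and failures."""
    if not cases:
        raise ValueError("need at least one eval case")
    failures = []
    for case in cases:
        output = pipeline(case.prompt)
        if not case.check(output):
            failures.append((case.prompt, output))
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures


# Hypothetical stand-in for a real agent/LLM pipeline (assumption).
def toy_pipeline(prompt: str) -> str:
    return prompt.upper()


# In practice this would hold ~10-100 hand-picked scenarios.
CASES = [
    EvalCase("hello", lambda out: out == "HELLO"),
    EvalCase("refund policy", lambda out: "REFUND" in out),
    EvalCase("edge case", lambda out: out.islower()),  # fails on purpose
]

pass_rate, failures = run_evals(toy_pipeline, CASES)
print(f"pass rate: {pass_rate:.0%}, failures: {len(failures)}")
```

The point of keeping it this small is that the whole set runs in seconds and every failure can be read by hand, with no scoring infrastructure to maintain.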
More recently, as @benhylak, @snarkyzk, and the team have been building @raindrop_ai, we’ve gotten real mileage by monitoring failures in production and folding those or similar cases back into the dataset.
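Folding production failures back in can be as simple as a dedupe-and-append step. The sketch below is an assumption about one way to do it, not how raindrop or our own tooling works: `fold_failures`, `normalize`, the dict fields, and the `max_size` cap are all hypothetical.

```python
from typing import Dict, List


def normalize(prompt: str) -> str:
    """Collapse case and whitespace so near-identical prompts dedupe."""
    return " ".join(prompt.lower().split())


def fold_failures(
    dataset: List[Dict[str, str]],
    failures: List[Dict[str, str]],
    max_size: int = 100,
) -> List[Dict[str, str]]:
    """Return a new dataset with unseen production failures appended.

    Stops adding once max_size is reached so the set stays small.
    """
    seen = {normalize(case["prompt"]) for case in dataset}
    merged = list(dataset)
    for failure in failures:
        key = normalize(failure["prompt"])
        if key in seen:
            continue  # already covered by an existing case
        if len(merged) >= max_size:
            break  # keep the set small and high-signal
        merged.append(
            {
                "prompt": failure["prompt"],
                "note": failure.get("note", ""),
                "source": "production",
            }
        )
        seen.add(key)
    return merged


# Hypothetical example data (not real logs).
dataset = [{"prompt": "Summarize this PDF", "note": "", "source": "curated"}]
prod_failures = [
    {"prompt": "summarize   this pdf", "note": "duplicate of a curated case"},
    {"prompt": "Translate the table on page 3", "note": "returned empty output"},
]

merged = fold_failures(dataset, prod_failures)
print(len(merged))
```

The size cap is a deliberate choice here: once the set is full, a new failure should replace a weaker case by hand rather than grow the suite indefinitely.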
Claude Code: no evals
[well known code agent company]: no evals
[well known code agent company 2]: kinda halfassed evals
[leading vibe coding company]: no evals
[ceo of company selling you evals]: mmmmm yess all my top customers do evals, you should do evals
[vc's in love with ceo of evals company]: mmmmm yes all my top founders do evals, must do evals
(NOTE: i -do- also think that evals are impt, but the eval pilled ai engineers have also noticed that it is not a strict requirement for success and, at least for 0-to-1 stage, may even be anticorrelated, think thru why)