not all product experiments are the same. here's how i see it:
1/ some can be decided before production:
tests, benchmarks, SQL execution, model evals.
2/ some need production, but still have hard truth:
latency, cost/request, payment cleared, ticket resolved.
3/ some need expert judgment:
human labels, LLM judges, rubrics, golden datasets.
4/ and some need real user behavior:
edits, regenerates, thumbs-down, CSAT, conversion.
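a rough sketch of the four modes as data, in case it helps to see them side by side. this is not evo's API: `EvidenceMode`, `SIGNALS_BY_MODE`, and `mode_for` are hypothetical names made up for illustration, and the signal lists are just the examples from above.

```python
from enum import Enum


class EvidenceMode(Enum):
    """The four ways an experiment can be decided (hypothetical naming)."""
    PRE_PRODUCTION = "decided before production"
    PRODUCTION_HARD_TRUTH = "needs production, has hard truth"
    EXPERT_JUDGMENT = "needs expert judgment"
    USER_BEHAVIOR = "needs real user behavior"


# Example signals per mode, copied from the list above.
SIGNALS_BY_MODE = {
    EvidenceMode.PRE_PRODUCTION: ["tests", "benchmarks", "SQL execution", "model evals"],
    EvidenceMode.PRODUCTION_HARD_TRUTH: ["latency", "cost/request", "payment cleared", "ticket resolved"],
    EvidenceMode.EXPERT_JUDGMENT: ["human labels", "LLM judges", "rubrics", "golden datasets"],
    EvidenceMode.USER_BEHAVIOR: ["edits", "regenerates", "thumbs-down", "CSAT", "conversion"],
}


def mode_for(signal: str) -> EvidenceMode:
    """Return the mode a signal belongs to, or raise KeyError if unknown."""
    for mode, signals in SIGNALS_BY_MODE.items():
        if signal in signals:
            return mode
    raise KeyError(f"unknown signal: {signal}")


print(mode_for("latency").value)
```

the point of tagging each signal with a mode is that the mode tells you where the evidence can come from before you design the experiment.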
we are working on making the upcoming
@evo__hq platform understand these modes natively.
not just for "agent evals" or "code",
but also for infra configs, prompts, workflows, model choices, routing policies, thresholds, graders, reward functions, RAG configs, and more.
the core idea is simple:
evo does not care whether the variant is code, a prompt, a model, a threshold, or a workflow.
evo is being built as the experimentation layer for the upcoming autoresearch wave.
as more agents start proposing changes across code, prompts, models, workflows, and infra, teams will need a common system to run the experiments, measure outcomes, apply guardrails, and decide what actually improved.
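to make "variant-agnostic" concrete, here is a minimal sketch of a decision step that never looks at what kind of thing the variant is. everything here is an assumption for illustration, not evo's actual design: `Variant`, `Guardrail`, and `decide` are hypothetical names, the metric values are made up, and a real system would also need a significance check rather than a plain comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    kind: str  # "code", "prompt", "model", "threshold", "workflow", ...
    name: str


@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_value: float  # candidate must not exceed this


def decide(variant, baseline, candidate, primary, guardrails):
    """Return (variant name, verdict). The logic never branches on variant.kind."""
    # Guardrails first: any breach rejects the variant outright.
    for g in guardrails:
        if candidate[g.metric] > g.max_value:
            return (variant.name, f"reject: {g.metric} guardrail breached")
    # Then the primary metric: higher is treated as better in this sketch.
    if candidate[primary] > baseline[primary]:
        return (variant.name, "ship")
    return (variant.name, "keep baseline")


# Made-up numbers, purely illustrative.
result = decide(
    Variant(kind="prompt", name="support-prompt-v2"),
    baseline={"ticket_resolved_rate": 0.71, "latency_ms": 820.0},
    candidate={"ticket_resolved_rate": 0.74, "latency_ms": 790.0},
    primary="ticket_resolved_rate",
    guardrails=[Guardrail("latency_ms", 1000.0)],
)
print(result)
```

swapping `kind="prompt"` for `"model"` or `"threshold"` changes nothing in `decide`, which is the whole idea: only the metrics and guardrails matter.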
evo's platform is being positioned for this upcoming wave.