We make coding agents explain themselves.

Joined July 2025
Pinned Tweet
Coming soon.
“red team” sounds cool. “blue team” sounds cool. so you’d think “purple team” would be very cool. alas, it is a Slack thread with 19 unresolved comments.
Your token spend was a number you could've gated on. Instead it's a number you get to explain.
everyone's like "how big is your team" brother. it's one agent. it's opening PRs against itself. i haven't written code in four months. leave me alone
15 Nov 2025
the best AI coding assistant might be the one that works on a plane
25 Oct 2025
Every release is a high‑wire act. Instead of praying for calm winds, build a net. EvalOps ties your policies, metrics and audits into a mesh that lets you scale without falling.
22 Oct 2025
We open-sourced Nimbus – Firecracker-based CI for AI workloads. Multi-tenant isolation, RBAC, audit logs.
19 Oct 2025
EvalOps is where evaluations meet operations — and security is no exception. “keep” shows how device posture, SSO, and OPA policies can be continuously tested and traced like any other system. Run it, break it, measure it. github.com/evalops/keep
17 Oct 2025
Agents are already writing your code. The question isn't "should we use them?" It's "how do we ship them without surprises?" Provenance gives you a ledger. Every line. Every agent. Every risk. Measurable. github.com/evalops/provenanc…

15 Oct 2025
We’re open-sourcing Smith — the Firecracker-based CI runner that powers EvalOps. Why rebuild Blacksmith? Because eval gating needs specialized infra — and we’re not forcing you onto our cloud. Run evals on EvalOps Cloud or your own. github.com/evalops/smith

9 Oct 2025
I'm told we're doing awards now?
4 Oct 2025
Everyone wants to move fast. @EvalOpsDev makes sure you don’t break trust along the way. Governed AI releases start here.
Shipped a new home for @EvalOpsDev. No fluff, just governed AI releases. Check it out -> evalops.dev
2 Oct 2025
🔥 Just dropped an evaluation‑driven LoRA loop built on Tinker from @thinkymachines! It trains, benchmarks & iterates until your model meets the mark. It auto‑spots weaknesses, spawns targeted LoRA jobs & tracks improvements. Proof‑of‑concept repo: github.com/evalops/tinker-ev…

30 Sep 2025
Sick of yak-shaving to get a clean Transformers setup? We built a stack that just works: PyTorch, HF Transformers, Hydra configs, FastAPI serving, Prometheus, plus vLLM, LoRA, flash-attn, and bitsandbytes. Reproducible. Dockerized. CI/CD baked in. github.com/evalops/stack

30 Sep 2025
Developer resumes are frozen in time. GitHub tells the real story. 7k commits, 1.4M lines → now that’s a holographic trading card worth flexing. 🚀 cards.evalops.dev

EvalOps retweeted
LLM vendor: “Just quantization.” Reality: reward-hacked code, broken workflows, lost week. Companies: “nbd.” Users: 🙃🔥 Making this a thing of the past.
27 Sep 2025
All of us have been dazzled by large language models’ ability to spit out code, fix bugs, or draft boilerplate. But when you put that code into production, every hidden bug is a potential outage, compliance fine, or security hole. And today’s AI tools leave you guessing.
27 Sep 2025
This transforms AI codegen from a toy that produces drafts into a partner you can trust to do real work.
27 Sep 2025
Interested? DM for early access.