Your CI says WHAT broke. Testhide says WHY. 8 AI models diagnose every build: root cause, flakiness, log patterns. Self-hosted CI/CD.

Joined May 2026
Pinned Tweet
Your AI agent passed 41 unit tests. The eval suite was green. Your teammate merged and went home. By Monday it was quietly denying refunds it should approve, in ~1 of 20 real conversations. No error. No alarm. Nothing "failed." Why green CI lies about agents 🧵
Google ran the numbers: 84% of their test failures that flip from pass→fail are flaky, not real bugs. Flaky tests are the tax almost every CI pays and almost no CI fixes. What flaky tests actually cost, and how to kill them: a thread 🧵
The better question isn't "re-run or mute?" It's "is this red signal or noise?" Testhide answers it with a Flakiness Predictor: gradient-boosted trees over 41 features, scoring each failure by its history (top signal: recent fail-rate) and build context. Real regression → block the PR. Noise → smart-quarantine. 👇
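A minimal sketch of the two pieces named above, not Testhide's actual code: one history feature (recent fail-rate) and the block-or-quarantine decision. The function names, the 20-run window, and the 0.8 threshold are all assumptions for illustration; the flakiness score itself is assumed to come from a trained model that is not shown here.

```python
def recent_fail_rate(outcomes, window=20):
    """Fraction of failures in the last `window` runs.

    `outcomes` is a list of booleans, True = pass, most recent last.
    The window size is a hypothetical choice, not a documented value.
    """
    recent = outcomes[-window:]
    if not recent:
        return 0.0
    return sum(1 for passed in recent if not passed) / len(recent)


def gate(flakiness_score, threshold=0.8):
    """Map a model score in [0, 1] to a pipeline action.

    A high score means "probably noise", so the failure is quarantined;
    anything below the (assumed) threshold is treated as a real regression.
    """
    return "quarantine" if flakiness_score >= threshold else "block"


history = [True, False, True, False]
print(recent_fail_rate(history))  # one of many features fed to the model
print(gate(0.93), gate(0.12))
```

In a real setup the score would be produced by the classifier from all of its features, and the threshold would be tuned against the cost of a missed regression.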
Smart-quarantine = the test still runs and reports, it just doesn't gate the merge, so coverage stays and red stays trustworthy. And we watch the model's own drift (PSI) so it doesn't rot. Not blind rerun. Not raw history. Prediction, in-pipeline, self-hosted. → testhide.com Tests fail. AI explains why.
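For readers who haven't met PSI (Population Stability Index): it compares how a value was distributed in a reference window against how it is distributed now. A rough sketch using the standard formula, sum over bins of (actual - expected) * ln(actual / expected). The equal-width binning, the bin count, and the epsilon floor for empty bins are assumptions here, and both inputs are assumed non-empty.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    Bin edges are taken from the `expected` (reference) sample; values in
    `actual` that fall outside that range are clamped into the end bins.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant samples

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(psi(reference, reference))  # identical samples give 0.0
print(psi(reference, shifted))    # a shifted sample gives a large value
```

A commonly cited rule of thumb reads PSI above roughly 0.25 as a significant shift, but the right alert level depends on the feature and the binning.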
Your CI is green. Your ML models are quietly going wrong anyway. A model trained on last quarter's data drifts as the real world shifts under it: accuracy decays, retraining never fires, and not one test turns red. The blind spot almost no CI/CD covers 🧵
Why Testhide is different:
• plain CI (Jenkins, GitHub Actions, TeamCity) → no model monitoring at all
• eval tools (Braintrust, Langfuse) → a separate dashboard, not your pipeline, no per-model input drift or retrain trigger
We run it in-pipeline, per real input, self-hosted.
The payoff: a red card is the model that actually drifted, and only that one retrains. No fan-out, no wasted GPU, no silent decay. Drift monitoring is only as good as the signal you measure. So measure the real one, per model. Self-hosted, runs daily → testhide.com Tests fail. AI explains why.
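The "only that one retrains" idea can be sketched as a per-model filter. This is an illustration under assumptions, not the product's logic: the model names, the drift scores, and the 0.25 threshold are all made up for the example.

```python
def models_to_retrain(drift_by_model, threshold=0.25):
    """Return the names of models whose own input drift exceeds the threshold.

    `drift_by_model` maps a model name to its drift score (for example a
    PSI value computed on that model's real inputs). Models below the
    threshold are left alone, so there is no fan-out retraining.
    """
    return sorted(
        name for name, score in drift_by_model.items() if score > threshold
    )


# Hypothetical daily scores for three models.
daily_scores = {"fraud": 0.31, "churn": 0.04, "ranker": 0.12}
print(models_to_retrain(daily_scores))
```

Keeping the decision per model is what avoids retraining everything when a single input distribution moves.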
Your AI agent passed 41 unit tests. The eval suite was green. Your teammate merged and went home. By Monday it was quietly denying refunds it should approve, in ~1 of 20 real conversations. No error. No alarm. Nothing "failed." Why green CI lies about agents 🧵
But your golden set goes stale the day you freeze it. So the same models keep watching prod. An OOD detector flags inputs that look like nothing it's seen → you turn that into a fixture → the next PR's eval defends against a failure your users found for you. The loop closes.
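The loop above, sketched with a deliberately simple stand-in: a z-score on one numeric feature plays the role of the OOD detector, which the thread does not describe. All names, the single-feature setup, and the threshold of 3.0 are assumptions.

```python
import statistics


def fit_reference(values):
    """Summarise the reference (golden set) feature values."""
    return statistics.mean(values), statistics.pstdev(values)


def is_ood(value, mean, std, z_threshold=3.0):
    """Flag a value that sits far outside the reference distribution."""
    if std == 0:
        return value != mean
    return abs(value - mean) / std > z_threshold


def harvest_fixtures(prod_inputs, featurize, mean, std, golden_set):
    """Append OOD production inputs to the golden set and return what was added."""
    added = []
    for item in prod_inputs:
        if is_ood(featurize(item), mean, std) and item not in golden_set:
            golden_set.append(item)
            added.append(item)
    return added


# Hypothetical example: input length is the only feature.
golden = ["refund order 1"]
mean, std = fit_reference([10, 12, 11, 9, 10, 11, 12, 10])
new = harvest_fixtures(["a" * 11, "x" * 200], len, mean, std, golden)
print(len(new), len(golden))
```

A production detector would look at richer signals than length, but the shape is the same: flag, turn into a fixture, and let the next PR's eval run against it.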
Honest results from our own pipeline (no invented logos):
🟢 silently-wrong outputs now blocked at the PR
🟢 2 of our bigger incidents would've been caught earlier by OOD
Eval-in-CI and post-deploy monitoring, one system, 100% self-hosted. docker compose up → testhide.com Tests fail. AI explains why.