A protocol we operate infrastructure for had a 9.5-hour outage last quarter.
The actual fix took a few minutes. The rest of the nine was an alert that sat unread from 00:30 until someone saw it at 09:00. The technical cause: 250GB of old database schemas accumulated over months. A cleanup script existed. It just hadn't been scheduled.
Both root causes converge to one missing branch in the decision tree: who owned this, by when, with what threshold.
Engineering rarely fails at engineering. It fails at ownership boundaries nobody drew. Code shipped without a clear answer to "who pages at 00:30?" is code preserving the absence of that decision.
That's the work a Sprint Zero exists to surface. Before it shows up at 00:30.
1/ A client on working with us:
"The abyss between a supplier that delivers what you asked for, and one that asks why you want that in the first place, understands what you actually need, and builds that instead."
3/ We can't work like that. Half the time the most useful thing we do early is push back on what we were hired to build.
The answer might be smaller, a different shape, or upstream of a problem you haven't named yet.
4/ That feels like friction at the start. It saves you from the version you'd be paying to rebuild a year in.
If the thing you want built feels like a symptom, the real project is probably a different one.
4/ Most of what software costs lives in the gap. Not writing the code. The second pass nobody scoped, the one that turns "it works" into "it works for people who actually use it."
Most people treat Discovery as prep work.
What it actually produces is a decision on what to build, why, and what the architecture has to support.
Skip it, and those decisions still get made. Just later, under pressure.
You can ship on time, follow every requirement, and still miss the point entirely.
The spec was right. The build was clean. The problem is still there.
What changed: a team that thought about what the feature meant for the whole platform before touching the code.
Not after the scope was locked. Before.
Most teams don't fail because of bad engineers or bad PMs.
They fail at the boundary between the two, where nobody is paying attention!
Full post below 👇