The Night the Ledger Held Its Breath
2:17 AM, Tuesday. The payments cluster was quiet. Too quiet, honestly.
Buried in our Spring Boot mesh sat a service called Ledger Orchestrator, the middleman for high-value transactions between risk engines, fraud checks, and settlement. It made most of its calls through Spring's RestClient.
That night, a downstream fraud service quietly started dragging its feet. No 500s, no alerts, no pager. Just responses that took a little longer, and then a little longer than that.
The Orchestrator didn't know any better. It kept dialing out, kept opening sockets, kept trusting the network. The catch? Nobody had ever set a connect timeout on the client.
RestClient was happy to sit on a half-open TCP handshake more or less forever. Threads stacked up. The pool jammed. CPU looked fine on the dashboards, which was the cruel part. From the outside everything looked healthy. Inside, the service was running out of air.
By 2:26 the latency graphs were vertical. A $2M transfer was just stuck. Retries kicked in, because retries always do, and each one spawned another connection that was also going to hang.
The system wasn't failing fast. It was failing in slow motion, and not telling anyone.
Ralph got the page. He'd seen this shape before. No exception storms, just quiet paralysis. He opened the config, scrolled for thirty seconds, and there it was: read timeout set, connect timeout missing. The service was technically allowed to wait forever before it even started talking.
The fix on the spot? Connect timeout 2s, read timeout 5s. Retries capped with backoff. Rolling restart.
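For anyone who wants to see roughly what that fix looks like in code, here is a minimal sketch, not the Orchestrator's actual config. It uses the JDK's own java.net.http.HttpClient so it runs without Spring on the classpath. The 2s and 5s values come from the story. Everything else is an assumption made for illustration: the class name LedgerTimeouts, the three-attempt limit, and the 200ms base and 2s cap on the backoff. Note that the JDK's per-request timeout bounds the wait for a response, which is the closest JDK analogue of a read timeout rather than an exact match.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.concurrent.Callable;

final class LedgerTimeouts {

    // Values from the incident fix: connect 2s, read 5s.
    static final Duration CONNECT_TIMEOUT = Duration.ofSeconds(2);
    static final Duration READ_TIMEOUT = Duration.ofSeconds(5);

    // Hypothetical retry policy. The story only says "capped with backoff".
    static final int MAX_ATTEMPTS = 3;
    static final Duration BASE_DELAY = Duration.ofMillis(200);
    static final Duration MAX_DELAY = Duration.ofSeconds(2);

    // Bounds how long the client waits for the TCP connection to be established.
    // In a Spring 6.1+ app this client could be wrapped in JdkClientHttpRequestFactory
    // and passed to RestClient.builder().requestFactory(...). That wiring is left out
    // so the sketch runs on the JDK alone.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(CONNECT_TIMEOUT)
                .build();
    }

    // Bounds how long a single request waits for a response.
    static HttpRequest fraudCheckRequest(URI uri) {
        return HttpRequest.newBuilder(uri)
                .timeout(READ_TIMEOUT)
                .GET()
                .build();
    }

    // Exponential backoff: base, 2x base, 4x base, ... never more than MAX_DELAY.
    static Duration backoffDelay(int attempt) {
        long factor = 1L << Math.min(Math.max(attempt - 1, 0), 20);
        Duration delay = BASE_DELAY.multipliedBy(factor);
        return delay.compareTo(MAX_DELAY) > 0 ? MAX_DELAY : delay;
    }

    // Retries I/O failures (JDK HTTP timeouts are IOExceptions) up to MAX_ATTEMPTS,
    // sleeping between attempts, then rethrows the last failure.
    static <T> T callWithRetry(Callable<T> call) throws Exception {
        IOException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e;
                if (attempt < MAX_ATTEMPTS) {
                    Thread.sleep(backoffDelay(attempt).toMillis());
                }
            }
        }
        throw last;
    }
}
```

The point of the cap is that a hanging dependency costs a bounded number of bounded waits per call, instead of an open-ended pile of sockets. Blindly retrying a money-moving call is its own risk, so a real service would likely retry only idempotent requests.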
Almost instantly, hung calls started failing cleanly, circuit breakers tripped, load shed itself, and the platform exhaled.
A connect timeout isn't a nice-to-have. It's the line where your service admits the network might not be there. Skip it, and you're assuming every hop between every region and every cloud is going to behave. In finance, that assumption will bite you. It's just a question of when.
RestClient is a nice piece of engineering. It won't save you from your own defaults.
Ralph's writeup the next morning had one line everyone remembers:
"Systems don't fail because of big bugs. They fail because of small waits."
In distributed systems, the scariest thing isn't an error in the logs. It's silence.