GitHub went down for ~70 minutes yesterday. Interestingly, the root cause was not a database (the usual suspect), but an auth layer that was returning 401s. Outages are never good, but we as engineers can learn a thing or two from them. Here's a quick dissection...
So, about 15% of API traffic started getting "Unauthorized" responses for requests that were perfectly valid. The credentials were fine. But the 'infra' was lying. Here is the part that makes this interesting.
Every well-behaved HTTP client reauthenticates when it receives a 401. So thousands of apps did exactly what they were supposed to do - and that made things worse.
Every client that got a false 401 (the root cause of those 401s has not been mentioned yet) kicked off a token refresh, which piled more load onto an already struggling auth layer. Here is my key takeaway...
When a 401 comes back, we typically reauthenticate, and we should. But if we get 10 consecutive 401s on a token that was just refreshed, reauthenticating again is not the answer. That is a circuit-breaker moment - back off, raise an alert, and stop hammering the system.
Retrying blindly in an auth-failure loop could turn an incident into a full outage. So, this is something you can account for when building your next system :)
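To make the circuit-breaker idea concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the names `AuthCircuitBreaker` and `call_with_reauth`, the `send_request` and `refresh_token` callables, and the 60-second cooldown are my assumptions, not anything from the incident. The threshold of 10 is the number from the paragraph above.

```python
import time


class AuthCircuitOpen(Exception):
    """Repeated 401s on fresh tokens: the auth layer itself is probably unhealthy."""


class AuthCircuitBreaker:
    # Hypothetical helper: threshold and cooldown are illustrative defaults.
    def __init__(self, threshold=10, cooldown_seconds=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.consecutive_401s = 0
        self.opened_at = None  # None means the circuit is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the circuit, but leave the counter one
            # short of the threshold so a single further 401 re-opens it.
            self.opened_at = None
            self.consecutive_401s = self.threshold - 1
            return False
        return True

    def record_success(self):
        self.consecutive_401s = 0
        self.opened_at = None

    def record_401(self):
        """Count a 401; return True if this one tripped the breaker."""
        self.consecutive_401s += 1
        if self.consecutive_401s >= self.threshold:
            self.opened_at = self.clock()
            return True
        return False


def call_with_reauth(send_request, refresh_token, token, breaker, on_alert=print):
    """send_request(token) returns an HTTP status code; refresh_token() returns a new token.

    Both callables are placeholders for whatever your HTTP client provides.
    """
    while not breaker.is_open():
        status = send_request(token)
        if status != 401:
            breaker.record_success()
            return status
        if breaker.record_401():
            break  # too many 401s in a row: stop refreshing
        token = refresh_token()  # the normal, well-behaved reaction to a 401
    on_alert("auth circuit open: repeated 401s on freshly refreshed tokens")
    raise AuthCircuitOpen("backing off instead of refreshing again")
```

While the circuit is open, `call_with_reauth` fails fast without sending anything, so the client stops adding load to the auth layer. A real implementation would probably also want jittered backoff between refreshes and one breaker shared across threads or workers - both are left out here to keep the sketch short.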
Hope this helps.