The 9th Nine

The Retry Storm: How We Took Down Our Own Auth Service

#production #reliability #distributed-systems #post-mortem

It was 2:00 AM on a Tuesday. The on-call pager went off with a high-urgency alert: Auth Service Latency > 5000ms.

By the time I opened my laptop, the dashboard was a sea of red. The Auth Service wasn’t just slow; it was dead. 500 errors were spiking, and CPU usage across all pods was pinned at 100%.

“Just scale it up,” was the first instinct. We doubled the replica count. The CPU usage on the new pods instantly hit 100%. We doubled it again. Same result. The database CPU was oddly low. The network throughput was massive.

We weren’t being attacked by a hacker. We were attacking ourselves.

The Architecture

Our system was fairly standard:

  1. Mobile Clients (~50k active)
  2. API Gateway (Routes requests)
  3. Auth Service (Validates tokens via Redis/DB; see the sketch below)
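
A quick, hypothetical sketch of what “validates tokens via Redis/DB” usually means: check a Redis cache first, fall back to the database on a miss. The layout, names, and TTL here are assumptions for illustration, not our actual service code.

import redis

redis_client = redis.Redis(host="localhost", port=6379)

def check_token_in_db(token: str) -> bool:
    # Placeholder for the real database lookup
    return True

def validate_token(token: str) -> bool:
    key = f"token:{token}"
    cached = redis_client.get(key)         # fast path: cache hit
    if cached is not None:
        return cached == b"1"
    ok = check_token_in_db(token)          # slow path: hit the database
    redis_client.setex(key, 300, "1" if ok else "0")  # cache the verdict for 5 minutes
    return ok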

The Investigation

We looked at the logs, but they were scrolling so fast they were unreadable. We sampled the traffic and noticed something terrifying: the Request Rate was 20x higher than normal peak traffic.

But the number of users hadn’t changed.

I grabbed a trace ID from one of the failed requests and searched for it in the logs. The same request showed up over and over again.

Our mobile client had a retry logic library installed. On any 5xx error, it would retry 3 times.
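
The library’s source isn’t reproduced here, but its behavior boiled down to roughly the loop below; the function name, endpoint handling, and delay value are illustrative assumptions.

import time
import requests

MAX_RETRIES = 3
BASE_DELAY = 1.0  # seconds, illustrative

def call_auth(url: str, token: str) -> requests.Response:
    for attempt in range(MAX_RETRIES + 1):
        resp = requests.post(url, json={"token": token}, timeout=5)
        if resp.status_code < 500:
            return resp                            # success or 4xx: stop retrying
        if attempt < MAX_RETRIES:
            time.sleep(BASE_DELAY * 2 ** attempt)  # plain exponential backoff, no jitter
    return resp                                    # give up after 3 retries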

The Root Cause: The Thundering Herd

The incident started with a minor network blip that caused the DB to stall for 500ms. This caused a few hundred requests to time out and return 500s.

The clients saw the 500s and immediately retried. This doubled the load on the already struggling service.

Because the service was struggling, those retries also failed. So the clients retried again. And again.

We had created a positive feedback loop. The more the server failed, the more traffic the clients sent. We had accidentally built a Distributed Denial of Service (DDoS) cannon and pointed it at our own face.
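
To put rough numbers on it (illustrative figures, not our incident metrics): with 3 retries per failure, every original call that fails can turn into 4 attempts, and those attempts land on top of fresh traffic that never stops arriving.

retries_per_failure = 3
normal_peak_rps = 2_000        # assumed baseline, for illustration only

# When essentially every call fails, one original call costs 1 + 3 attempts:
amplification = 1 + retries_per_failure
print(amplification)                       # 4

# Even this 4x, applied to peak traffic, buries the service:
print(normal_peak_rps * amplification)     # 8000 attempts/s from the library alone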

The Fix: Jitter and Circuit Breaking

We couldn’t fix the code on 50,000 phones in real-time. We had to stop the bleeding at the gateway.

Step 1: Circuit Breaking (Immediate)

We configured the API Gateway to “short-circuit” requests to the Auth Service: instead of forwarding each request and burning resources on it, the gateway immediately returned a 503 Service Unavailable.

This allowed the Auth Service to catch its breath. The queues drained. CPU dropped. We slowly closed the circuit.
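
For us this was gateway configuration rather than application code, but the pattern itself looks roughly like the sketch below; the thresholds, timeout, and class are illustrative.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: fail fast (this became our 503)")
            self.opened_at = None        # half-open: let a probe request through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                # a success closes the circuit again
        return result

Failing fast is the point: the gateway answers instantly instead of tying up resources on a service that cannot respond, which is what let the queues drain.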

Step 2: Jitter (Long Term)

The client retry logic was:

wait_time = base_delay * (2 ** retry_count) # Exponential Backoff

Exponential backoff spaces retries further and further apart, but if 1,000 clients fail at the same time, they all retry at t+1s, then t+2s, then t+4s. The herd stays perfectly synchronized.

We changed the clients to use Jitter:

import random
base, cap = 1.0, 30.0  # example values: initial delay and maximum delay, in seconds
temp = min(cap, base * 2 ** retry_count)        # capped exponential backoff
sleep = temp / 2 + random.uniform(0, temp / 2)  # "equal jitter": half fixed, half random

This desynchronizes the herd.
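
A quick way to see the effect (illustrative numbers, not incident data): simulate one retry wave (retry_count = 2) for 1,000 clients that all failed at the same moment, with and without jitter.

import random

base, cap, retry_count = 1.0, 30.0, 2
clients = 1_000

without_jitter = [base * 2 ** retry_count for _ in range(clients)]
temp = min(cap, base * 2 ** retry_count)
with_jitter = [temp / 2 + random.uniform(0, temp / 2) for _ in range(clients)]

print(len(set(without_jitter)))                       # 1 distinct retry time: one big spike at t+4s
print(round(max(with_jitter) - min(with_jitter), 2))  # ~2.0: the same load spread across a 2-second window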

The Lesson

Resilience isn’t just about server capacity or code efficiency. In a distributed system, your clients are part of your infrastructure.

If you don’t control how your clients behave when things go wrong, you can’t guarantee system stability. Always use Jitter with your Retries, and put aggressive Circuit Breakers in front of calls to your upstream dependencies.