Postmortem · 2024 · production

Redis KEYS and the 19-minute outage

A 19-minute full-platform outage, traced to blocking Redis KEYS calls exhausting the Tomcat/JDBC thread pool. What happened, what fixed it, and the reliability program that followed.

Aurora · Redis · JDBC · Postmortem · SLOs

Context

The platform is a Swedish multi-tenant event-management SaaS with customers across Europe: ECS Fargate services behind ALB, Aurora PostgreSQL with schema-per-tenant, ElastiCache Redis, Amazon MQ, and tenant routing via a DynamoDB registry. I run it as sole platform owner, so the engineer diagnosing this outage and the one accountable for the platform were the same person.

Impact

In 2024 the platform suffered a 19-minute full outage in production. Not one slow endpoint, not one degraded tenant: the whole platform, for 19 minutes.

Root cause

The trigger was blocking Redis KEYS calls. KEYS walks the entire keyspace in a single command, and Redis executes commands one at a time, so every command queued behind a KEYS call waits until it finishes. Those waits backed up into the application tier and exhausted the Tomcat/JDBC thread pool: requests held threads while they waited, new requests found no threads left, and the platform stopped answering.
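
The difference between one full-keyspace pass and cursor-based iteration (the pattern Redis exposes as SCAN) can be sketched with a toy in-memory model. This is an illustration only, not a Redis client and not the platform's code: the class CursorScanSketch, its methods, and the key names are hypothetical, and a real SCAN cursor is an opaque value with COUNT acting only as a hint.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of cursor-based iteration over a keyspace; hypothetical, not a Redis client. */
final class CursorScanSketch {

    /** One page of matches plus the cursor for the next call (0 means the scan is complete). */
    static final class Page {
        final int nextCursor;
        final List<String> keys;

        Page(int nextCursor, List<String> keys) {
            this.nextCursor = nextCursor;
            this.keys = keys;
        }
    }

    /** Inspects at most `count` keys starting at `cursor`, so each call does bounded work. */
    static Page scan(List<String> keyspace, int cursor, int count, String prefix) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        int end = Math.min(cursor + count, keyspace.size());
        List<String> matched = new ArrayList<>();
        for (int i = cursor; i < end; i++) {
            String key = keyspace.get(i);
            if (key.startsWith(prefix)) {
                matched.add(key);
            }
        }
        // A cursor of 0 signals completion, mirroring the SCAN convention.
        return new Page(end >= keyspace.size() ? 0 : end, matched);
    }

    /** Drives the cursor to completion; other work could interleave between pages. */
    static List<String> scanAll(List<String> keyspace, int count, String prefix) {
        List<String> out = new ArrayList<>();
        int cursor = 0;
        do {
            Page page = scan(keyspace, cursor, count, prefix);
            out.addAll(page.keys);
            cursor = page.nextCursor;
        } while (cursor != 0);
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical key names, chosen only for the demo.
        List<String> keyspace = Arrays.asList("tenant:1:a", "cache:x", "tenant:2:b", "cache:y");
        System.out.println(scanAll(keyspace, 2, "tenant:"));
    }
}
```

The point of the model is the shape of the work: a KEYS-style call is the whole loop executed as one uninterruptible step, while the cursor version caps each step at `count` keys, so other callers can be served between pages.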

Resolution

Two fixes went in. The first was connection-pool checkout timeouts, so that a request that cannot get a connection fails fast instead of holding a thread indefinitely. The second was tuning RDS parameters.
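
A checkout timeout can be sketched with nothing but a semaphore. This is a minimal model under stated assumptions, not the platform's configuration: the class BoundedCheckoutSketch, its exception type, and the 100 ms value are hypothetical, and the "pool" hands out permits rather than real JDBC connections.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy pool that models checkout timeouts with permits; it holds no real connections. */
final class BoundedCheckoutSketch {

    /** Thrown when no slot frees up within the checkout timeout. */
    static final class CheckoutTimeoutException extends RuntimeException {
        CheckoutTimeoutException(String message) {
            super(message);
        }
    }

    private final Semaphore permits;
    private final long timeoutMillis;

    BoundedCheckoutSketch(int poolSize, long timeoutMillis) {
        this.permits = new Semaphore(poolSize, true); // fair: waiters are served in arrival order
        this.timeoutMillis = timeoutMillis;
    }

    /** Waits at most timeoutMillis for a slot, then fails instead of blocking indefinitely. */
    void checkout() {
        try {
            if (!permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new CheckoutTimeoutException("no connection within " + timeoutMillis + " ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag for the caller
            throw new CheckoutTimeoutException("interrupted while waiting for a connection");
        }
    }

    /** Returns a slot to the pool; callers would normally do this in a finally block. */
    void release() {
        permits.release();
    }

    int available() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        BoundedCheckoutSketch pool = new BoundedCheckoutSketch(1, 100);
        pool.checkout(); // a slow request holds the only slot
        try {
            pool.checkout(); // a second request waits up to 100 ms, then gives up
        } catch (CheckoutTimeoutException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
        pool.release(); // the slow request finishes and returns its slot
    }
}
```

Production pools expose the same idea as a setting, for example `connectionTimeout` in HikariCP or `maxWait` in the Tomcat JDBC pool. Which pool and which value apply to this platform is not stated here.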

Follow-through

The incident fix was the small part. From the postmortem I drove a 68-task reliability program across 11 epics and 7 sprints to prevent recurrence.

Lessons

  • A cache is a production dependency, not a side detail. One blocking command pattern in Redis was enough to take the whole platform down.
  • Bound every wait. The durable part of the fix was timeouts at the connection-pool checkout boundary: a slow dependency should fail one request, not absorb the entire thread pool.
  • The fix is not the postmortem. Preventing recurrence took a structured program (68 tasks, 11 epics, 7 sprints), not a single patch.

Every fact above (the 19 minutes, the KEYS calls, the thread pool, the 68 tasks) comes from the same data file that drives the rest of this site.

Sagar Budhathoki

Senior DevOps / SRE Engineer

Sole owner of the platform in this story. The plain-text career surface is at /work, and the interactive version lives at sagarbudhathoki.com.