Postmortem · 2024 · production
Redis KEYS and the 19-minute outage
A 19-minute full-platform outage, traced to blocking Redis KEYS calls exhausting the Tomcat/JDBC thread pool. What happened, what fixed it, and the reliability program that followed.
Aurora · Redis · JDBC · Postmortem · SLOs
Context
The platform is a Swedish multi-tenant event-management SaaS with customers across Europe: ECS Fargate services behind an ALB, Aurora PostgreSQL with schema-per-tenant, ElastiCache Redis, Amazon MQ, and tenant routing via a DynamoDB registry. I run it as sole platform owner, so the person diagnosing this outage was also the person accountable for the platform.
Impact
In 2024 the platform took a 19-minute full-platform outage in production. Not one slow endpoint, not one degraded tenant: the whole platform, for 19 minutes.
Root cause
The trigger was blocking Redis KEYS calls. KEYS scans the entire keyspace in one command, and because Redis executes commands on a single thread, every other command queues behind it until the scan finishes. Those waits backed up into the application tier and exhausted the Tomcat/JDBC thread pool: requests held threads while they waited, new requests found no threads left, and the platform stopped answering.
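The exhaustion mechanism can be reproduced in miniature. The following is a minimal sketch, not the platform's code: a fixed-size executor stands in for the request thread pool, and a latch stands in for the Redis call that does not return. The class name, method name, pool sizes, and wait times are all assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical demo: shows how blocked requests starve a fixed-size thread pool. */
final class PoolExhaustionDemo {

    /**
     * Starts `blockedRequests` requests that wait on a slow dependency, then
     * submits one fresh request and reports whether it was answered in time.
     */
    static boolean freshRequestAnswered(int workers, int blockedRequests, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        // Stand-in for the dependency call that does not return (assumption).
        CountDownLatch slowDependency = new CountDownLatch(1);
        try {
            for (int i = 0; i < blockedRequests; i++) {
                // Each of these requests holds a worker thread while it waits.
                pool.submit(() -> { slowDependency.await(); return null; });
            }
            // A fresh request that needs no dependency at all, only a free thread.
            Future<String> fresh = pool.submit(() -> "ok");
            try {
                fresh.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // no worker was free to run it
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            } catch (ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            slowDependency.countDown(); // release the blocked requests
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("all workers blocked: " + freshRequestAnswered(2, 2, 200));
        System.out.println("one worker free:     " + freshRequestAnswered(2, 1, 1000));
    }
}
```

With every worker blocked, the fresh request sits in the queue even though it never touches the slow dependency, which is the same shape as a whole platform going dark because of one command pattern.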
Resolution
Two fixes went in. The first was connection-pool checkout timeouts, so that a request that cannot get a connection fails fast instead of holding a thread indefinitely. The second was tuning RDS parameters.
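The idea behind a checkout timeout can be sketched with the standard library alone. This is a toy pool under stated assumptions, not the platform's configuration: `BoundedPool`, `CheckoutTimeoutException`, and the timeout values are hypothetical names and numbers chosen for illustration. Real pools expose the same idea as a setting, for example `connectionTimeout` in HikariCP or `maxWait` in the Tomcat JDBC pool.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical exception: no connection became free within the checkout timeout. */
final class CheckoutTimeoutException extends RuntimeException {
    CheckoutTimeoutException(String message) {
        super(message);
    }
}

/** Toy pool (an assumption for illustration) standing in for a real connection pool. */
final class BoundedPool<T> {
    private final BlockingQueue<T> idle;
    private final long checkoutTimeoutMillis;

    BoundedPool(List<T> connections, long checkoutTimeoutMillis) {
        // All connections start idle; capacity equals the pool size.
        this.idle = new ArrayBlockingQueue<>(connections.size(), false, connections);
        this.checkoutTimeoutMillis = checkoutTimeoutMillis;
    }

    /** Waits at most the configured timeout, then fails instead of blocking forever. */
    T checkout() {
        try {
            T connection = idle.poll(checkoutTimeoutMillis, TimeUnit.MILLISECONDS);
            if (connection == null) {
                throw new CheckoutTimeoutException(
                        "no connection available within " + checkoutTimeoutMillis + " ms");
            }
            return connection;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new CheckoutTimeoutException("interrupted while waiting for a connection");
        }
    }

    /** Returns a connection to the idle set so the next checkout can take it. */
    void release(T connection) {
        idle.offer(connection);
    }

    public static void main(String[] args) {
        BoundedPool<String> pool = new BoundedPool<>(List.of("conn-1"), 50);
        String held = pool.checkout();
        try {
            pool.checkout(); // pool is empty, so this waits about 50 ms and then fails
        } catch (CheckoutTimeoutException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
        pool.release(held);
    }
}
```

The design choice is that the failure is bounded and local: the one request that cannot get a connection gets an error after a known delay, and its thread goes back to serving other requests.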
Follow-through
The incident fix was the small part. From the postmortem I drove a 68-task reliability program across 11 epics and 7 sprints to prevent recurrence.
Lessons
- A cache is a production dependency, not a side detail. One blocking command pattern in Redis was enough to take the whole platform down.
- Bound every wait. The durable part of the fix was timeouts at the connection-pool checkout boundary: a slow dependency should fail one request, not absorb the entire thread pool.
- The fix is not the postmortem. Preventing recurrence took a structured program (68 tasks, 11 epics, 7 sprints), not a single patch.
Every fact above (the 19 minutes, the KEYS calls, the thread pool, the 68 tasks) comes from the same data file that drives the rest of this site.