DevOps / SRE Engineer · Threadcode Technologies Pvt. Ltd. · EventLogic, Swedish multi-tenant event-management SaaS · Lalitpur, Nepal
May 2023 - present
- Owner of the entire AWS platform end to end: ECS Fargate, Aurora PostgreSQL, ElastiCache Redis, and Amazon MQ in eu-north-1.
- Diagnosed and resolved a 19-minute full-platform outage caused by blocking Redis KEYS calls exhausting the Tomcat/JDBC thread pool. Added connection-pool checkout timeouts, tuned RDS parameters, then drove a 68-task reliability program across 11 epics and 7 sprints to prevent recurrence.
- Established a cross-region DR path where none existed: Aurora Global Database from eu-north-1 to eu-west-1, EFS and ECR replication, shared KMS keys, and a documented runbook for promotion.
- Automated tenant onboarding: one Python API call sets up schema-per-tenant on Aurora, wires SQS and EventBridge, creates ALB listener rules, provisions a CloudFront / S3 distribution, configures Route 53 records, and registers the tenant in DynamoDB.
- Stood up OneUptime on a K3s cluster in a separate region (eu-central-1) for status pages, uptime monitoring, on-call scheduling and incident management. Deliberately off the primary region so observability survives a primary-region outage.
- Run a self-managed three-node Elasticsearch cluster orchestrated with Terraform and Ansible. Separated the deploy and restart playbooks so a single config change cannot cascade across the cluster.
- Cut monthly AWS spend by removing orphaned NAT Gateways, adding S3 and DynamoDB gateway endpoints to drop data-transfer cost, setting log-retention policies on CloudWatch, and right-sizing EBS volumes.
- Technical Reviewer for Python for DevOps (Packt).