Sagar Budhathoki

Senior DevOps / SRE Engineer

5+ Years · Kathmandu, Nepal · open to remote

building ai agents for ops · open to remote senior roles

Selected work

eventlogic

multi-tenant SaaS · eu-north-1 · Running

Swedish multi-tenant event-management SaaS. Sole platform owner. ECS Fargate services behind ALB, Aurora PostgreSQL with schema-per-tenant, ElastiCache Redis, Amazon MQ. Tenant routing via DynamoDB registry. Customers across Europe.

Region
eu-north-1
Tenancy
schema-per-tenant

ECS Fargate · Aurora PostgreSQL · ElastiCache · Amazon MQ · DynamoDB · CloudFront
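
A minimal sketch of how registry-based tenant routing with schema-per-tenant can work. Assumptions, none of them from the actual platform: tenants are looked up by request hostname, the in-memory `REGISTRY` dict stands in for the DynamoDB table, and the names `Tenant`, `resolve_tenant`, and `search_path_sql` are hypothetical.

```python
import re
from typing import Dict, NamedTuple


class Tenant(NamedTuple):
    tenant_id: str
    schema: str


# Stand-in for the DynamoDB registry (assumption: keyed by hostname).
REGISTRY: Dict[str, Tenant] = {
    "acme.example.com": Tenant("acme", "tenant_acme"),
    "globex.example.com": Tenant("globex", "tenant_globex"),
}

# Schema names are interpolated into SQL, so restrict them to safe identifiers.
_SCHEMA_RE = re.compile(r"^[a-z_][a-z0-9_]*$")


def resolve_tenant(host: str, registry: Dict[str, Tenant] = REGISTRY) -> Tenant:
    """Map a Host header value (optionally with a port) to a tenant record."""
    key = host.split(":")[0].lower()
    try:
        return registry[key]
    except KeyError:
        raise LookupError(f"no tenant registered for host {key!r}") from None


def search_path_sql(tenant: Tenant) -> str:
    """Build the statement that pins a PostgreSQL session to the tenant schema."""
    if not _SCHEMA_RE.match(tenant.schema):
        raise ValueError(f"unsafe schema name: {tenant.schema!r}")
    return f'SET search_path TO "{tenant.schema}"'
```

Usage: `search_path_sql(resolve_tenant("acme.example.com"))` yields the statement to run on a connection before serving that tenant's request.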

dr-failover

cross-region disaster recovery · Running

Cross-region disaster recovery from eu-north-1 to eu-west-1. Built where none existed. Aurora Global Database for sub-second cross-region replication, EFS and ECR replication, shared KMS keys across regions. Documented runbook for promotion.

Primary
eu-north-1
Failover
eu-west-1

Aurora Global DB · EFS · ECR · KMS · Route 53

tenant-orch

provisioning service · Running

Python tenant-provisioning orchestrator. One API call sets up schema-per-tenant on Aurora, wires SQS and EventBridge, creates ALB listener rules, provisions a CloudFront / S3 distribution, configures Route 53 records, and registers the tenant in DynamoDB.

Per tenant
one call
Steps
6+

Python · FastAPI · Aurora · SQS · EventBridge · Route 53

reliability

incident · 19m outage fix · Running

Diagnosed a 19-minute full-platform outage caused by blocking Redis KEYS calls exhausting the Tomcat/JDBC thread pool. Added connection-pool checkout timeouts, tuned RDS parameters, and drove a 68-task reliability program across 11 epics and 7 sprints to prevent recurrence.

Outage
19 min
Tasks
68 / 11 epics

Aurora · Redis · JDBC · Postmortem · SLOs

oneuptime

self-hosted SRE platform · Running

Self-hosted OneUptime on K3s in eu-central-1 (separate region from primary). Status pages, uptime monitoring, on-call scheduling, incident management. Designed so observability survives a primary-region outage.

Region
eu-central-1
Surface
status / on-call

K3s · OneUptime · OpenTelemetry · Loki

otel

observability pipeline · Running

OpenTelemetry collector dual-exports metrics, logs, and traces to OneUptime and Loki at the same time. Consolidated fragmented monitoring into one observability stack. Grafana sits on top.

OpenTelemetry · Loki · Grafana · Prometheus

es-cluster

3-node Elasticsearch · Maintenance

Self-managed three-node Elasticsearch cluster orchestrated with Terraform and Ansible. Split deploy and split-restart playbooks so a single config change cannot cascade across the cluster.

Nodes
3
Deploy
split-restart

Elasticsearch · Terraform · Ansible

ci-cd

pipelines · 3 platforms · Running

CI/CD pipelines spanning Jenkins, GitLab CI, and AWS CodePipeline / CodeBuild. Targets include ECS, Lambda, CloudFront, and EC2 deployments. App and infra share pipeline patterns.

Platforms
3
Targets
ECS · λ · CF · EC2

Jenkins · GitLab CI · CodePipeline · CodeBuild · Docker

finops

AWS cost optimization · Maintenance

Cut monthly AWS spend by removing orphaned NAT Gateways, adding S3 and DynamoDB gateway endpoints to drop data-transfer cost, setting log-retention policies on CloudWatch, and right-sizing EBS volumes.

VPC Endpoints · CloudWatch Logs · EBS · NAT

hashnode-mcp

AI assistant integration · Completed

Open-source Model Context Protocol server that wires AI assistants like Claude into the Hashnode content API. The pattern carries into the broader agentic-DevOps work. (Note: Hashnode has since wound down public API access.)

Python · MCP · Hashnode API

github.com/sbmagar13/hashnode-mcp-server

vqgan-clip

text-to-image · 2021 · Completed

Multimodal text-to-image generation using VQGAN + CLIP architectures in PyTorch. From the AI/ML era of the career, kept here as an artifact.

PyTorch · CLIP · VQGAN · Python

github.com/sbmagar13/VQGAN-CLIP-Text-to-Image

War stories

Self-hosted SRE platform on K3s

2025 · eu-central-1

Kubernetes (K3s)

Stood up OneUptime on a K3s cluster in a separate region (eu-central-1) for status pages, uptime monitoring, on-call scheduling and incident management. Deliberately off the primary region so observability survives a primary-region outage.

CI/CD across three platforms

2023 to present

Docker

Containerised build and deploy pipelines on Jenkins, GitLab CI, and AWS CodePipeline / CodeBuild. Targets include ECS, Lambda, CloudFront and EC2. App and infra share pipeline patterns so a single change can flow through any of them.

Recovered a 19-minute platform outage

2024 · production

AWS

Diagnosed and resolved a 19-minute full-platform outage caused by blocking Redis KEYS calls exhausting the Tomcat/JDBC thread pool. Added connection-pool checkout timeouts, tuned RDS parameters, then drove a 68-task reliability program across 11 epics and 7 sprints to prevent recurrence.
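
Why KEYS hurts: Redis executes commands on a single thread, so one KEYS call over a large keyspace stalls every other client until it returns. The write-up above does not say what replaced those calls; a common alternative, sketched here as an assumption, is cursor-based SCAN, which walks the keyspace in small batches. `FakeRedis` and `iter_keys` are hypothetical names; the fake mirrors the `scan(cursor, match, count)` shape of redis-py so the sketch runs without a server.

```python
import fnmatch
from typing import Iterator, List, Tuple


class FakeRedis:
    """In-memory stand-in exposing the same scan() shape as redis-py."""

    def __init__(self, keys: List[str]) -> None:
        self._keys = sorted(keys)

    def scan(self, cursor: int = 0, match: str = "*", count: int = 10) -> Tuple[int, List[str]]:
        batch = self._keys[cursor:cursor + count]
        next_cursor = cursor + count
        if next_cursor >= len(self._keys):
            next_cursor = 0  # cursor 0 signals the iteration is complete
        return next_cursor, [k for k in batch if fnmatch.fnmatch(k, match)]


def iter_keys(client, pattern: str, batch_size: int = 100) -> Iterator[str]:
    """Yield matching keys batch by batch instead of in one blocking call."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch_size)
        yield from keys
        if cursor == 0:
            break
```

Each `scan` call does a bounded amount of work, so other commands interleave between batches rather than queueing behind one long scan.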

Built cross-region disaster recovery

2024

Aurora PostgreSQL

Established a cross-region DR path where none existed: Aurora Global Database from eu-north-1 to eu-west-1, EFS and ECR replication, shared KMS keys, and a documented runbook for promotion.

3-node Elasticsearch with split-restart

production

Terraform

Self-managed three-node Elasticsearch cluster orchestrated with Terraform and Ansible. Split deploy and split-restart playbooks so a single config change cannot cascade across the cluster.
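
One reading of split-restart, sketched here as an assumption rather than the actual playbook logic: restart one node at a time and continue only once the cluster reports green again. `rolling_restart`, `restart_node`, and `cluster_health` are hypothetical names; the two callables would be supplied by whatever talks to the real nodes.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    restart_node: Callable[[str], None],
    cluster_health: Callable[[], str],
    poll_seconds: float = 0.0,
    max_polls: int = 30,
) -> List[str]:
    """Restart nodes one by one, gating each step on a green cluster."""
    restarted: List[str] = []
    for node in nodes:
        # Never take a node down while the cluster is already degraded.
        if cluster_health() != "green":
            raise RuntimeError(f"cluster not green before restarting {node}; stopping")
        restart_node(node)
        # Wait for recovery before touching the next node.
        for _ in range(max_polls):
            if cluster_health() == "green":
                break
            time.sleep(poll_seconds)
        else:
            raise RuntimeError(f"{node} did not return to green; restarted so far: {restarted}")
        restarted.append(node)
    return restarted
```

The gate before and after each restart is what keeps a bad config from reaching more than one node.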

Tenant provisioning orchestrator

production

Python

One Python API call sets up schema-per-tenant on Aurora, wires SQS and EventBridge, creates ALB listener rules, provisions a CloudFront / S3 distribution, configures Route 53 records, and registers the tenant in DynamoDB.
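
The shape of such an orchestrator can be sketched as an ordered list of steps run behind a single entry point. This is a minimal illustration, not the production code: every function name, the `TenantContext` shape, and the placeholder outputs are hypothetical, and the step bodies only record what a real step would create instead of calling AWS.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class TenantContext:
    """State accumulated while provisioning one tenant (hypothetical shape)."""
    tenant_id: str
    completed: List[str] = field(default_factory=list)
    outputs: Dict[str, str] = field(default_factory=dict)


Step = Callable[[TenantContext], None]


def create_schema(ctx: TenantContext) -> None:
    ctx.outputs["schema"] = f"tenant_{ctx.tenant_id}"


def wire_messaging(ctx: TenantContext) -> None:
    ctx.outputs["queue"] = f"{ctx.tenant_id}-events"


def add_listener_rule(ctx: TenantContext) -> None:
    ctx.outputs["host_rule"] = f"{ctx.tenant_id}.example.com"


def create_distribution(ctx: TenantContext) -> None:
    ctx.outputs["bucket"] = f"{ctx.tenant_id}-static"


def configure_dns(ctx: TenantContext) -> None:
    ctx.outputs["dns"] = ctx.outputs["host_rule"]


def register_tenant(ctx: TenantContext) -> None:
    ctx.outputs["registered"] = "true"


# Registry goes last so routing only ever sees fully provisioned tenants
# (an assumption about ordering, following the order listed above).
STEPS: List[Tuple[str, Step]] = [
    ("schema", create_schema),
    ("messaging", wire_messaging),
    ("listener_rule", add_listener_rule),
    ("distribution", create_distribution),
    ("dns", configure_dns),
    ("registry", register_tenant),
]


def provision_tenant(tenant_id: str, steps: List[Tuple[str, Step]] = STEPS) -> TenantContext:
    """Run every step in order; stop at the first failure and report progress."""
    ctx = TenantContext(tenant_id)
    for name, step in steps:
        try:
            step(ctx)
        except Exception as exc:
            raise RuntimeError(
                f"step {name!r} failed; completed so far: {ctx.completed}"
            ) from exc
        ctx.completed.append(name)
    return ctx
```

Recording completed steps makes a failed run resumable or reversible, since the caller knows exactly which resources already exist.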

One observability surface

rolling

Grafana

Consolidated fragmented monitoring into one stack. Grafana over Prometheus, Loki and CloudWatch with per-cluster, per-namespace and per-tenant dashboards. Alert routing wired to OneUptime on-call.

Dual-export OTEL pipeline

2025

OpenTelemetry

OpenTelemetry collector dual-exports metrics, logs and traces to OneUptime and Loki at the same time. The duplication is the point: if a primary-region failure takes the main observability stack down, the OneUptime side still pages.

AI agents for DevOps work

2025 to present

Anthropic MCP

Self-learning track. Building MCP-based agents that wrap real DevOps tasks (log triage, runbook prompts, infra analysis) so Claude and similar assistants can drive them. Built an open-source Hashnode MCP server that wires AI assistants into the Hashnode content API. Current focus is the broader agentic-DevOps stack: MCP, LangGraph, local LLM inference.

Journey

  1. 2020

    AI / ML beginnings

    VolgAI, Genese Cloud Academy, IBZ Networks. Built AI chatbots with RASA (NLP), backend APIs with Django and Flask, RTSP/FFmpeg pipelines for CCTV image processing. Async work via Celery and RabbitMQ. AWS AI/ML Internship at Genese.

  2. 2021

    Bachelor + first DevOps role

    Graduated Bachelor in Computer Engineering from Tribhuvan University (IOE, Pokhara). Joined Cloudyfox Technology in September as DevOps Engineer. First Terraform / Terragrunt at scale across AWS.

  3. 2022

    Containers and pipelines

    Ran Kubernetes for containerized workloads at Cloudyfox. Built CI/CD on GitLab CI and Jenkins for both app and infra deploys. SysOps, Linux admin, OpenVPN, centralized logging with CloudWatch, ELK, Grafana.

  4. 2023

    Sole owner of a production platform

    Joined Threadcode Technologies as DevOps / SRE for EventLogic, a Swedish multi-tenant event-management SaaS. Owner of the entire AWS platform end to end: ECS Fargate, Aurora PostgreSQL, ElastiCache Redis, Amazon MQ in eu-north-1. Technical Reviewer for Python for DevOps (Packt).

  5. 2024

    Reliability + multi-region DR

    Diagnosed and resolved a 19-minute platform outage caused by blocking Redis KEYS calls exhausting the JDBC thread pool. Drove a 68-task reliability program across 11 epics and 7 sprints. Built cross-region DR from eu-north-1 to eu-west-1 with Aurora Global Database, EFS replication, and shared KMS keys.

  6. 2025

    Observability + first MCP work

    Deployed self-hosted OneUptime on K3s in eu-central-1 for status pages, uptime monitoring, and on-call. OpenTelemetry collector dual-exports to OneUptime and Loki so observability survives a primary-region outage. Shipped an open-source Hashnode MCP server that wires AI assistants like Claude into the Hashnode content API.

  7. 2026

    Agentic DevOps + platform hardening

    Still running the production platform end to end. Hardening the multi-region story, sharpening SLOs, and going deep on AI agents for DevOps work: MCP, LangGraph, local LLM inference, self-learning side projects that wrap real ops tasks. Shipped this 3D portfolio as the public face of the practice. Open to remote senior DevOps / SRE roles.

Skills

Infrastructure

Route 53 (4y), VPC Networking (4y), Terraform (4y), Terragrunt (3y), CloudFormation (3y), AWS CDK (2y), Ansible (4y), Docker (4y), Kubernetes (K3s) (3y)

CI/CD

GitLab CI (4y), Jenkins (4y), CodePipeline (3y), GitHub Actions (3y)

Cloud

AWS (4y), ECS Fargate (3y), AWS Lambda (4y), Amazon MQ (2y), CloudFront (4y), S3 (4y), API Gateway (3y), EFS (2y), ECR (4y)

Databases

Aurora PostgreSQL (3y), ElastiCache (Redis) (3y), DynamoDB (3y), AWS DMS (2y), PostgreSQL (4y), Redis (3y)

Monitoring

Prometheus (3y), Grafana (3y), Loki (2y), OpenTelemetry (2y), AWS CloudWatch (4y), OneUptime (1y), ELK Stack (4y), Elasticsearch (4y)

Development

Python (5y), FastAPI (3y), Django (3y), Flask (3y), Bash (5y), JavaScript (4y)

Security

AWS IAM (4y), AWS KMS (4y), Secrets Manager (3y), AWS Inspector (2y), CloudTrail (4y), OpenVPN (3y)

AI / ML

Anthropic MCP (1y), LangGraph (1y), LangChain (1y), Local LLM Inference (1y), PyTorch (2y), RASA (2y)

Operating Systems

Arch Linux (5y), Ubuntu (6y)

Misc

Apache Airflow (2y), Airbyte (1y), Celery + RabbitMQ (3y), FFmpeg (2y)

Education

Bachelor in Computer Engineering

Tribhuvan University, Western Regional Campus (IOE) · Pokhara, Nepal · 2016 · 2020

Certifications & Recognition

  • Technical Reviewer, Python for DevOps

    Packt Publishing · 2023
  • Generative AI: From GANs to CLIP with Python and PyTorch

    Udemy · 2021
  • AWS AI/ML Internship

    Genese Cloud Academy · 2020 · 2021

Contact