"I don't just keep the lights on. I make sure the building was designed so the lights can't go off."
@ Qubole (Nasdaq-listed big data platform) · multi-cloud production · enterprise SLAs
| i | Cut cloud spend 30% (~$60K/year) across AWS + GCP — per-service dashboards, reserved capacity, spot for batch |
| ii | MTTR 60 min → under 20 — 15+ PagerDuty-wired runbooks. Traced a live regression, rolled back in 18 minutes |
| iii | SOC2 Type II in 6 months — IRSA, Workload Identity, Trivy + Snyk in CI. Zero long-lived credentials. Zero findings |
| iv | Zero-downtime MySQL 5.7 → 8.0 — blue-green deploys, DMS lag monitoring, two weeks parallel validation |
| v | P0 EKS node failure recovered in <30 min — drained nodes, shifted ALB weights mid-enterprise-traffic |
| vi | Argo Rollouts canary — bad deploys caught at 5% traffic, auto-rolled back. Weekly batches → multiple daily releases |
| vii | <2 min RTO on AZ failure — quarterly DR drills via AWS FIS. Chaos surfaced 4 failure modes before production did |
| 4 yrs production SRE | $60K saved / year | 99.9% SLO · 10+ services | <20 min MTTR | 65% CVEs reduced |
|---|---|---|---|---|
terraform-aws-eks-fargate-cluster ★ 32 ⑂ 59
Production-ready Terraform module — EKS with Fargate, VPC, IRSA, RBAC wired from day one. Used by teams who don't want to start from scratch. Battle-tested at Qubole.
SRE war stories, cloud cost engineering, and infrastructure deep-dives — harshetjain.medium.com →
AWS Certified Solutions Architect · RHCE · RHCSA · Red Hat Containers & Kubernetes
AWS Community Builder · Qubole, New Delhi · open to async-first remote roles worldwide

