The Challenge
Our client is a fast-growing HRTech platform used by enterprise recruiting teams across India and Southeast Asia to conduct structured video interviews at scale. As hiring volumes surged — particularly during campus recruitment seasons — their video infrastructure began showing serious cracks.
The core problems:
- Latency spikes of 4–6 seconds during peak load windows (morning batches, 9–11 AM IST)
- Dropped sessions when more than 800 concurrent interviews were in progress
- No auto-scaling — the infrastructure was provisioned for average load, not peak
- Manual incident response — engineers were on-call 24×7 during recruitment drives, manually restarting services
Each dropped video session represented a real cost: a candidate left with a poor impression, a recruiter who lost confidence in the platform, and a potential enterprise renewal at risk.
What We Did
Phase 1 — Infrastructure Audit & Architecture Design (Week 1–2)
KubeAce conducted a full infrastructure audit and identified the root causes: a single-region, vertically-scaled deployment with no SFU layer and no horizontal scaling capability. We designed a replacement architecture based on:
- LiveKit SFU as the media server layer, with selective forwarding to minimise per-session bandwidth (a minimal config sketch follows this list)
- Amazon EKS as the Kubernetes control plane, with managed node groups for baseline capacity and Karpenter for real-time node auto-scaling
- Multi-region SFU pods deployed in `ap-south-1` (Mumbai), `ap-southeast-1` (Singapore), and `eu-central-1` (Frankfurt), routed by geo-proximity
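For context on the SFU layer, LiveKit is configured through a YAML file mounted into each pod. The snippet below is a minimal, illustrative sketch rather than the client's actual configuration: the ports, Redis address, TURN domain, and API key are placeholder values, and Redis is shown because LiveKit relies on it to coordinate session routing when more than one SFU node is running.

```yaml
# livekit.yaml: illustrative sketch only; all values are placeholders
port: 7880                      # HTTP/WebSocket signalling port
rtc:
  tcp_port: 7881                # TCP fallback for restrictive networks
  port_range_start: 50000       # UDP port range used for media
  port_range_end: 60000
  use_external_ip: true         # advertise the node's public IP to clients
redis:
  address: redis.livekit.svc.cluster.local:6379   # required for multi-node SFU routing
keys:
  placeholder-api-key: placeholder-api-secret     # replace with real credentials
turn:
  enabled: true
  domain: turn.example.com      # placeholder TURN domain
  tls_port: 5349
```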
Phase 2 — Kubernetes Migration (Week 3–5)
We built the full infrastructure as code using Terraform, then deployed:
- EKS cluster with dedicated node groups for SFU workloads (compute-optimised C5 instances) and general workloads
- Karpenter for sub-60-second node provisioning during load spikes (a NodePool sketch follows this list)
- LiveKit SFU pods with a Horizontal Pod Autoscaler driven by an active-session-count metric, exposed as a custom metric via the Prometheus Adapter (HPA sketch below)
- TURN/STUN server cluster on separate node groups to handle NAT traversal without contending with SFU CPU
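As a rough illustration of the Karpenter side, the sketch below shows a NodePool of the kind used for SFU workloads. It assumes the `karpenter.sh/v1beta1` API and a separately defined EC2NodeClass named `sfu-nodeclass`; the taint, CPU limit, and capacity-type requirement are illustrative choices, not the client's exact values.

```yaml
# Karpenter NodePool sketch (karpenter.sh/v1beta1); values are illustrative
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: sfu-compute
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c"]                     # compute-optimised families (e.g. C5)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]             # avoid spot interruptions mid-interview
      nodeClassRef:
        name: sfu-nodeclass                 # hypothetical EC2NodeClass defined elsewhere
      taints:
        - key: workload
          value: sfu
          effect: NoSchedule                # keep general workloads off SFU nodes
  limits:
    cpu: "512"                              # ceiling on total provisioned vCPU
  disruption:
    consolidationPolicy: WhenEmpty          # reclaim nodes once sessions drain
```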
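The session-count-driven scaling itself is a standard `autoscaling/v2` HorizontalPodAutoscaler once the Prometheus Adapter exposes the metric through the custom metrics API. The metric name and per-pod target below are assumptions for illustration, not the client's actual configuration.

```yaml
# HPA sketch driven by a custom per-pod metric served by the Prometheus Adapter
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: livekit-sfu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: livekit-sfu
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      metric:
        name: livekit_active_sessions      # hypothetical metric name exposed by the adapter
      target:
        type: AverageValue
        averageValue: "80"                 # assumed target of ~80 sessions per pod
```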
Phase 3 — CI/CD & Observability (Week 5–6)
- ArgoCD GitOps pipeline: every merge to `main` triggers a staged rollout (canary → 10% → 50% → 100%) with automatic rollback on error-rate breach; a canary Rollout sketch follows this list
- Prometheus + Grafana dashboards for SFU session counts, packet loss, jitter, and latency per region
- PagerDuty integration with SLO-based alerting — engineers are only paged when a real SLO breach occurs, not on every blip
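ArgoCD handles sync from Git; the staged canary progression described above is the kind of behaviour typically implemented with Argo Rollouts alongside it, so the sketch below assumes that controller is in place. The weights and pause durations mirror the stages listed above, while the AnalysisTemplate name (`error-rate-check`), replica count, and image tag are placeholders.

```yaml
# Argo Rollouts canary sketch matching the 10% -> 50% -> 100% progression
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: livekit-sfu
spec:
  replicas: 6
  selector:
    matchLabels:
      app: livekit-sfu
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 10m}           # hold at 10% while metrics are evaluated
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
      analysis:
        templates:
          - templateName: error-rate-check # hypothetical AnalysisTemplate; a failed
                                           # analysis aborts the rollout and reverts to stable
  template:
    metadata:
      labels:
        app: livekit-sfu
    spec:
      containers:
        - name: livekit
          image: livekit/livekit-server:latest   # placeholder image tag
          ports:
            - containerPort: 7880
```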
Phase 4 — Load Testing & Go-Live
We simulated 12,000 concurrent video sessions using a purpose-built load test harness before go-live. The cluster scaled from 3 nodes to 27 nodes in under 4 minutes and sustained peak load with stable latency throughout.
Results
| Metric | Before | After |
|---|---|---|
| Peak concurrent sessions | ~800 | 10,000+ |
| Median latency | 4,200ms | 74ms |
| Platform uptime (30-day) | 94.1% | 99.95% |
| Infrastructure cost (monthly) | ₹18L | ₹7.2L |
| On-call incidents per month | 22 | 1 |
The engineering team went from fighting fires during every recruitment season to running their largest-ever hiring drive — 8,400 simultaneous interviews — with zero incidents and no engineers on-call.
Technologies Used
- Kubernetes (EKS) — container orchestration across 3 regions
- LiveKit SFU — selective forwarding unit for real-time video and audio
- Karpenter — just-in-time node provisioning for burst scaling
- Terraform — full infrastructure as code
- ArgoCD — GitOps continuous delivery with canary rollouts
- Prometheus + Grafana — observability and SLO dashboards
- TURN/STUN — NAT traversal for WebRTC reliability