The Challenge
Our client is a fast-growing HRTech platform used by enterprise recruiting teams across India and Southeast Asia to conduct structured video interviews at scale. As hiring volumes surged — particularly during campus recruitment seasons — their video infrastructure began showing serious cracks.
The core problems:
- Latency spikes of 4–6 seconds during peak load windows (morning batches, 9–11 AM IST)
- Dropped sessions when more than 800 concurrent interviews were in progress
- No auto-scaling — the infrastructure was provisioned for average load, not peak
- Manual incident response — engineers were on-call 24×7 during recruitment drives, manually restarting services
Each dropped video session represented a real cost: a candidate left with a poor impression, a recruiter who lost confidence in the platform, and a potential enterprise renewal at risk.
What We Did
Phase 1 — Infrastructure Audit & Architecture Design (Week 1–2)
KubeAce conducted a full infrastructure audit and identified the root causes: a single-region, vertically-scaled deployment with no SFU layer and no horizontal scaling capability. We designed a replacement architecture based on:
- LiveKit SFU as the media server layer, with selective forwarding to minimise per-session bandwidth (a minimal config sketch follows this list)
- Amazon EKS as the Kubernetes control plane, with managed node groups for baseline capacity and Karpenter for real-time node auto-scaling
- Multi-region SFU pods deployed in `ap-south-1` (Mumbai), `ap-southeast-1` (Singapore), and `eu-central-1` (Frankfurt), routed by geo-proximity
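For context on the SFU layer, LiveKit is configured through a YAML file mounted into each pod. The snippet below is a minimal, illustrative sketch rather than the client's actual configuration: the ports, Redis address, TURN domain, and API key are placeholder values, and Redis is shown because LiveKit relies on it to coordinate session routing when more than one SFU node is running.

```yaml
# livekit.yaml: illustrative sketch only; all values are placeholders
port: 7880                      # HTTP/WebSocket signalling port
rtc:
  tcp_port: 7881                # TCP fallback for restrictive networks
  port_range_start: 50000       # UDP port range used for media
  port_range_end: 60000
  use_external_ip: true         # advertise the node's public IP to clients
redis:
  address: redis.livekit.svc.cluster.local:6379   # required for multi-node SFU routing
keys:
  placeholder-api-key: placeholder-api-secret     # replace with real credentials
turn:
  enabled: true
  domain: turn.example.com      # placeholder TURN domain
  tls_port: 5349
```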
Phase 2 — Kubernetes Migration (Week 3–5)
We built the full infrastructure as code using Terraform, then deployed:
- EKS cluster with dedicated node groups for SFU workloads (compute-optimised C5 instances) and general workloads
- Karpenter for sub-60-second node provisioning during load spikes (a NodePool sketch follows this list)
- LiveKit SFU pods with a Horizontal Pod Autoscaler driven by an active-session-count metric, exposed as a custom metric via the Prometheus Adapter (HPA sketch below)
- TURN/STUN server cluster on separate node groups to handle NAT traversal without contending with SFU CPU
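As a rough illustration of the Karpenter side, the sketch below shows a NodePool of the kind used for SFU workloads. It assumes the `karpenter.sh/v1beta1` API and a separately defined EC2NodeClass named `sfu-nodeclass`; the taint, CPU limit, and capacity-type requirement are illustrative choices, not the client's exact values.

```yaml
# Karpenter NodePool sketch (karpenter.sh/v1beta1); values are illustrative
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: sfu-compute
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c"]                     # compute-optimised families (e.g. C5)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]             # avoid spot interruptions mid-interview
      nodeClassRef:
        name: sfu-nodeclass                 # hypothetical EC2NodeClass defined elsewhere
      taints:
        - key: workload
          value: sfu
          effect: NoSchedule                # keep general workloads off SFU nodes
  limits:
    cpu: "512"                              # ceiling on total provisioned vCPU
  disruption:
    consolidationPolicy: WhenEmpty          # reclaim nodes once sessions drain
```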
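The session-count-driven scaling itself is a standard `autoscaling/v2` HorizontalPodAutoscaler once the Prometheus Adapter exposes the metric through the custom metrics API. The metric name and per-pod target below are assumptions for illustration, not the client's actual configuration.

```yaml
# HPA sketch driven by a custom per-pod metric served by the Prometheus Adapter
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: livekit-sfu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: livekit-sfu
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      metric:
        name: livekit_active_sessions      # hypothetical metric name exposed by the adapter
      target:
        type: AverageValue
        averageValue: "80"                 # assumed target of ~80 sessions per pod
```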
Phase 3 — CI/CD & Observability (Week 5–6)
- ArgoCD GitOps pipeline: every merge to `main` triggers a staged rollout (canary → 10% → 50% → 100%) with automatic rollback on error-rate breach; a canary Rollout sketch follows this list
- Prometheus + Grafana dashboards for SFU session counts, packet loss, jitter, and latency per region
- PagerDuty integration with SLO-based alerting — engineers are only paged when a real SLO breach occurs, not on every blip
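ArgoCD handles sync from Git; the staged canary progression described above is the kind of behaviour typically implemented with Argo Rollouts alongside it, so the sketch below assumes that controller is in place. The weights and pause durations mirror the stages listed above, while the AnalysisTemplate name (`error-rate-check`), replica count, and image tag are placeholders.

```yaml
# Argo Rollouts canary sketch matching the 10% -> 50% -> 100% progression
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: livekit-sfu
spec:
  replicas: 6
  selector:
    matchLabels:
      app: livekit-sfu
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 10m}           # hold at 10% while metrics are evaluated
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
      analysis:
        templates:
          - templateName: error-rate-check # hypothetical AnalysisTemplate; a failed
                                           # analysis aborts the rollout and reverts to stable
  template:
    metadata:
      labels:
        app: livekit-sfu
    spec:
      containers:
        - name: livekit
          image: livekit/livekit-server:latest   # placeholder image tag
          ports:
            - containerPort: 7880
```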
Phase 4 — Load Testing & Go-Live
We simulated 12,000 concurrent video sessions using a purpose-built load test harness before go-live. The cluster scaled from 3 nodes to 27 nodes in under 4 minutes and sustained peak load with stable latency throughout.
Results
| Metric | Before | After |
|---|---|---|
| Peak concurrent sessions | ~800 | 10,000+ |
| Median latency | 4,200ms | 74ms |
| Platform uptime (30-day) | 94.1% | 99.95% |
| Infrastructure cost (monthly) | ₹18L | ₹7.2L |
| On-call incidents per month | 22 | 1 |
The engineering team went from fighting fires during every recruitment season to running their largest-ever hiring drive — 8,400 simultaneous interviews — with zero incidents and no engineers on-call.
Technologies Used
- Kubernetes (EKS) — container orchestration across 3 regions
- LiveKit SFU — selective forwarding unit for real-time video and audio
- Karpenter — just-in-time node provisioning for burst scaling
- Terraform — full infrastructure as code
- ArgoCD — GitOps continuous delivery with canary rollouts
- Prometheus + Grafana — observability and SLO dashboards
- TURN/STUN — NAT traversal for WebRTC reliability