
Case Study: Enterprise HRTech Platform

Scaling a Video Interview Platform to 10,000 Concurrent Sessions on Kubernetes

How KubeAce rebuilt a struggling video interview platform on Kubernetes with LiveKit SFU — cutting latency from 4s to under 80ms at scale.

Kubernetes · EKS · LiveKit · ArgoCD · Terraform · Karpenter · Prometheus · Grafana
  • 10K+ concurrent sessions
  • <80ms end-to-end latency
  • 99.95% platform uptime
  • 60% infrastructure cost saved

The Challenge

The client's legacy self-hosted video infrastructure buckled under peak hiring season load — dropping calls, spiking latency to 4+ seconds, and failing to recover gracefully. Every engineering incident during a candidate interview damaged recruiter trust and cost deals.

Our Solution

KubeAce migrated the client's video infrastructure to a LiveKit SFU cluster on EKS, with auto-scaling node groups, geo-distributed SFU pods across Mumbai, Singapore, and Frankfurt, and a zero-downtime canary deployment pipeline managed via ArgoCD.

The Challenge

Our client is a fast-growing HRTech platform used by enterprise recruiting teams across India and Southeast Asia to conduct structured video interviews at scale. As hiring volumes surged — particularly during campus recruitment seasons — their video infrastructure began showing serious cracks.

The core problems:

  • Latency spikes of 4–6 seconds during peak load windows (morning batches, 9–11 AM IST)
  • Dropped sessions when more than 800 concurrent interviews were in progress
  • No auto-scaling — the infrastructure was provisioned for average load, not peak
  • Manual incident response — engineers were on-call 24×7 during recruitment drives, manually restarting services

Each dropped video session represented a real cost: a candidate left with a poor impression, a recruiter who lost confidence in the platform, and a potential enterprise renewal at risk.

What We Did

Phase 1 — Infrastructure Audit & Architecture Design (Weeks 1–2)

KubeAce conducted a full infrastructure audit and identified the root causes: a single-region, vertically-scaled deployment with no SFU layer and no horizontal scaling capability. We designed a replacement architecture based on:

  • LiveKit SFU as the media server layer, with selective forwarding to minimise bandwidth per session
  • Amazon EKS as the managed Kubernetes control plane, with managed node groups for baseline capacity and Karpenter for real-time burst auto-scaling (see the sketch after this list)
  • Multi-region SFU pods deployed in ap-south-1 (Mumbai), ap-southeast-1 (Singapore), and eu-central-1 (Frankfurt) — routed by geo-proximity
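
To keep SFU capacity elastic without pre-provisioning for peak, Karpenter provisions compute-optimised nodes on demand. A minimal NodePool sketch, assuming Karpenter's v1beta1 API and a pre-existing EC2NodeClass named sfu; the names, limits, and durations here are illustrative, not taken from the engagement:

  apiVersion: karpenter.sh/v1beta1
  kind: NodePool
  metadata:
    name: sfu
  spec:
    template:
      metadata:
        labels:
          workload: sfu
      spec:
        nodeClassRef:
          apiVersion: karpenter.k8s.aws/v1beta1
          kind: EC2NodeClass
          name: sfu
        requirements:
          - key: karpenter.k8s.aws/instance-category
            operator: In
            values: ["c"]          # compute-optimised families only
          - key: karpenter.sh/capacity-type
            operator: In
            values: ["on-demand"]  # avoid spot interruptions mid-interview
        taints:
          - key: workload
            value: sfu
            effect: NoSchedule     # keep general workloads off SFU nodes
    limits:
      cpu: "512"                   # hard ceiling on total SFU compute
    disruption:
      consolidationPolicy: WhenEmpty
      consolidateAfter: 5m         # scale in only once nodes are fully drained

Taint-isolating the SFU pool matters because media forwarding is latency-sensitive; a noisy batch job sharing a node would show up directly as jitter in live interviews.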

Phase 2 — Kubernetes Migration (Weeks 3–5)

We built the full infrastructure as code using Terraform, then deployed:

  • EKS cluster with dedicated node groups for SFU workloads (compute-optimised C5 instances) and general workloads
  • Karpenter for sub-60-second node provisioning during load spikes
  • LiveKit SFU pods with a horizontal pod autoscaler tied to active-session-count metrics, exposed as custom metrics via Prometheus Adapter (see the sketch after this list)
  • TURN/STUN server cluster on separate node groups to handle NAT traversal without contending with SFU CPU
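
Scaling SFU pods on session count rather than CPU keeps headroom tied to actual interview load. A minimal HorizontalPodAutoscaler sketch, assuming a livekit-sfu Deployment and a per-pod custom metric named livekit_active_sessions served through Prometheus Adapter (both names are illustrative):

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: livekit-sfu
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: livekit-sfu
    minReplicas: 6
    maxReplicas: 120
    metrics:
      - type: Pods
        pods:
          metric:
            name: livekit_active_sessions   # illustrative name, exposed via Prometheus Adapter
          target:
            type: AverageValue
            averageValue: "80"              # aim for ~80 active sessions per SFU pod
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 600     # scale in slowly so live interviews are never evicted

The long scale-down window is the key design choice for media workloads: scaling up must be fast, but scaling down must wait for sessions to drain, since evicting an SFU pod drops every interview it carries.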

Phase 3 — CI/CD & Observability (Weeks 5–6)

  • ArgoCD GitOps pipeline: every merge to main triggers a staged rollout — canary → 10% → 50% → 100% — with automatic rollback on error-rate breach (see the rollout sketch after this list)
  • Prometheus + Grafana dashboards for SFU session counts, packet loss, jitter, and latency per region
  • PagerDuty integration with SLO-based alerting — engineers are only paged when a real SLO breach occurs, not on every blip (see the alert-rule sketch after this list)
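
Percentage-staged canaries with metric-driven rollback are typically implemented with Argo Rollouts running alongside Argo CD. A minimal sketch of the canary strategy, assuming a Rollout for the SFU workload and an AnalysisTemplate named error-rate that queries Prometheus (all names illustrative):

  apiVersion: argoproj.io/v1alpha1
  kind: Rollout
  metadata:
    name: livekit-sfu
  spec:
    replicas: 12
    selector:
      matchLabels:
        app: livekit-sfu
    template:
      metadata:
        labels:
          app: livekit-sfu
      spec:
        containers:
          - name: sfu
            image: livekit/livekit-server   # pin a specific tag in the real pipeline
    strategy:
      canary:
        steps:
          - setWeight: 10
          - pause: {duration: 5m}
          - setWeight: 50
          - pause: {duration: 10m}          # final step promotes to 100%
        analysis:
          templates:
            - templateName: error-rate      # aborts and rolls back on breach

And for paging only on sustained SLO breaches, a hedged PrometheusRule sketch; the livekit_* metric names below are placeholders for whichever session counters the deployment actually exports:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: sfu-slo
  spec:
    groups:
      - name: sfu.slo
        rules:
          - alert: SFUSessionErrorBudgetBurn
            expr: |
              sum(rate(livekit_session_failures_total[5m]))
                / sum(rate(livekit_session_attempts_total[5m])) > 0.02
            for: 10m                        # sustained breach, not a blip
            labels:
              severity: page                # routed to PagerDuty
            annotations:
              summary: SFU session error rate above 2% for 10 minutes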

Phase 4 — Load Testing & Go-Live (Week 6)

We simulated 12,000 concurrent video sessions using a purpose-built load test harness before go-live. The cluster scaled from 3 nodes to 27 nodes in under 4 minutes and sustained peak load with stable latency throughout.
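
One way to generate that kind of load from inside the cluster is to fan LiveKit's own load tester out across Kubernetes Jobs. A sketch, assuming the livekit/livekit-cli image and its load-test subcommand; verify the image tag and flags against your CLI version, and note the endpoint and secret names are placeholders:

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: interview-load-test
  spec:
    parallelism: 50                 # 50 workers, each driving a slice of the target load
    completions: 50
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: load-test
            image: livekit/livekit-cli        # image name assumed; check tag and flags
            args:
              - load-test
              - --room=load-drill
              - --video-publishers=120
              - --subscribers=120
              - --duration=15m
            env:
              - name: LIVEKIT_URL
                value: wss://sfu.example.com  # placeholder LiveKit endpoint
            envFrom:
              - secretRef:
                  name: livekit-credentials   # supplies LIVEKIT_API_KEY / LIVEKIT_API_SECRET

Running the harness as parallel in-cluster Jobs also exercises the same network path real candidates use, so the test validates TURN capacity and Karpenter's node-provisioning speed, not just the SFU pods.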

Results

Metric                           Before      After
Peak concurrent sessions         ~800        10,000+
Median latency                   4,200 ms    74 ms
Platform uptime (30-day)         94.1%       99.95%
Infrastructure cost (monthly)    ₹18L        ₹7.2L
On-call incidents per month      22          1

The engineering team went from fighting fires during every recruitment season to running their largest-ever hiring drive — 8,400 simultaneous interviews — with zero incidents and no engineers on-call.

Technologies Used

  • Kubernetes (EKS) — container orchestration across 3 regions
  • LiveKit SFU — selective forwarding unit for real-time video and audio
  • Karpenter — just-in-time node provisioning for burst scaling
  • Terraform — full infrastructure as code
  • ArgoCD — GitOps continuous delivery with canary rollouts
  • Prometheus + Grafana — observability and SLO dashboards
  • TURN/STUN — NAT traversal for WebRTC reliability
