
Kubernetes Production Best Practices: The Definitive 2025 Checklist

Battle-tested Kubernetes production checklist: security hardening, reliability patterns, observability, and FinOps — from engineers managing 120+ clusters.


Founder & CTO, KubeAce


Running Kubernetes in production is categorically different from running it in a sandbox. After managing 120+ production clusters across EKS, GKE, AKS, and RKE2, we’ve distilled everything that separates stable, cost-efficient clusters from the ones that wake engineers up at 2 AM.

This is our living checklist. We update it quarterly.

1. Security: Assume Breach

Pod Security Admission (PSA)

Kubernetes 1.25 removed PodSecurityPolicy. Use Pod Security Admission (PSA) instead, enforcing the restricted profile for workloads that don’t need elevated permissions:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

For legacy workloads that genuinely need the privileged profile, isolate them in dedicated namespaces with compensating controls.

RBAC: Principle of Least Privilege

Never use cluster-admin for service accounts. Always scope permissions to the minimum required:

# Bad: ClusterRole with wildcard
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

# Good: Minimal namespaced permissions.
# Note: list and watch cannot be restricted by resourceName,
# so they need a separate rule without it.
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "update", "patch"]
  resourceNames: ["my-specific-deployment"]

Audit your RBAC regularly with kubectl-who-can or rbac-lookup.

Network Policies

By default, Kubernetes allows all pod-to-pod communication, which hands an attacker a massive blast radius in a breach. Apply default-deny policies:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Then add explicit allowances only where needed.
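As an illustration (the namespace and labels here are hypothetical), an allowance that lets only frontend pods reach the API pods on port 8080 might look like:

```yaml
# Hypothetical example: allow only frontend pods to reach api pods on 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
```

Note that the default-deny above also blocks egress DNS, so most namespaces will additionally need an egress rule allowing UDP and TCP port 53 to the cluster DNS.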

Secrets Management

Do not store secrets in ConfigMaps or environment variables from unencrypted Secrets. Use:

  • External Secrets Operator with AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault
  • etcd encryption at rest (surprisingly not on by default in every managed service)
  • Sealed Secrets or SOPS for anything committed to Git
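A minimal External Secrets Operator sketch, assuming a ClusterSecretStore named aws-secrets-manager already exists and using an illustrative remote secret path:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager   # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: db-credentials        # Kubernetes Secret the operator creates
  data:
  - secretKey: password
    remoteRef:
      key: prod/db              # illustrative path in AWS Secrets Manager
      property: password
```

The operator keeps the Kubernetes Secret in sync with the external store, so rotation happens without redeploying workloads.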

2. Reliability: Build for Failure

Resource Requests and Limits

This is where most clusters go wrong. Without requests, the scheduler can’t make intelligent placement decisions. Without limits, a runaway process can starve neighbours.

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1000m"      # Be careful with CPU limits — they cause throttling
    memory: "512Mi"   # Memory limits are hard limits — OOM kill

Pro tip: CPU limits cause throttling even when nodes have spare capacity. For latency-sensitive workloads, set requests but omit CPU limits and use LimitRange defaults instead.
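A LimitRange along those lines (values are illustrative) can default requests per namespace while leaving CPU unlimited:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: "250m"
      memory: "256Mi"
    default:
      memory: "512Mi"   # default memory limit only; no CPU limit, so no throttling
```

Containers that declare nothing pick up these defaults, so the scheduler always has requests to work with.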

Pod Disruption Budgets

PDBs ensure that a node drain or cluster upgrade never takes down too many replicas of a service at once:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2   # or maxUnavailable: 1
  selector:
    matchLabels:
      app: api-server

Anti-Affinity Rules

Don’t let all replicas land on the same node. Prefer soft affinity to avoid scheduling deadlocks:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: api-server

Liveness vs Readiness vs Startup Probes

These three probes are distinct and commonly confused:

  • Startup probe: Is the app done initialising? Prevents premature liveness kills during slow starts.
  • Readiness probe: Should this pod receive traffic? Failed = removed from Service endpoints.
  • Liveness probe: Is the app stuck in an unrecoverable state (e.g. deadlocked)? Failed = container restart.

Misconfiguring liveness probes is the single most common cause of self-inflicted outages we see.
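Put together, a sketch of all three probes (endpoints, ports, and timings here are assumptions, not a prescription):

```yaml
# Illustrative probe configuration for a container spec
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30    # allow up to 30 * 5s = 150s for slow starts
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /ready          # should check dependencies needed to serve traffic
    port: 8080
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz        # should be cheap and dependency-free
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```

Keep the liveness endpoint free of external dependency checks: if liveness fails whenever the database blips, Kubernetes restarts your entire fleet during the outage.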

3. Observability: If You Can’t Measure It, You Can’t Fix It

The Four Golden Signals

Instrument every service for:

  1. Latency — P50, P95, P99, not just averages
  2. Traffic — requests per second, not just total count
  3. Errors — 4xx vs 5xx, client vs server
  4. Saturation — CPU throttling %, memory pressure, queue depth
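As a sketch of what this looks like in practice, assuming the Prometheus Operator is installed and the service exposes a histogram named http_request_duration_seconds (both assumptions), a P99 latency alert could be expressed as:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-golden-signals
  namespace: monitoring
spec:
  groups:
  - name: latency
    rules:
    - alert: HighP99Latency
      expr: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket{service="api"}[5m])) by (le)
        ) > 0.5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "api P99 latency above 500ms for 10 minutes"
```

The same pattern extends to traffic, errors, and saturation with different expressions.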

Structured Logging

Use structured JSON logging from day one. Parse-friendly logs are invaluable in Loki or CloudWatch:

{"level":"error","time":"2025-01-15T10:23:45Z","msg":"database connection failed","service":"api","retry":3,"latency_ms":2341}

Distributed Tracing

Instrument with the OpenTelemetry SDK and send traces to Grafana Tempo or Jaeger. Correlate trace IDs across logs, metrics, and traces. In our experience this cuts MTTR by 60-80% on complex microservice issues.

4. Cost: FinOps as a Practice

Cluster Autoscaler vs KEDA vs Karpenter

  • Cluster Autoscaler: Reacts to pending pods. Slow (1-3 minute scale-out). Works everywhere.
  • Karpenter: Provisions right-sized nodes in seconds and supports any instance type. Originated on AWS and dramatically more efficient on EKS; an Azure provider is also available.
  • KEDA: Scales workloads based on external metrics (queue depth, Kafka lag, Prometheus). Pair with Cluster Autoscaler.

For most EKS users in 2025: Karpenter + KEDA is the winning combination.
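A minimal KEDA sketch for the queue-depth case, using the Prometheus scaler (the Deployment name, server address, query, and threshold are all assumptions):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: worker            # hypothetical Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: sum(queue_depth{queue="jobs"})
      threshold: "100"      # target roughly one replica per 100 queued jobs
```

KEDA scales the Deployment; the node autoscaler then adds capacity for the resulting pending pods.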

Spot/Preemptible Instances

Running stateless workloads on Spot? You should be. Best practices:

  • Run at least one on-demand node per availability zone for cluster-critical components
  • Use PodDisruptionBudgets and graceful shutdown handling
  • Diversify across 5+ instance types to reduce the impact of spot interruptions
  • Deploy a spot interruption handler (e.g. AWS Node Termination Handler)
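With Karpenter, the diversification advice above maps onto a NodePool spanning several instance families. A sketch against the v1 API, where the pool name and instance categories are illustrative and an EC2NodeClass named "default" is assumed to exist:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-general
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]   # diversify across compute/general/memory families
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default             # assumed pre-existing EC2NodeClass
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```

The wide requirements give Karpenter a large pool of spot capacity to choose from, which is exactly what keeps interruption rates low.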

Namespace-Level Cost Tracking

Use Kubecost or OpenCost to attribute spend to teams and namespaces. Showback → Chargeback → Optimisation. Without attribution, nobody owns the bill.

5. Upgrades: Do Them Frequently

Falling behind on Kubernetes versions is a security and supportability risk. Managed services (EKS, GKE) make in-place upgrades relatively safe.

Our approach:

  1. Keep a non-production cluster one minor version ahead of production
  2. Test workload compatibility in staging with Pluto (a deprecated-API detector)
  3. Upgrade the control plane first, then node groups, one at a time
  4. Do this every 6-8 months, not once a year

Conclusion

No single item on this list will save you. Production readiness is the intersection of all of them. Start with the quick wins (resource limits, PDBs, structured logging) and work toward the more complex items (FinOps, distributed tracing, multi-cluster).

If you’d like KubeAce to conduct a free production readiness audit of your Kubernetes cluster, schedule a call with our team. We typically find 10-15 actionable improvements in the first session.