Running Kubernetes in production is categorically different from running it in a sandbox. After managing 120+ production clusters across EKS, GKE, AKS, and RKE2, we've distilled everything that separates stable, cost-efficient clusters from the ones that wake engineers up at 2 AM.
This is our living checklist. We update it quarterly.
1. Security: Assume Breach
Pod Security Admission (PSA)
Kubernetes 1.25 removed PodSecurityPolicy. You should be using Pod Security Admission (PSA) to enforce the restricted Pod Security Standard for workloads that don't need elevated permissions:
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
For legacy workloads that need privileged, isolate them in dedicated namespaces with compensating controls.
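Such a quarantine namespace might pin the privileged profile explicitly while still auditing and warning at baseline, so violations stay visible for a future migration. The namespace name below is illustrative:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: legacy-privileged   # hypothetical namespace for legacy workloads
  labels:
    pod-security.kubernetes.io/enforce: privileged
    # Keep surfacing what a future migration to baseline would need to fix
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/warn: baseline
```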
RBAC: Principle of Least Privilege
Never use cluster-admin for service accounts. Always scope permissions to the minimum required:
# Bad: ClusterRole with wildcard permissions
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

# Good: minimal namespaced permissions
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "update", "patch"]
  resourceNames: ["my-specific-deployment"]
Audit your RBAC regularly with kubectl-who-can or rbac-lookup.
Network Policies
By default, Kubernetes allows all pod-to-pod communication. This is a massive blast radius in a breach. Apply default-deny policies:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
Then add explicit allowances only where needed.
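One such allowance might permit ingress to the API pods only from the ingress controller, on the service port. Labels and port here are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-api    # illustrative name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: ingress-nginx    # assumed ingress controller label
    ports:
    - protocol: TCP
      port: 8080
```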
Secrets Management
Do not store secrets in ConfigMaps, and do not inject them as environment variables from unencrypted Secrets. Instead:
- Use External Secrets Operator with AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault
- Enable etcd encryption at rest (surprisingly, it is not on by default in every managed service)
- Keep secrets in Git encrypted with Sealed Secrets or SOPS
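As a sketch of the External Secrets Operator approach, an ExternalSecret pulls a value from the external store and materialises it as a Kubernetes Secret. The store name, remote key path, and target name are assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager   # assumed ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: db-credentials        # the Kubernetes Secret that gets created
  data:
  - secretKey: password
    remoteRef:
      key: prod/db              # assumed path in AWS Secrets Manager
      property: password
```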
2. Reliability: Build for Failure
Resource Requests and Limits
This is where most clusters go wrong. Without requests, the scheduler can’t make intelligent placement decisions. Without limits, a runaway process can starve neighbours.
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1000m"    # Careful with CPU limits: they cause throttling
    memory: "512Mi" # Memory limits are hard limits: exceeding them means an OOM kill
Pro tip: CPU limits cause throttling even when nodes have spare capacity. For latency-sensitive workloads, set requests but omit CPU limits and use LimitRange defaults instead.
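A LimitRange that applies such defaults namespace-wide, so containers without explicit values still schedule sensibly, might be sketched as follows (values mirror the example above; note the deliberately absent default CPU limit):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: production-defaults
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: "250m"
      memory: "256Mi"
    default:
      memory: "512Mi"   # default memory limit; no CPU limit, to avoid throttling
```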
Pod Disruption Budgets
Before any node drain or cluster upgrade, PDBs ensure you never lose all replicas of a service:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2  # or maxUnavailable: 1
  selector:
    matchLabels:
      app: api-server
Anti-Affinity Rules
Don’t let all replicas land on the same node. Prefer soft affinity to avoid scheduling deadlocks:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: api-server
Liveness vs Readiness vs Startup Probes
These three probes are distinct and commonly confused:
- Startup probe: Is the app done initialising? Prevents premature liveness kills during slow starts.
- Readiness probe: Should this pod receive traffic? Failed = removed from Service endpoints.
- Liveness probe: Is the app alive, or stuck in an unrecoverable state? Failed = container restart.
Misconfiguring liveness probes is the single most common cause of self-inflicted outages we see.
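Putting the three probes together for a slow-starting HTTP service might look like this (paths, ports, and timings are illustrative):

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # allow up to 30 * 10s = 5 minutes to initialise
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready         # may check dependencies (DB, caches) are reachable
    port: 8080
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /healthz       # cheap self-check only; never probe dependencies here
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```

Probing dependencies from the liveness endpoint is a classic mistake: a flaky database then restarts every pod in the fleet at once.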
3. Observability: If You Can’t Measure It, You Can’t Fix It
The Four Golden Signals
Instrument every service for:
- Latency — P50, P95, P99, not just averages
- Traffic — requests per second, not just total count
- Errors — 4xx vs 5xx, client vs server
- Saturation — CPU throttling %, memory pressure, queue depth
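The latency signal, for instance, can be captured as a Prometheus alerting rule over a request-duration histogram. The metric name and threshold below are assumptions:

```yaml
groups:
- name: golden-signals
  rules:
  - alert: HighP99Latency
    # p99 over a 5m window; the histogram metric name is an assumption
    expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "p99 latency above 500ms for {{ $labels.service }}"
```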
Structured Logging
Use structured JSON logging from day one. Parse-friendly logs are invaluable in Loki or CloudWatch:
{"level":"error","time":"2025-01-15T10:23:45Z","msg":"database connection failed","service":"api","retry":3,"latency_ms":2341}
Distributed Tracing
Instrument with OpenTelemetry SDK and send traces to Grafana Tempo or Jaeger. Correlate trace IDs across logs, metrics, and traces. This cuts MTTR by 60-80% for complex microservice issues.
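A minimal OpenTelemetry Collector pipeline forwarding traces to Tempo could be sketched as follows (the Tempo endpoint is an assumption):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch: {}
exporters:
  otlp:
    endpoint: tempo.monitoring:4317   # assumed Tempo OTLP endpoint
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```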
4. Cost: FinOps as a Practice
Cluster Autoscaler vs KEDA vs Karpenter
- Cluster Autoscaler: Reacts to pending pods. Slow (1-3 minute scale-out). Works everywhere.
- Karpenter (AWS-native): Provisions nodes in seconds, supports any instance type. Dramatically more efficient on EKS.
- KEDA: Scales workloads based on external metrics (queue depth, Kafka lag, Prometheus). Pair with Cluster Autoscaler.
For most EKS users in 2025: Karpenter + KEDA is the winning combination.
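As an example of the pairing, a KEDA ScaledObject scales a Deployment on queue depth while Karpenter provisions the nodes the new pods need. The Deployment name, queue URL, and threshold are placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: queue-worker          # assumed Deployment name
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.eu-west-1.amazonaws.com/123456789012/jobs   # placeholder
      queueLength: "20"         # target messages per replica
      awsRegion: eu-west-1
```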
Spot/Preemptible Instances
Running stateless workloads on Spot? You should be. Best practices:
- Run at least one on-demand node per availability zone for cluster-critical components
- Use PodDisruptionBudgets and graceful shutdown handling
- Diversify across 5+ instance types to reduce the chance of simultaneous spot interruptions
- Run a spot interruption handler (AWS Node Termination Handler)
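A Karpenter NodePool that targets diversified spot capacity might be sketched like this, assuming the Karpenter v1 API on EKS; the requirements, limits, and EC2NodeClass name are illustrative:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-general
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]        # leave instance types unconstrained to diversify
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default           # assumed EC2NodeClass
  limits:
    cpu: "200"                  # cap total provisioned capacity
```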
Namespace-Level Cost Tracking
Use Kubecost or OpenCost to attribute spend to teams and namespaces. Showback → Chargeback → Optimisation. Without attribution, nobody owns the bill.
5. Upgrades: Do Them Frequently
Falling behind on Kubernetes versions is a security and supportability risk. Managed services (EKS, GKE) make in-place upgrades relatively safe.
Our approach:
- Keep a non-production cluster 1 version ahead of production
- Test workload compatibility in staging with pluto (a deprecated-API detector)
- Upgrade the control plane first, then node groups, one at a time
- Do this every 6-8 months, not once a year
Conclusion
No single item on this list will save you. Production readiness is the intersection of all of them. Start with the quick wins (resource limits, PDBs, structured logging) and work toward the more complex items (FinOps, distributed tracing, multi-cluster).
If you’d like KubeAce to conduct a free production readiness audit of your Kubernetes cluster, schedule a call with our team. We typically find 10-15 actionable improvements in the first session.