Why Kubernetes for Scaling
Kubernetes does not make scaling easy. It makes scaling possible at a level of reliability and control that manual server management cannot match. The investment in learning the system pays back in the first major traffic event you survive without intervention.
Horizontal Pod Autoscaler (HPA): CPU Is Not Enough
The basic HPA scales on CPU utilization. This works — but CPU is a lagging indicator. By the time CPU hits 80%, your users are already experiencing latency. Scale earlier and on metrics that matter to your workload.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # Scale at 60%, not 80%
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```

Target 60% CPU, not 80%. Kubernetes takes 30–60 seconds to spin up new pods; you need headroom for the scaling lag.
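If your cluster exposes custom metrics (for example via the Prometheus Adapter), the same `autoscaling/v2` HPA can scale on request rate instead of CPU. A sketch, assuming a per-pod metric named `http_requests_per_second` is served by the custom metrics API — the metric name and target value are illustrative:

```yaml
# Fragment for the HPA's spec.metrics list above.
# Assumes a metrics adapter (e.g. Prometheus Adapter) exposes
# the hypothetical per-pod metric http_requests_per_second.
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"   # Aim for ~100 req/s per pod
```

Request rate is a leading indicator: it rises the moment traffic arrives, before CPU catches up.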
KEDA: Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) scales based on external metrics — queue depth, Kafka lag, Redis list length, HTTP request rate. This is how you scale workers proportional to actual work, not proxy metrics.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: email-worker-scaler
spec:
  scaleTargetRef:
    name: email-worker
  minReplicaCount: 0   # Scale to zero when queue is empty
  maxReplicaCount: 50
  triggers:
    - type: redis
      metadata:
        listName: bull:emails:wait
        listLength: "10"   # One worker per 10 queued jobs
        address: redis:6379
```

Scale to zero during off-hours. Scale to 50 during the morning email send. Pay only for what you use.
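KEDA also lets you tune how quickly it reacts and how long it waits before scaling back to zero. A sketch of the same ScaledObject with its polling and cooldown knobs set explicitly — the values here are illustrative, not recommendations:

```yaml
# Fragment of the ScaledObject spec above.
spec:
  pollingInterval: 15   # Check queue depth every 15s (KEDA default: 30)
  cooldownPeriod: 120   # Idle seconds before scaling to zero (default: 300)
  scaleTargetRef:
    name: email-worker
```

A shorter `pollingInterval` reacts faster to a filling queue; a longer `cooldownPeriod` avoids flapping between zero and one worker when jobs trickle in.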
Zero-Downtime Deployments: Rolling Updates
The default Kubernetes deployment strategy is rolling update — new pods come up before old ones go down. Configure it explicitly:
```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # Max extra pods during update
      maxUnavailable: 0    # Never reduce below desired count
```

`maxUnavailable: 0` guarantees capacity is maintained throughout the deployment. Combined with readiness probes, traffic only routes to pods that are fully initialized.
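Rolling updates also terminate old pods, so zero downtime depends on pods draining cleanly as well as starting cleanly. One common sketch, assuming your app finishes in-flight requests on SIGTERM: a short `preStop` sleep gives load balancers time to stop routing to the pod before the signal arrives. The container name and durations are illustrative:

```yaml
# Fragment of the Deployment's pod template.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # Time allowed for in-flight work
      containers:
        - name: api
          lifecycle:
            preStop:
              exec:
                # Pause briefly so endpoint removal propagates
                # before the container receives SIGTERM.
                command: ["sleep", "5"]
```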
Readiness and Liveness Probes: The Safety Net
```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health/live
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
```

Readiness: "Am I ready to receive traffic?" It fails if the database connection is not established, the cache is warming, or dependent services are unreachable. Liveness: "Am I still alive?" It fails if the process has deadlocked. Kubernetes restarts the container on liveness failure.
A /health/ready endpoint that checks actual dependencies (database ping, Redis ping) prevents the most common zero-downtime deployment failure: a new pod receiving traffic before it is initialized.
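For apps with slow initialization, Kubernetes also offers a startup probe, which holds off the liveness probe until it first succeeds and so prevents restart loops during boot. A sketch reusing the same endpoint; the thresholds are illustrative:

```yaml
# Sits alongside readinessProbe and livenessProbe in the container spec.
startupProbe:
  httpGet:
    path: /health/ready
    port: 3000
  periodSeconds: 5
  failureThreshold: 30   # Up to 150s to start before liveness takes over
```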
Pod Disruption Budgets: Protecting Availability During Cluster Operations
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2   # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: api
```

PDBs prevent cluster operations (node drains, upgrades) from taking down too many pods simultaneously. Without one, a node drain could evict all your pods at once.
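For deployments whose replica count varies widely with the HPA, expressing the budget as `maxUnavailable` scales with the current replica count, where a fixed `minAvailable` does not. An alternative sketch:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 10%   # Evict at most 10% of pods at a time
  selector:
    matchLabels:
      app: api
```

Use one form or the other, not both, in a single PDB.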
The Scaling Incident Runbook
Every team running Kubernetes should have a documented runbook for the two most common incidents:
- HPA at max replicas, still degraded: Check if the bottleneck has moved to the database. App scaling cannot fix database saturation.
- Deployment stuck, pods not becoming ready: Check readiness probe failures with `kubectl describe pod`. Almost always a missing environment variable, failed dependency connection, or misconfigured secret.