Why Kubernetes for Scaling
Kubernetes does not make scaling easy. It makes scaling possible at a level of reliability and control that manual server management cannot match. The investment in learning the system pays back in the first major traffic event you survive without intervention.
Horizontal Pod Autoscaler (HPA): CPU Is Not Enough
The basic HPA scales on CPU utilization. This works — but CPU is a lagging indicator. By the time CPU hits 80%, your users are already experiencing latency. Scale earlier and on metrics that matter to your workload.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # Scale at 60%, not 80%
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```

Target 60% CPU, not 80%. Kubernetes takes 30–60 seconds to spin up new pods; you need headroom for the scaling lag.
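If your cluster exposes custom metrics (for example via the Prometheus Adapter), the same `autoscaling/v2` HPA can scale on request rate instead of CPU. A sketch, assuming a per-pod metric named `http_requests_per_second` is served by the custom metrics API — the metric name and target value are illustrative:

```yaml
# Fragment for the HPA's spec.metrics list above.
# Assumes a metrics adapter (e.g. Prometheus Adapter) exposes
# the hypothetical per-pod metric http_requests_per_second.
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"   # Aim for ~100 req/s per pod
```

Request rate is a leading indicator: it rises the moment traffic arrives, before CPU catches up.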
KEDA: Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) scales based on external metrics — queue depth, Kafka lag, Redis list length, HTTP request rate. This is how you scale workers proportional to actual work, not proxy metrics.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: email-worker-scaler
spec:
  scaleTargetRef:
    name: email-worker
  minReplicaCount: 0   # Scale to zero when queue is empty
  maxReplicaCount: 50
  triggers:
    - type: redis
      metadata:
        listName: bull:emails:wait
        listLength: "10"   # One worker per 10 queued jobs
        address: redis:6379
```

Scale to zero during off-hours. Scale to 50 during the morning email send. Pay only for what you use.
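KEDA also lets you tune how quickly it reacts and how long it waits before scaling back to zero. A sketch of the same ScaledObject with its polling and cooldown knobs set explicitly — the values here are illustrative, not recommendations:

```yaml
# Fragment of the ScaledObject spec above.
spec:
  pollingInterval: 15   # Check queue depth every 15s (KEDA default: 30)
  cooldownPeriod: 120   # Idle seconds before scaling to zero (default: 300)
  scaleTargetRef:
    name: email-worker
```

A shorter `pollingInterval` reacts faster to a filling queue; a longer `cooldownPeriod` avoids flapping between zero and one worker when jobs trickle in.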
Zero-Downtime Deployments: Rolling Updates
The default Kubernetes deployment strategy is rolling update — new pods come up before old ones go down. Configure it explicitly:
```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # Max extra pods during update
      maxUnavailable: 0    # Never reduce below desired count
```

`maxUnavailable: 0` guarantees capacity is maintained throughout the deployment. Combined with readiness probes, traffic only routes to pods that are fully initialized.
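Rolling updates also terminate old pods, so zero downtime depends on pods draining cleanly as well as starting cleanly. One common sketch, assuming your app finishes in-flight requests on SIGTERM: a short `preStop` sleep gives load balancers time to stop routing to the pod before the signal arrives. The container name and durations are illustrative:

```yaml
# Fragment of the Deployment's pod template.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # Time allowed for in-flight work
      containers:
        - name: api
          lifecycle:
            preStop:
              exec:
                # Pause briefly so endpoint removal propagates
                # before the container receives SIGTERM.
                command: ["sleep", "5"]
```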
Readiness and Liveness Probes: The Safety Net
```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health/live
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
```

Readiness: "Am I ready to receive traffic?" It fails if the database connection is not established, the cache is warming, or dependent services are unreachable. Liveness: "Am I still alive?" It fails if the process has deadlocked. Kubernetes restarts the container on liveness failure.
A /health/ready endpoint that checks actual dependencies (database ping, Redis ping) prevents the most common zero-downtime deployment failure: a new pod receiving traffic before it is initialized.
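For apps with slow initialization, Kubernetes also offers a startup probe, which holds off the liveness probe until it first succeeds and so prevents restart loops during boot. A sketch reusing the same endpoint; the thresholds are illustrative:

```yaml
# Sits alongside readinessProbe and livenessProbe in the container spec.
startupProbe:
  httpGet:
    path: /health/ready
    port: 3000
  periodSeconds: 5
  failureThreshold: 30   # Up to 150s to start before liveness takes over
```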
Pod Disruption Budgets: Protecting Availability During Cluster Operations
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2   # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: api
```

PDBs prevent cluster operations (node drains, upgrades) from taking down too many pods simultaneously. Without one, a node drain could evict all your pods at once.
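For deployments whose replica count varies widely with the HPA, expressing the budget as `maxUnavailable` scales with the current replica count, where a fixed `minAvailable` does not. An alternative sketch:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 10%   # Evict at most 10% of pods at a time
  selector:
    matchLabels:
      app: api
```

Use one form or the other, not both, in a single PDB.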
The Scaling Incident Runbook
Every team running Kubernetes should have a documented runbook for the two most common incidents:
- HPA at max replicas, still degraded: Check if the bottleneck has moved to the database. App scaling cannot fix database saturation.
- Deployment stuck, pods not becoming ready: Check readiness probe failures with `kubectl describe pod`. Almost always a missing environment variable, failed dependency connection, or misconfigured secret.