Kubernetes Deployments: Rolling Updates, Readiness Probes, and Why Rollbacks Fail Silently
A Deployment manages the lifecycle of stateless pods. The rolling update strategy, readiness probe integration, and PodDisruptionBudgets are the knobs that determine whether your deploy causes downtime — and whether a rollback actually works.
What a Deployment manages
A Deployment is a Kubernetes resource that manages a ReplicaSet, which manages Pods. You describe the desired state — image version, replica count, resource limits — and the deployment controller continuously reconciles actual state with desired state.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api
        image: myrepo/api:v2.1.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
Rolling update mechanics
When you change the pod template (new image, env var, resource limit), Kubernetes creates a new ReplicaSet and gradually shifts pods from the old ReplicaSet to the new one. The rate is controlled by maxSurge and maxUnavailable.
Prerequisites
- pods and ReplicaSets
- service and selector
- health probes
Key Points
- maxSurge: how many extra pods above replicas can exist during the rollout (defaults to 25%).
- maxUnavailable: how many pods below replicas can be down during the rollout (defaults to 25%).
- A pod is considered 'available' only after its readiness probe passes. Without a readiness probe, pods are immediately considered available.
- The old ReplicaSet is scaled down only after new pods become ready. With no readiness probe, broken pods look ready — the rollout succeeds while traffic is being sent to failing pods.
maxSurge and maxUnavailable: the tradeoff
With 4 replicas and default settings (25%):
- maxSurge = 1 (up to 5 pods may exist during the rollout)
- maxUnavailable = 1 (at least 3 pods must remain available)
Kubernetes starts one new pod. When it becomes ready, it terminates one old pod. Continues until all old pods are replaced. At no point are fewer than 3 pods available.
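The percentage defaults convert to pod counts with different rounding rules: maxSurge rounds up and maxUnavailable rounds down, so a percentage-based rollout can always make progress. A small sketch of that arithmetic (the function name is mine, not a Kubernetes API):

```python
import math

def rollout_bounds(replicas: int, max_surge_pct: int = 25, max_unavailable_pct: int = 25):
    """Convert percentage settings into absolute pod-count bounds.

    Kubernetes rounds maxSurge up and maxUnavailable down.
    """
    surge = math.ceil(replicas * max_surge_pct / 100)               # extra pods allowed
    unavailable = math.floor(replicas * max_unavailable_pct / 100)  # pods allowed down
    return {
        "max_pods": replicas + surge,
        "min_available": replicas - unavailable,
    }

print(rollout_bounds(4))   # {'max_pods': 5, 'min_available': 3}
print(rollout_bounds(10))  # surge = ceil(2.5) = 3, unavailable = floor(2.5) = 2
```

With 10 replicas the asymmetric rounding becomes visible: the rollout may run 13 pods at its peak but never fewer than 8 available.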
For zero-downtime deploys with a readiness probe, the important setting is maxUnavailable: 0. This prevents any pod from being terminated before its replacement is ready:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1        # Allow one extra pod temporarily
    maxUnavailable: 0  # Never take a pod down before replacement is ready
With maxUnavailable: 0, rollout is slower (sequential) but guarantees capacity is maintained throughout.
The readiness probe trap
The readiness probe is what determines when a new pod is "ready" to receive traffic. A pod without a readiness probe is considered ready immediately after the container starts — before your application has actually finished initializing.
readinessProbe:
  httpGet:
    path: /ready             # Your app's readiness endpoint
    port: 8080
  initialDelaySeconds: 5     # Wait before first check
  periodSeconds: 10          # Check every 10s
  failureThreshold: 3        # 3 failures → not ready
  successThreshold: 1        # 1 success → ready
If your application takes 15 seconds to warm up (connect to the database, load caches) but initialDelaySeconds is 2, the first checks fail and the pod simply stays out of rotation until a probe succeeds, around the 22-second mark with this configuration. That is the probe working as intended, but it slows the rollout, and the same misconfiguration on a liveness probe would restart the container in a loop. Set initialDelaySeconds to cover worst-case startup, or use a startupProbe for slow-starting applications.
The liveness probe is different: a failing liveness probe causes the container to be restarted. Do not confuse the two:
- Readiness: should this pod receive traffic? Fails → pod removed from Service endpoints, not restarted.
- Liveness: is this container alive? Fails → container restarted.
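One pattern that avoids the trap: a readiness endpoint that reports 503 until initialization has actually finished, separate from the always-200 liveness endpoint. A minimal sketch using only the standard library (the paths and check names are illustrative, not from a specific application):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

# Set once startup work (DB connect, cache warm) completes.
app_ready = threading.Event()

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ready":
            # Readiness: 200 only after initialization has finished.
            status = 200 if app_ready.is_set() else 503
        elif self.path == "/health":
            # Liveness: the process is up and serving HTTP.
            status = 200
        else:
            status = 404
        self.send_response(status)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep probe traffic out of the application logs

def warm_up():
    # ... connect to the database, load caches ...
    app_ready.set()  # only now does /ready start returning 200

# Typical startup (not run here):
#   threading.Thread(target=warm_up, daemon=True).start()
#   HTTPServer(("", 8080), ProbeHandler).serve_forever()
```

Point the readinessProbe at /ready and the livenessProbe at /health; a slow warm-up then delays traffic without triggering restarts.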
Why rollbacks fail silently
kubectl rollout undo reverts to the previous ReplicaSet. This looks like a rollback, but there is a common failure mode: if the previous version had a configuration issue (wrong image, bad env var), the "rollback" deploys the same broken configuration.
Worse: if the new version starts cleanly but fails real requests (crashing after init, returning 500s on application routes while the health endpoint still returns 200), the deployment succeeds — pods are running and marked ready — but traffic is failing.
# Check rollout status
kubectl rollout status deployment/api-server
# View rollout history
kubectl rollout history deployment/api-server
# Roll back to previous revision
kubectl rollout undo deployment/api-server
# Roll back to a specific revision
kubectl rollout undo deployment/api-server --to-revision=3
The only reliable way to ensure a rollback works is a readiness probe that accurately reflects the application's ability to serve traffic. A probe that always returns 200 is worse than no probe — it gives false confidence.
PodDisruptionBudget: protecting availability during node maintenance
Rolling updates respect maxUnavailable, but node drains (for cluster upgrades, spot interruptions) do not respect Deployment settings by default — they evict pods without regard for service availability.
A PodDisruptionBudget (PDB) limits how many pods of a deployment can be disrupted simultaneously:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 3  # always keep at least 3 pods running
  selector:
    matchLabels:
      app: api-server
With 4 replicas and minAvailable: 3, a node drain can evict one pod at a time. The second eviction is blocked until the first pod is rescheduled and ready elsewhere.
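The eviction rule reduces to one line of arithmetic: voluntary disruptions allowed right now equal the currently healthy pods minus minAvailable (a sketch; the function name is mine):

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """How many voluntary evictions a PDB permits at this moment.

    An eviction is allowed only while enough pods would remain
    healthy to satisfy minAvailable.
    """
    return max(0, healthy_pods - min_available)

# 4 healthy replicas, minAvailable: 3 -> one eviction allowed.
print(allowed_disruptions(4, 3))  # 1

# After that eviction, until the replacement is ready elsewhere:
print(allowed_disruptions(3, 3))  # 0 -> the second drain blocks
```

This is why a drain can stall indefinitely if the cluster lacks capacity to reschedule the evicted pod: the count never rises back above zero.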
Without a PDB, a cluster upgrade that drains two nodes simultaneously will evict all 4 pods at once. Combine PDBs with pod anti-affinity rules (spread pods across nodes/AZs) for production services.
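An anti-affinity rule for the spread mentioned above might look like the following sketch, added under the pod template's spec (preferred rather than required, so pods stay schedulable even when perfect spreading is impossible):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: api-server
        topologyKey: kubernetes.io/hostname  # spread across nodes
        # use topology.kubernetes.io/zone to spread across AZs
```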
Kubernetes Deployment vs Amazon ECS Service
Both manage groups of container replicas, handle rolling updates, and integrate with load balancers. The operational model differs.
Kubernetes Deployment:
- Platform-agnostic — runs on any Kubernetes cluster (EKS, GKE, self-managed)
- Fine-grained control: maxSurge, maxUnavailable, readinessProbe, PDB
- Rollback via rollout history (ReplicaSet snapshots)
- Requires setting up Ingress or Service type=LoadBalancer for external traffic
- HPA scales based on CPU, memory, or custom metrics

Amazon ECS Service:
- AWS-native: deep integration with ALB, IAM, CloudWatch, Service Discovery
- Deployment circuit breaker: auto-rollback if new tasks fail health checks
- No concept of ReplicaSet history — rollback means redeploying old task definition
- ALB integration is automatic — task registration/deregistration handled by ECS
- Fargate removes node management entirely
ECS is simpler for AWS-only deployments: fewer abstractions and better native AWS integration. Kubernetes is appropriate when you need portability, more control over scheduling, or multi-cloud support. On AWS, EKS adds operational overhead compared with ECS without clear benefit unless you have already invested in Kubernetes tooling.
A rolling deployment completes successfully (kubectl rollout status shows 'successfully rolled out'), but users report increased error rates immediately after the deploy. What is the most likely gap in the deployment configuration?

Scenario: the new version starts up but takes 30 seconds to connect to the database and warm its cache. The readiness probe checks /health, which returns 200 immediately on startup, before the app is truly ready.

Hint: the deployment thinks the pod is ready. What determines 'ready' in Kubernetes, and why might it be wrong?

A. maxUnavailable should be set to 0 to prevent any downtime
Incorrect. maxUnavailable: 0 ensures old pods stay until new pods are ready. But 'ready' is determined by the readiness probe — if the probe returns 200 before the app is actually ready, maxUnavailable: 0 still lets broken pods receive traffic.

B. The readiness probe endpoint returns 200 before the application is actually ready to serve traffic
Correct. A readiness probe that passes immediately on startup tells Kubernetes the pod is ready before it actually is, so traffic routes to the new pod while it is still warming up. The fix: make /health (or a separate /ready endpoint) return non-200 until the database connection is established and caches are loaded. initialDelaySeconds also needs to cover the worst-case startup time.

C. The deployment needs a liveness probe to restart failing pods
Incorrect. Liveness probes restart containers that become unresponsive after startup. The problem here is the pod accepting traffic before it is ready at startup, which a liveness probe does not address.

D. The replica count is too low to handle the rolling update
Incorrect. Replica count affects capacity during the rollout, not whether individual pods serve traffic correctly. The issue is the readiness signal, not the number of pods.