Kubernetes Pods: Resource Requests vs Limits, QoS Classes, and Sidecar Patterns
A pod is the unit of scheduling in Kubernetes. Resource requests tell the scheduler where to place a pod. Resource limits enforce runtime caps. The difference between requests and limits determines Quality of Service class — which determines which pods get killed first under memory pressure.
Resource requests vs limits: two different things
Resource requests and limits look similar in pod spec but do completely different things:
Requests: the amount of CPU/memory the scheduler reserves on a node for this container. A node is considered "full" for a given resource when the sum of all container requests equals the node's allocatable capacity. Requests don't limit actual usage — a container can use more CPU than requested.
Limits: the hard cap enforced at runtime. CPU limits throttle (slow down) a container when it exceeds the limit. Memory limits kill the container with OOMKilled when it exceeds the limit.
resources:
  requests:
    cpu: "500m"        # scheduler reserves 0.5 vCPU on the node
    memory: "512Mi"    # scheduler reserves 512Mi on the node
  limits:
    cpu: "1000m"       # container throttled if it exceeds 1 vCPU
    memory: "1Gi"      # container killed (OOMKilled) if it exceeds 1Gi
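The scheduling rule above can be modeled in a few lines: the scheduler compares the sum of *requests* against the node's allocatable capacity, and never looks at limits or actual usage. This is a simplified sketch, not the real scheduler logic:

```python
def fits_on_node(allocatable_m, scheduled_requests_m, pod_request_m):
    """Simplified scheduler fit check in CPU millicores:
    only requests count, never limits or actual usage."""
    return scheduled_requests_m + pod_request_m <= allocatable_m

# A 4000m (4 vCPU) node with 3600m already requested cannot take a 500m pod,
# even if actual CPU usage on the node is near zero.
print(fits_on_node(4000, 3600, 500))  # False
print(fits_on_node(4000, 3400, 500))  # True
```

This is why over-requesting wastes capacity: a node full of idle-but-requesting pods rejects new workloads.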
Quality of Service classes: which pods die first
Kubernetes assigns a QoS class to each pod based on its request/limit configuration. Under memory pressure, the kubelet evicts pods in QoS class order: BestEffort first, then Burstable, then Guaranteed last.
Prerequisites
- container resource requests and limits
- Linux OOM killer
- Kubernetes scheduling
Key Points
- Guaranteed: every container has requests == limits for both CPU and memory. Last to be evicted.
- Burstable: at least one container has a request or limit set, but the pod doesn't qualify as Guaranteed. Middle priority.
- BestEffort: no requests or limits set on any container. First to be evicted under pressure.
- CPU overcommit is safe (throttling). Memory overcommit causes OOMKilled — set limits carefully.
# Guaranteed QoS: requests == limits
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"       # same as request
    memory: "512Mi"   # same as request

# Burstable QoS: requests < limits
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "1Gi"

# BestEffort QoS: no resources configured at all
# (don't do this in production)
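The classification rules above can be expressed as a small function. This is a simplified model (it ignores, for example, that Kubernetes defaults requests to limits when only limits are set), using dicts of millicores/Mi values:

```python
def qos_class(containers):
    """Classify a pod from its containers' resources (simplified model).
    Each container is a dict like {"requests": {...}, "limits": {...}}."""
    def empty(c):
        return not c.get("requests") and not c.get("limits")

    # BestEffort: no container has any request or limit
    if all(empty(c) for c in containers):
        return "BestEffort"

    def guaranteed(c):
        req, lim = c.get("requests", {}), c.get("limits", {})
        return all(res in req and res in lim and req[res] == lim[res]
                   for res in ("cpu", "memory"))

    # Guaranteed: every container has cpu+memory requests == limits
    if all(guaranteed(c) for c in containers):
        return "Guaranteed"

    # Anything in between is Burstable
    return "Burstable"

print(qos_class([{"requests": {"cpu": 500, "memory": 512},
                  "limits":   {"cpu": 500, "memory": 512}}]))   # Guaranteed
print(qos_class([{"requests": {"cpu": 200, "memory": 256},
                  "limits":   {"cpu": 1000, "memory": 1024}}])) # Burstable
print(qos_class([{}]))                                          # BestEffort
```

Note that a single container with mismatched requests/limits demotes the whole pod from Guaranteed to Burstable.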
For production workloads: set memory requests and limits equal (Guaranteed QoS) for critical pods. This prevents them from being evicted during memory pressure and prevents memory limit surprises. Set CPU requests conservatively and limits higher — CPU throttling is recoverable, OOMKilled is not.
Init containers: prerequisite tasks before the main container
Init containers run to completion, in order, before the main containers start. They share the same pod volumes but run sequentially. If an init container fails, the kubelet retries it until it succeeds (unless restartPolicy is Never, in which case the pod fails).
spec:
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      command: ['sh', '-c',
        'until nc -z postgres-service 5432; do echo waiting for database; sleep 2; done']
    - name: run-migrations
      image: myapp:1.2.3
      command: ["python", "manage.py", "migrate"]
      env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
  containers:
    - name: api
      image: myapp:1.2.3
      ports:
        - containerPort: 8080
Init containers are the correct way to:
- Wait for dependent services (database, message queue) to be ready before starting
- Run database migrations before the application starts
- Clone a git repo or download config files before the main container starts
- Set up file system permissions before the main container runs
A failed init container prevents the pod from starting. This is intentional — you don't want an API pod to start if its database migration failed.
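The sequential, fail-fast behavior can be sketched as a simple loop (a simplified model; the real kubelet retries failed init containers per restartPolicy rather than returning):

```python
def run_init_containers(steps):
    """Run init steps in order; any failure prevents main containers
    from starting. steps is a list of (name, callable-returning-bool)."""
    for name, step in steps:
        if not step():
            return f"pod blocked: init container {name!r} failed"
    return "main containers start"

# Mirrors the example above: wait-for-db succeeds, run-migrations fails,
# so the api container never starts.
print(run_init_containers([("wait-for-db", lambda: True),
                           ("run-migrations", lambda: False)]))
print(run_init_containers([("wait-for-db", lambda: True),
                           ("run-migrations", lambda: True)]))
```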
The sidecar pattern: running multiple containers in one pod
Containers in the same pod share network namespace (same IP, localhost communication) and can share volumes. This enables the sidecar pattern: a helper container augments the main container.
Common sidecars:
Log shipping: the main container writes logs to a shared volume; the sidecar tails and ships them to a log aggregator:
spec:
  volumes:
    - name: log-storage
      emptyDir: {}
  containers:
    - name: api
      image: myapp:1.2.3
      volumeMounts:
        - name: log-storage
          mountPath: /var/log/app
    - name: log-shipper
      image: fluentd:v1.16
      volumeMounts:
        - name: log-storage
          mountPath: /var/log/app
          readOnly: true
      resources:
        requests:
          cpu: "50m"
          memory: "64Mi"
Service mesh proxy (Envoy/Linkerd): injected automatically by the mesh admission webhook. Intercepts all network traffic in/out of the pod for observability, retries, and mTLS. The main container code doesn't change.
Secret rotation: fetches secrets from Vault or AWS Secrets Manager and writes them to a shared volume. The main container reads secrets as files — when the sidecar updates them, the main container picks up changes without restart.
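On the main-container side, picking up a rotated secret without a restart usually means re-reading the mounted file when it changes. Here is a minimal sketch using file modification time to avoid re-reading on every request (a real Kubernetes secret volume updates via an atomic symlink swap, which this mtime check also detects):

```python
import os
import tempfile
import time

def make_secret_reader(path):
    """Return a reader that re-reads the secret file only when its
    mtime changes, so rotated secrets are picked up without a restart."""
    state = {"mtime": None, "value": None}
    def read():
        mtime = os.stat(path).st_mtime_ns
        if mtime != state["mtime"]:
            with open(path) as f:
                state["value"] = f.read()
            state["mtime"] = mtime
        return state["value"]
    return read

# Simulate the shared volume with a temp file.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".token") as f:
    f.write("secret-v1")
    path = f.name

reader = make_secret_reader(path)
print(reader())               # secret-v1
time.sleep(0.01)
with open(path, "w") as f:    # the sidecar rotates the secret
    f.write("secret-v2")
print(reader())               # secret-v2
```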
Lifecycle hooks and probes: controlling startup and shutdown
Kubernetes has three probe types and two lifecycle hooks:
Probes:
- livenessProbe: if this fails, the container is killed and restarted. Use for detecting deadlocks.
- readinessProbe: if this fails, the pod is removed from Service endpoints and receives no traffic. Use for app readiness.
- startupProbe: until this succeeds, liveness/readiness probes aren't run. Use for slow-starting apps.
containers:
  - name: api
    image: myapp:1.2.3
    startupProbe:
      httpGet:
        path: /health
        port: 8080
      failureThreshold: 30   # 30 × 10s = 5 minutes for startup
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 15
      failureThreshold: 3    # restart after 3 consecutive failures
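The timing arithmetic in the comments above generalizes: a probe's tolerance window is failureThreshold × periodSeconds.

```python
def probe_window_seconds(failure_threshold, period_seconds):
    """Time a probe tolerates consecutive failures before acting:
    the startup probe gives up, the liveness probe restarts the container."""
    return failure_threshold * period_seconds

print(probe_window_seconds(30, 10))  # 300s: startup budget (5 minutes)
print(probe_window_seconds(3, 15))   # 45s: liveness failures before restart
```

Tune these together: a liveness window shorter than a normal GC pause or warm-up blip causes restart storms.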
Lifecycle hooks:
- postStart: runs immediately after the container starts. No guarantee it runs before the container's entrypoint.
- preStop: runs before the container is terminated. Use to gracefully drain connections.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]  # wait for load balancer to remove this pod
The terminationGracePeriodSeconds (default 30s) gives the container time to respond to SIGTERM. If it doesn't exit within that period, SIGKILL is sent. preStop runs within this window.
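Because preStop and the application's SIGTERM handling share one grace window, the budget check is simple (a sketch of the constraint, not kubelet code):

```python
def shutdown_fits(prestop_s, drain_s, grace_s=30):
    """preStop duration plus the app's SIGTERM drain time must fit
    inside terminationGracePeriodSeconds, or the app gets SIGKILL mid-drain."""
    return prestop_s + drain_s <= grace_s

print(shutdown_fits(5, 20))   # True: 5s preStop + 20s drain fits 30s
print(shutdown_fits(5, 40))   # False: raise terminationGracePeriodSeconds
```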
Pod vs ECS task: the key differences
Both represent a group of co-located containers:
| | Kubernetes Pod | ECS Task |
|---|---|---|
| Network model | Shared network namespace (localhost) | awsvpc: each task gets its own ENI |
| Scheduling unit | Node | EC2 instance or Fargate |
| Identity | Ephemeral (Deployment) or stable (StatefulSet) | Ephemeral |
| IAM integration | IRSA via service account | Task role directly |
| Sidecar injection | Admission webhooks (automatic) | Manual task definition |
| Storage | PVC, emptyDir, configMap, secret volumes | EFS, EBS volumes |
The Kubernetes pod model is more flexible but more complex — shared localhost enables the sidecar pattern in ways that ECS task networking makes harder. ECS is operationally simpler but offers less container composition flexibility.
Quiz (medium): A Java application pod is repeatedly getting OOMKilled. The pod has a memory limit of 1Gi. Heap dumps show the JVM using 600MB of heap. There is no obvious memory leak. What is the likely cause and fix?

Scenario: The JVM is configured with -Xmx512m. The pod's memory limit is 1Gi. The pod restarts every few hours. The application is otherwise functioning correctly.

A. The memory limit is too low for Java; Java needs at least 2Gi.
Incorrect. Java memory requirements depend on the application. The issue is more specific than a generic "2Gi minimum".

B. JVM memory is not just heap: off-heap memory (native allocations, Metaspace, thread stacks, code cache) adds 200-400MB on top of the heap, so the total JVM footprint with a 512m heap can approach 1Gi.
Correct. The JVM uses memory beyond -Xmx: Metaspace (class metadata), thread stacks (1MB default per thread), code cache (JIT-compiled code), direct ByteBuffers, and OS overhead. A JVM with a 512m heap can easily use 800-900MB total, approaching the 1Gi limit; when GC pressure spikes or the thread count grows, it tips over. Fix: set -Xmx to 60-70% of the container memory limit (e.g., -Xmx700m for a 1Gi limit), or use -XX:MaxRAMPercentage=70.0 to derive the heap size from the container limit automatically.

C. Kubernetes OOMKilled means the container exceeded the node's memory, not the pod limit.
Incorrect. OOMKilled is triggered when a container exceeds its own memory limit, which is enforced per container via cgroups, not when the node's total is exceeded.

D. The preStop hook is preventing proper JVM shutdown, causing memory accumulation.
Incorrect. preStop hooks run during termination, not during normal operation; they would not cause memory accumulation while the pod is running.

Hint: What does the JVM use memory for beyond the heap configured by -Xmx?