Kubernetes Pods: Resource Requests vs Limits, QoS Classes, and Sidecar Patterns
A pod is the unit of scheduling in Kubernetes. Resource requests tell the scheduler where to place a pod. Resource limits enforce runtime caps. The difference between requests and limits determines Quality of Service class — which determines which pods get killed first under memory pressure.
Resource requests vs limits: two different things
Resource requests and limits look similar in pod spec but do completely different things:
Requests: the amount of CPU/memory the scheduler reserves on a node for this container. A node is considered "full" for a given resource when the sum of all container requests equals the node's allocatable capacity. Requests don't limit actual usage — a container can use more CPU than requested.
Limits: the hard cap enforced at runtime. CPU limits throttle (slow down) a container when it exceeds the limit. Memory limits kill the container with OOMKilled when it exceeds the limit.
resources:
  requests:
    cpu: "500m"        # scheduler reserves 0.5 vCPU on the node
    memory: "512Mi"    # scheduler reserves 512Mi on the node
  limits:
    cpu: "1000m"       # container throttled if it exceeds 1 vCPU
    memory: "1Gi"      # container killed (OOMKilled) if it exceeds 1Gi
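The scheduling rule above can be modeled in a few lines: the scheduler compares the sum of *requests* against the node's allocatable capacity, and never looks at limits or actual usage. This is a simplified sketch, not the real scheduler logic:

```python
def fits_on_node(allocatable_m, scheduled_requests_m, pod_request_m):
    """Simplified scheduler fit check in CPU millicores:
    only requests count, never limits or actual usage."""
    return scheduled_requests_m + pod_request_m <= allocatable_m

# A 4000m (4 vCPU) node with 3600m already requested cannot take a 500m pod,
# even if actual CPU usage on the node is near zero.
print(fits_on_node(4000, 3600, 500))  # False
print(fits_on_node(4000, 3400, 500))  # True
```

This is why over-requesting wastes capacity: a node full of idle-but-requesting pods rejects new workloads.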
Quality of Service classes: which pods die first
Kubernetes assigns a QoS class to each pod based on its request/limit configuration. Under memory pressure, the kubelet evicts pods in QoS class order: BestEffort first, then Burstable, then Guaranteed last.
Prerequisites
- container resource requests and limits
- Linux OOM killer
- Kubernetes scheduling
Key Points
- Guaranteed: every container has requests == limits for both CPU and memory. Last to be evicted.
- Burstable: at least one container has a request or limit set, but the pod doesn't qualify as Guaranteed. Middle priority.
- BestEffort: no requests or limits set on any container. First to be evicted under pressure.
- CPU overcommit is safe (throttling). Memory overcommit causes OOMKilled — set limits carefully.
# Guaranteed QoS: requests == limits
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"       # same as request
    memory: "512Mi"   # same as request

# Burstable QoS: requests < limits
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "1Gi"

# BestEffort QoS: no resources configured at all
# (don't do this in production)
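The classification rules above can be expressed as a small function. This is a simplified model (it ignores, for example, that Kubernetes defaults requests to limits when only limits are set), using dicts of millicores/Mi values:

```python
def qos_class(containers):
    """Classify a pod from its containers' resources (simplified model).
    Each container is a dict like {"requests": {...}, "limits": {...}}."""
    def empty(c):
        return not c.get("requests") and not c.get("limits")

    # BestEffort: no container has any request or limit
    if all(empty(c) for c in containers):
        return "BestEffort"

    def guaranteed(c):
        req, lim = c.get("requests", {}), c.get("limits", {})
        return all(res in req and res in lim and req[res] == lim[res]
                   for res in ("cpu", "memory"))

    # Guaranteed: every container has cpu+memory requests == limits
    if all(guaranteed(c) for c in containers):
        return "Guaranteed"

    # Anything in between is Burstable
    return "Burstable"

print(qos_class([{"requests": {"cpu": 500, "memory": 512},
                  "limits":   {"cpu": 500, "memory": 512}}]))   # Guaranteed
print(qos_class([{"requests": {"cpu": 200, "memory": 256},
                  "limits":   {"cpu": 1000, "memory": 1024}}])) # Burstable
print(qos_class([{}]))                                          # BestEffort
```

Note that a single container with mismatched requests/limits demotes the whole pod from Guaranteed to Burstable.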
For production workloads: set memory requests and limits equal (Guaranteed QoS) for critical pods. This prevents them from being evicted during memory pressure and prevents memory limit surprises. Set CPU requests conservatively and limits higher — CPU throttling is recoverable, OOMKilled is not.
Init containers: prerequisite tasks before the main container
Init containers run to completion, in order, before the main containers start. They share the same pod volumes but run sequentially. If an init container fails, the kubelet retries it until it succeeds (unless restartPolicy is Never, in which case the pod fails).
spec:
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      command: ['sh', '-c',
        'until nc -z postgres-service 5432; do echo waiting for database; sleep 2; done']
    - name: run-migrations
      image: myapp:1.2.3
      command: ["python", "manage.py", "migrate"]
      env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
  containers:
    - name: api
      image: myapp:1.2.3
      ports:
        - containerPort: 8080
Init containers are the correct way to:
- Wait for dependent services (database, message queue) to be ready before starting
- Run database migrations before the application starts
- Clone a git repo or download config files before the main container starts
- Set up file system permissions before the main container runs
A failed init container prevents the pod from starting. This is intentional — you don't want an API pod to start if its database migration failed.
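The sequential, fail-fast behavior can be sketched as a simple loop (a simplified model; the real kubelet retries failed init containers per restartPolicy rather than returning):

```python
def run_init_containers(steps):
    """Run init steps in order; any failure prevents main containers
    from starting. steps is a list of (name, callable-returning-bool)."""
    for name, step in steps:
        if not step():
            return f"pod blocked: init container {name!r} failed"
    return "main containers start"

# Mirrors the example above: wait-for-db succeeds, run-migrations fails,
# so the api container never starts.
print(run_init_containers([("wait-for-db", lambda: True),
                           ("run-migrations", lambda: False)]))
print(run_init_containers([("wait-for-db", lambda: True),
                           ("run-migrations", lambda: True)]))
```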
The sidecar pattern: running multiple containers in one pod
Containers in the same pod share network namespace (same IP, localhost communication) and can share volumes. This enables the sidecar pattern: a helper container augments the main container.
Common sidecars:
Log shipping: the main container writes logs to a shared volume; the sidecar tails and ships them to a log aggregator:
spec:
  volumes:
    - name: log-storage
      emptyDir: {}
  containers:
    - name: api
      image: myapp:1.2.3
      volumeMounts:
        - name: log-storage
          mountPath: /var/log/app
    - name: log-shipper
      image: fluentd:v1.16
      volumeMounts:
        - name: log-storage
          mountPath: /var/log/app
          readOnly: true
      resources:
        requests:
          cpu: "50m"
          memory: "64Mi"
Service mesh proxy (Envoy/Linkerd): injected automatically by the mesh admission webhook. Intercepts all network traffic in/out of the pod for observability, retries, and mTLS. The main container code doesn't change.
Secret rotation: fetches secrets from Vault or AWS Secrets Manager and writes them to a shared volume. The main container reads secrets as files — when the sidecar updates them, the main container picks up changes without restart.
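On the main-container side, picking up a rotated secret without a restart usually means re-reading the mounted file when it changes. Here is a minimal sketch using file modification time to avoid re-reading on every request (a real Kubernetes secret volume updates via an atomic symlink swap, which this mtime check also detects):

```python
import os
import tempfile
import time

def make_secret_reader(path):
    """Return a reader that re-reads the secret file only when its
    mtime changes, so rotated secrets are picked up without a restart."""
    state = {"mtime": None, "value": None}
    def read():
        mtime = os.stat(path).st_mtime_ns
        if mtime != state["mtime"]:
            with open(path) as f:
                state["value"] = f.read()
            state["mtime"] = mtime
        return state["value"]
    return read

# Simulate the shared volume with a temp file.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".token") as f:
    f.write("secret-v1")
    path = f.name

reader = make_secret_reader(path)
print(reader())               # secret-v1
time.sleep(0.01)
with open(path, "w") as f:    # the sidecar rotates the secret
    f.write("secret-v2")
print(reader())               # secret-v2
```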
Lifecycle hooks and probes: controlling startup and shutdown
Kubernetes has three probe types and two lifecycle hooks:
Probes:
- livenessProbe: if this fails, the container is killed and restarted. Use for detecting deadlocks.
- readinessProbe: if this fails, the pod is removed from Service endpoints and receives no traffic. Use for app readiness.
- startupProbe: until this succeeds, liveness/readiness probes aren't run. Use for slow-starting apps.
containers:
  - name: api
    image: myapp:1.2.3
    startupProbe:
      httpGet:
        path: /health
        port: 8080
      failureThreshold: 30   # 30 × 10s = 5 minutes for startup
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 15
      failureThreshold: 3    # restart after 3 consecutive failures
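The timing arithmetic in the comments above generalizes: a probe's tolerance window is failureThreshold × periodSeconds.

```python
def probe_window_seconds(failure_threshold, period_seconds):
    """Time a probe tolerates consecutive failures before acting:
    the startup probe gives up, the liveness probe restarts the container."""
    return failure_threshold * period_seconds

print(probe_window_seconds(30, 10))  # 300s: startup budget (5 minutes)
print(probe_window_seconds(3, 15))   # 45s: liveness failures before restart
```

Tune these together: a liveness window shorter than a normal GC pause or warm-up blip causes restart storms.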
Lifecycle hooks:
- postStart: runs immediately after the container starts. No guarantee it runs before the container's entrypoint.
- preStop: runs before the container is terminated. Use to gracefully drain connections.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]  # wait for load balancer to remove this pod
The terminationGracePeriodSeconds (default 30s) gives the container time to respond to SIGTERM. If it doesn't exit within that period, SIGKILL is sent. preStop runs within this window.
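Because preStop and the application's SIGTERM handling share one grace window, the budget check is simple (a sketch of the constraint, not kubelet code):

```python
def shutdown_fits(prestop_s, drain_s, grace_s=30):
    """preStop duration plus the app's SIGTERM drain time must fit
    inside terminationGracePeriodSeconds, or the app gets SIGKILL mid-drain."""
    return prestop_s + drain_s <= grace_s

print(shutdown_fits(5, 20))   # True: 5s preStop + 20s drain fits 30s
print(shutdown_fits(5, 40))   # False: raise terminationGracePeriodSeconds
```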
Pod vs ECS task: the key differences
Both represent a group of co-located containers:
| | Kubernetes Pod | ECS Task |
|---|---|---|
| Network model | Shared network namespace (localhost) | awsvpc: each task gets its own ENI |
| Scheduling unit | Node | EC2 instance or Fargate |
| Identity | Ephemeral (Deployment) or stable (StatefulSet) | Ephemeral |
| IAM integration | IRSA via service account | Task role directly |
| Sidecar injection | Admission webhooks (automatic) | Manual task definition |
| Storage | PVC, emptyDir, configMap, secret volumes | EFS, EBS volumes |
The Kubernetes pod model is more flexible but more complex — shared localhost enables the sidecar pattern in ways that ECS task networking makes harder. ECS is operationally simpler but offers less container composition flexibility.
Quiz (medium): A Java application pod is repeatedly getting OOMKilled. The pod has a memory limit of 1Gi. Heap dumps show the JVM using 600MB of heap. There is no obvious memory leak. What is the likely cause and fix?

Scenario: The JVM is configured with -Xmx512m. The pod's memory limit is 1Gi. The pod restarts every few hours. The application is otherwise functioning correctly.

A. The memory limit is too low for Java; Java needs at least 2Gi.
Incorrect. Java memory requirements depend on the application. The issue is more specific than a generic "2Gi minimum".

B. JVM memory is not just heap: off-heap memory (native allocations, Metaspace, thread stacks, code cache) adds 200-400MB on top of the heap, so the total JVM footprint with a 512m heap can approach 1Gi.
Correct. The JVM uses memory beyond -Xmx: Metaspace (class metadata), thread stacks (1MB default per thread), code cache (JIT-compiled code), direct ByteBuffers, and OS overhead. A JVM with a 512m heap can easily use 800-900MB total, approaching the 1Gi limit; when GC pressure spikes or the thread count grows, it tips over. Fix: set -Xmx to 60-70% of the container memory limit (e.g., -Xmx700m for a 1Gi limit), or use -XX:MaxRAMPercentage=70.0 to derive the heap size from the container limit automatically.

C. Kubernetes OOMKilled means the container exceeded the node's memory, not the pod limit.
Incorrect. OOMKilled is triggered when a container exceeds its own memory limit, which is enforced per container via cgroups, not when the node's total is exceeded.

D. The preStop hook is preventing proper JVM shutdown, causing memory accumulation.
Incorrect. preStop hooks run during termination, not during normal operation; they would not cause memory accumulation while the pod is running.

Hint: What does the JVM use memory for beyond the heap configured by -Xmx?