Pod Scheduling: Node Affinity, Tolerations, and Security Contexts
The pod spec controls more than containers and volumes. Node affinity constrains where pods are scheduled. Tolerations allow pods to land on tainted nodes. Security contexts define the privilege model. These settings live in the pod template and apply to every pod created by a workload resource.
The pod template: shared spec for all pods in a workload
Every Deployment, StatefulSet, DaemonSet, and Job embeds a pod template — the spec that gets instantiated into actual pods. The template defines containers, but also scheduling constraints, security settings, and network configuration that determine where and how those containers run.
apiVersion: apps/v1
kind: Deployment
spec:
  template:            # ← everything inside here is the pod template
    metadata:
      labels:
        app: api
    spec:
      containers: [...]
      # scheduling constraints, security, volumes all go here
Three layers of scheduling control
Concept: Kubernetes Scheduling
Kubernetes schedules pods using three layers of constraints: nodeSelector (simple label matching), nodeAffinity (expressive rules with required vs. preferred terms), and taints/tolerations (node-side rejection with pod-side exceptions). Understanding which to use avoids misconfigured pods that either don't schedule at all or schedule on the wrong nodes.
Prerequisites
- Kubernetes nodes and labels
- pod spec
- Kubernetes scheduler
Key Points
- nodeSelector: hard requirement, node must have all specified labels. Simple but inflexible.
- nodeAffinity: required (hard) or preferred (soft) rules with operators (In, NotIn, Exists, DoesNotExist, Gt, Lt).
- Taints mark nodes as unsuitable for pods. Tolerations allow specific pods to override taints.
- Pods without a toleration for a taint won't be scheduled on tainted nodes — useful for dedicated node groups.
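For comparison, the simplest mechanism is nodeSelector — a sketch, with an illustrative label key/value:

```yaml
spec:
  # Hard requirement: the node must carry every listed label
  nodeSelector:
    node-group: high-memory
```

If no node matches, the pod stays Pending; there is no "preferred" form, which is what nodeAffinity adds.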
nodeAffinity: expressive scheduling rules
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: must be satisfied for scheduling
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values: ["m6i.xlarge", "m6i.2xlarge", "c6i.xlarge"]
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a", "us-east-1b"]
      # Soft preference: scheduler prefers but doesn't require
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: node-group
            operator: In
            values: ["high-memory"]
requiredDuringSchedulingIgnoredDuringExecution — "ignored during execution" means if a node's labels change after the pod is scheduled, the pod stays running. There's no requiredDuringExecution variant that would evict running pods.
Taints and tolerations: dedicating nodes
Taints mark nodes as reserved for specific workloads. Without a matching toleration, pods won't be scheduled there.
# Taint a node group for GPU workloads
kubectl taint nodes node-gpu-1 workload=gpu:NoSchedule
# List all taints on a node
kubectl describe node node-gpu-1 | grep Taints
Only pods with matching tolerations can run on tainted nodes:
spec:
  tolerations:
  - key: "workload"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  # Also add nodeAffinity to actively target these nodes
  # (tolerations allow scheduling on tainted nodes, but don't require it)
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: workload
            operator: In
            values: ["gpu"]
Taint effects:
- NoSchedule: new pods without a toleration won't be scheduled. Existing pods are unaffected.
- PreferNoSchedule: the scheduler avoids the node but will use it if necessary.
- NoExecute: existing pods without a toleration are evicted. New pods without a toleration won't schedule.
EKS, like any Kubernetes cluster, applies system taints automatically — the node lifecycle controller adds the node.kubernetes.io/not-ready:NoExecute taint when a node goes unhealthy and removes it when the node recovers. This is how Kubernetes evicts pods from failed nodes.
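A pod can also bound how long it tolerates a NoExecute taint with tolerationSeconds. A sketch — the 60-second value is illustrative (Kubernetes injects similar tolerations for not-ready/unreachable with a 300-second default):

```yaml
spec:
  tolerations:
  # Stay on a not-ready node for up to 60 seconds, then get evicted
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 60
```

Shortening this window makes pods fail over from unhealthy nodes faster, at the cost of more churn during transient network blips.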
Security contexts: pod and container privilege settings
Security contexts define the privilege model at the pod level (applies to all containers) and container level (overrides pod-level):
spec:
  # Pod-level security context
  securityContext:
    runAsNonRoot: true        # reject containers that would run as root
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 2000             # supplemental group for volume mounts
    seccompProfile:
      type: RuntimeDefault    # seccomp filtering
  containers:
  - name: api
    image: myapp:1.2.3
    # Container-level overrides
    securityContext:
      allowPrivilegeEscalation: false   # prevent sudo/setuid escalation
      readOnlyRootFilesystem: true      # prevent writes to container root FS
      capabilities:
        drop: ["ALL"]                   # drop all Linux capabilities
        add: ["NET_BIND_SERVICE"]       # add only what's needed (bind port < 1024)
readOnlyRootFilesystem: true is a strong defense-in-depth measure. If a container is compromised, attackers can't write to the filesystem. Applications that write to local files need emptyDir or PVC mounts for their writable paths.
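A sketch of that pattern — the /tmp mount path is an assumption for illustration; use whatever paths your application actually writes to:

```yaml
spec:
  containers:
  - name: api
    image: myapp:1.2.3
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    # Writable scratch space; the rest of the root FS stays read-only
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: tmp
    emptyDir: {}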
Pod anti-affinity: spreading pods across nodes and zones
To prevent all replicas of a Deployment from landing on the same node (and going down together during node failure), use pod anti-affinity:
spec:
  affinity:
    podAntiAffinity:
      # Hard: never two api pods on the same node
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["api"]
        topologyKey: "kubernetes.io/hostname"
      # Soft: prefer spreading across zones
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["api"]
          topologyKey: "topology.kubernetes.io/zone"
topologyKey: kubernetes.io/hostname means "no two pods with the matching label on the same node". topologyKey: topology.kubernetes.io/zone means "prefer not two pods in the same AZ".
Topology spread constraints (a newer mechanism) offer more control:
spec:
  topologySpreadConstraints:
  - maxSkew: 1                  # max difference in pod count across zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api
maxSkew: 1 with DoNotSchedule ensures pods spread as evenly as possible across AZs, refusing to create imbalance of more than 1 pod. ScheduleAnyway is softer — it schedules even if the constraint can't be satisfied.
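Multiple constraints can be combined on one pod — a sketch that spreads hard across zones and softly across nodes (app: api label carried over from the earlier examples):

```yaml
spec:
  topologySpreadConstraints:
  # Hard: zone pod counts must stay within 1 of each other
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api
  # Soft: prefer an even spread across nodes, but schedule anyway if impossible
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api
```

Unlike required pod anti-affinity, the ScheduleAnyway node-level constraint never blocks scaling past the node count.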
A Deployment has 6 replicas with requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity using topologyKey: kubernetes.io/hostname. The cluster has 4 nodes. What happens when you try to scale the Deployment to 7 replicas?
Difficulty: medium. All nodes are healthy and none of them are tainted.
A. The 7th pod schedules on a node that already has an api pod, violating anti-affinity
Incorrect. requiredDuringScheduling means the constraint is a hard requirement; the scheduler will not violate it.
B. Scaling to 7 replicas succeeds — Kubernetes distributes pods optimally across 4 nodes
Incorrect. With requiredDuringScheduling anti-affinity and topologyKey hostname, each pod requires a unique node; 7 pods cannot satisfy this with 4 nodes.
C. The 7th pod stays Pending indefinitely — required anti-affinity cannot be satisfied with only 4 available nodes
Correct! requiredDuringSchedulingIgnoredDuringExecution is a hard constraint: with topologyKey: kubernetes.io/hostname, each api pod must land on a unique node, so at most 4 pods can run on 4 nodes. In this scenario the 5th and 6th replicas were already Pending, and the 7th joins them — any pod beyond the node count stays Pending.
D. Kubernetes automatically adds a new node to accommodate the 7th pod
Incorrect. Kubernetes doesn't add nodes by itself — that's the Cluster Autoscaler's job, a separate component. The anti-affinity constraint leaves pods Pending, which could trigger the Cluster Autoscaler if it is configured.
Hint: Required anti-affinity with topologyKey: kubernetes.io/hostname means at most one matching pod per node. How many pods can you run on 4 nodes?