Pod Scheduling: Node Affinity, Tolerations, and Security Contexts


The pod spec controls more than containers and volumes. Node affinity constrains where pods are scheduled. Tolerations allow pods to land on tainted nodes. Security contexts define the privilege model. These settings live in the pod template and apply to every pod created by a workload resource.


The pod template: shared spec for all pods in a workload

Every Deployment, StatefulSet, DaemonSet, and Job embeds a pod template — the spec that gets instantiated into actual pods. The template defines containers, but also scheduling constraints, security settings, and network configuration that determine where and how those containers run.

apiVersion: apps/v1
kind: Deployment
spec:
  template:           # ← everything inside here is the pod template
    metadata:
      labels:
        app: api
    spec:
      containers: [...]
      # scheduling constraints, security, volumes all go here

Three layers of scheduling control

Concept: Kubernetes Scheduling

Kubernetes schedules pods using three layers of constraints: nodeSelector (simple label matching), nodeAffinity (expressive rules with required vs preferred), and taints/tolerations (node-side rejection with pod-side exceptions). Understanding which to use avoids misconfigured pods that either don't schedule or schedule on the wrong nodes.

Prerequisites

  • Kubernetes nodes and labels
  • pod spec
  • Kubernetes scheduler

Key Points

  • nodeSelector: hard requirement, node must have all specified labels. Simple but inflexible.
  • nodeAffinity: required (hard) or preferred (soft) rules with operators (In, NotIn, Exists, Gt).
  • Taints mark nodes as unsuitable for pods. Tolerations allow specific pods to override taints.
  • Pods without a toleration for a taint won't be scheduled on tainted nodes — useful for dedicated node groups.
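For comparison, the simplest layer, nodeSelector, is a flat AND over node labels. A minimal sketch (disktype is a hypothetical custom label; the zone label is a standard well-known one):

spec:
  nodeSelector:       # hard requirement: node must carry ALL listed labels
    disktype: ssd
    topology.kubernetes.io/zone: us-east-1a

There are no operators and no soft preferences here; anything more expressive needs nodeAffinity, shown next.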

nodeAffinity: expressive scheduling rules

spec:
  affinity:
    nodeAffinity:
      # Hard requirement: must be satisfied for scheduling
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values: ["m6i.xlarge", "m6i.2xlarge", "c6i.xlarge"]
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a", "us-east-1b"]

      # Soft preference: scheduler prefers but doesn't require
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: node-group
            operator: In
            values: ["high-memory"]

requiredDuringSchedulingIgnoredDuringExecution — "ignored during execution" means if a node's labels change after the pod is scheduled, the pod stays running. There's no requiredDuringExecution variant that would evict running pods.

Taints and tolerations: dedicating nodes

Taints mark nodes as reserved for specific workloads. Without a matching toleration, pods won't be scheduled there.

# Taint a node group for GPU workloads
kubectl taint nodes node-gpu-1 workload=gpu:NoSchedule

# List all taints on a node
kubectl describe node node-gpu-1 | grep Taints

# Remove the taint (note the trailing "-")
kubectl taint nodes node-gpu-1 workload=gpu:NoSchedule-

Only pods with matching tolerations can run on tainted nodes:

spec:
  tolerations:
  - key: "workload"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"

  # Also add nodeAffinity to actively target these nodes
  # (tolerations allow scheduling on tainted nodes, but don't require it)
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: workload
            operator: In
            values: ["gpu"]

Taint effects:

  • NoSchedule: new pods without toleration won't be scheduled. Existing pods unaffected.
  • PreferNoSchedule: scheduler avoids the node but will use it if necessary.
  • NoExecute: existing pods without toleration are evicted. New pods without toleration won't schedule.
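NoExecute tolerations can also be time-bounded with tolerationSeconds: the pod tolerates the taint for a grace period and is evicted afterwards. A minimal sketch using the built-in unreachable taint (the 120-second window is an arbitrary example):

spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 120   # tolerate the taint for 2 minutes, then evict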

Kubernetes applies system taints automatically: the node lifecycle controller adds node.kubernetes.io/not-ready:NoExecute when a node becomes unhealthy and removes it once the node recovers. This is how pods are evicted from failed nodes, on EKS or any other conformant cluster.

Security contexts: pod and container privilege settings

Security contexts define the privilege model at the pod level (applies to all containers) and container level (overrides pod-level):

spec:
  # Pod-level security context
  securityContext:
    runAsNonRoot: true         # reject pods trying to run as root
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 2000              # supplemental group for volume mounts
    seccompProfile:
      type: RuntimeDefault     # seccomp filtering

  containers:
  - name: api
    image: myapp:1.2.3
    # Container-level overrides
    securityContext:
      allowPrivilegeEscalation: false   # prevent sudo/setuid escalation
      readOnlyRootFilesystem: true      # prevents writes to container root FS
      capabilities:
        drop: ["ALL"]                   # drop all Linux capabilities
        add: ["NET_BIND_SERVICE"]       # add only what's needed (port < 1024)

readOnlyRootFilesystem: true is a strong defense-in-depth measure. If a container is compromised, attackers can't write to the filesystem. Applications that write to local files need emptyDir or PVC mounts for their writable paths.
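In practice that means pairing the read-only root filesystem with an emptyDir for scratch space. A minimal sketch (the /tmp mount path is an assumed example):

spec:
  containers:
  - name: api
    image: myapp:1.2.3
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: tmp
      mountPath: /tmp        # the only writable path in the container
  volumes:
  - name: tmp
    emptyDir: {}             # node-local scratch, deleted with the pod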

💡 Pod anti-affinity: spreading pods across nodes and zones

To prevent all replicas of a Deployment from landing on the same node (and going down together during node failure), use pod anti-affinity:

spec:
  affinity:
    podAntiAffinity:
      # Hard: never two api pods on the same node
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["api"]
        topologyKey: "kubernetes.io/hostname"

      # Soft: prefer spreading across zones
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["api"]
          topologyKey: "topology.kubernetes.io/zone"

topologyKey: kubernetes.io/hostname means "no two pods with the matching label on the same node". topologyKey: topology.kubernetes.io/zone means "prefer not two pods in the same AZ".

Topology spread constraints (a newer mechanism) offer more control:

spec:
  topologySpreadConstraints:
  - maxSkew: 1                                  # max difference in pod count across zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api

maxSkew: 1 with DoNotSchedule ensures pods spread as evenly as possible across AZs, refusing to create imbalance of more than 1 pod. ScheduleAnyway is softer — it schedules even if the constraint can't be satisfied.
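The two whenUnsatisfiable modes can be combined, for example a hard spread across zones with a soft spread across nodes. A sketch:

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule    # hard: refuse zone imbalance above 1
    labelSelector:
      matchLabels:
        app: api
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # soft: prefer even per-node spread
    labelSelector:
      matchLabels:
        app: api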

A Deployment has 6 replicas with requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity using topologyKey: kubernetes.io/hostname. The cluster has 4 nodes. What happens when you try to scale the Deployment to 7 replicas?

Difficulty: medium

Assume each node is healthy and no node carries a taint.

  • A. The 7th pod schedules on a node that already has an api pod, violating anti-affinity
    Incorrect. requiredDuringScheduling means the constraint is a hard requirement; the scheduler will not violate it.
  • B. Scaling to 7 replicas succeeds — Kubernetes distributes pods optimally across 4 nodes
    Incorrect. With requiredDuringScheduling anti-affinity and topologyKey hostname, each pod requires a unique node. 7 pods cannot satisfy this with 4 nodes.
  • C. The 7th pod stays Pending indefinitely — required anti-affinity cannot be satisfied with only 4 available nodes
    Correct! requiredDuringSchedulingIgnoredDuringExecution is a hard constraint, and topologyKey: kubernetes.io/hostname means each api pod must land on a distinct node. Four nodes can run at most 4 such pods, so in this scenario pods 5 and 6 are already Pending and the 7th joins them. Any replica beyond the node count stays Pending until nodes are added.
  • D. Kubernetes automatically adds a new node to accommodate the 7th pod
    Incorrect. Kubernetes doesn't add nodes on its own; that's the Cluster Autoscaler's job, and it only scales up in response to Pending pods. The anti-affinity constraint would leave pods Pending, which could trigger the Cluster Autoscaler if one is configured.

Hint: Required anti-affinity with topologyKey: hostname means one pod per node maximum. How many pods can you run on 4 nodes?