ECS on EC2: Container Instances, the ECS Agent, and What You're Responsible For

The ECS EC2 launch type runs your containers on EC2 instances that you manage. The ECS agent bridges EC2 and the ECS control plane. You control instance sizing, patching, cluster capacity, and task placement — which is both the power and the overhead compared to Fargate.

The EC2 launch type: what you own

ECS has two launch types: EC2 and Fargate. With EC2, ECS schedules containers onto EC2 instances that you provision and manage. With Fargate, AWS handles the underlying instance.

The EC2 launch type makes sense when you need control over the instance: custom AMIs, GPU instances, specific network configurations, or cost optimization via Reserved Instances and Savings Plans. The cost is operational overhead.

What you're responsible for with ECS EC2:

  • Provisioning and registering EC2 instances into the ECS cluster
  • Patching the OS and ECS agent
  • Choosing and maintaining the ECS-optimized AMI
  • Managing cluster capacity (enough instances for your tasks)
  • Disk space management on instances

The ECS agent: how EC2 instances join the cluster

The ECS agent is a process running on each EC2 instance that connects to the ECS control plane. It reports available resources (CPU, memory, ports), receives task placement decisions from ECS, and manages the container lifecycle (starting, stopping, monitoring containers via Docker).
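
The agent also exposes a local introspection API (port 51678) that is handy for confirming an instance registered correctly. A quick check, run on the container instance itself (a sketch — it requires a running agent and is only reachable locally):

```bash
# Query the ECS agent introspection endpoint (reachable only from the instance).
# The response includes the cluster name, the container instance ARN, and the
# agent version — useful when an instance isn't showing up in the cluster.
curl -s http://localhost:51678/v1/metadata
```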

Prerequisites

  • EC2 instances
  • Docker containers
  • IAM roles
  • Auto Scaling Groups

Key Points

  • ECS-optimized AMIs come with the ECS agent pre-installed. Use these as your base AMI.
  • The agent communicates outbound via HTTPS to the ECS endpoint — instances need either internet access or VPC endpoints.
  • The container instance role (attached via an EC2 instance profile) must include the AmazonEC2ContainerServiceforEC2Role managed policy.
  • ECS agent configuration lives in /etc/ecs/ecs.config — this is where you set the cluster name during bootstrap.
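
The instance role wiring can be sketched in Terraform like this (resource names are illustrative):

```hcl
# Sketch: container instance role with the managed policy from the key points.
resource "aws_iam_role" "ecs_agent" {
  name = "ecs-agent"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ecs_agent" {
  role       = aws_iam_role.ecs_agent.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

# The instance profile is what the launch template actually references.
resource "aws_iam_instance_profile" "ecs_agent" {
  name = "ecs-agent"
  role = aws_iam_role.ecs_agent.name
}
```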

Registering instances into the cluster

EC2 instances join a cluster by running the ECS agent with the cluster name configured. For Auto Scaling Group deployments, this goes in the launch template user data:

#!/bin/bash
echo ECS_CLUSTER=my-production-cluster >> /etc/ecs/ecs.config
echo ECS_ENABLE_CONTAINER_METADATA=true >> /etc/ecs/ecs.config

With Terraform:

resource "aws_launch_template" "ecs" {
  name_prefix   = "ecs-"
  image_id      = data.aws_ssm_parameter.ecs_ami.value  # ECS-optimized AMI from SSM
  instance_type = "c5.xlarge"

  iam_instance_profile {
    arn = aws_iam_instance_profile.ecs_agent.arn
  }

  user_data = base64encode(<<-EOF
    #!/bin/bash
    echo ECS_CLUSTER=${aws_ecs_cluster.main.name} >> /etc/ecs/ecs.config
    echo ECS_ENABLE_CONTAINER_METADATA=true >> /etc/ecs/ecs.config
    echo ECS_CONTAINER_STOP_TIMEOUT=120s >> /etc/ecs/ecs.config
  EOF
  )

  vpc_security_group_ids = [aws_security_group.ecs_instances.id]
}

# Always reference ECS-optimized AMI via SSM rather than hardcoding
data "aws_ssm_parameter" "ecs_ami" {
  name = "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"
}

Using SSM Parameter Store for the AMI ID ensures you get the latest ECS-optimized AMI when the launch template is updated, without manually tracking AMI IDs per region.

Task placement: where your tasks land

When ECS schedules a task on EC2, it selects an instance using placement strategies and constraints:

Placement strategies define the preferred distribution:

  • spread by AZ: distribute tasks evenly across availability zones
  • spread by instance: spread across instances within an AZ
  • binpack by CPU/memory: consolidate onto the fewest instances (cost optimization)
  • random: distribute randomly

Placement constraints define hard requirements:

  • memberOf expression: only place on instances matching a condition (specific instance type, custom attribute)
  • distinctInstance: each task must be on a different instance

Combined in a service definition:

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 6

  placement_constraints {
    type = "distinctInstance"  # no two tasks on the same instance
  }

  ordered_placement_strategy {
    type  = "spread"
    field = "attribute:ecs.availability-zone"  # spread across AZs first
  }

  ordered_placement_strategy {
    type  = "spread"
    field = "instanceId"  # then spread across instances within AZ
  }
}

Tasks fail to place (service event: unable to place a task) when no instance both satisfies the constraints and has sufficient free CPU/memory. With distinctInstance, the maximum number of running tasks equals the number of instances — if you have 4 instances and request 6 tasks, 2 tasks stay pending.

💡 EC2 instance families for ECS workloads

Instance type choice affects task density — how many tasks fit on each instance. For a task requiring 512 CPU units (0.5 vCPU) and 1GB memory:

  • t3.medium (2 vCPU, 4GB): fits ~3-4 tasks (memory-bound)
  • c5.xlarge (4 vCPU, 8GB): fits ~7-8 tasks (memory-bound)
  • m5.2xlarge (8 vCPU, 32GB): fits ~16 tasks (CPU-bound)

Each ECS agent reserves CPU and memory for itself (~10% per instance). The actual available capacity for tasks is lower than the raw instance spec.
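
The density math above can be sketched as a quick calculation — raw specs only; subtract the agent's reservation for real-world capacity:

```shell
#!/bin/bash
# Sketch: tasks per instance is the smaller of the CPU-bound fit and the
# memory-bound fit, using raw instance specs (the agent's own reservation
# lowers the real numbers slightly).
task_cpu=512    # CPU units per task (0.5 vCPU)
task_mem=1024   # MB per task

for spec in "t3.medium 2048 4096" "c5.xlarge 4096 8192" "m5.2xlarge 8192 32768"; do
  set -- $spec                  # $1=type, $2=CPU units, $3=memory MB
  by_cpu=$(( $2 / task_cpu ))
  by_mem=$(( $3 / task_mem ))
  if [ "$by_cpu" -lt "$by_mem" ]; then fits=$by_cpu; else fits=$by_mem; fi
  echo "$1: up to $fits tasks (CPU fit $by_cpu, memory fit $by_mem)"
done
```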

Common instance choices:

  • General workloads: m5 or m6i family (balanced CPU/memory)
  • CPU-heavy processing: c5 or c6i (compute-optimized)
  • Memory-heavy: r5 or r6i (memory-optimized)
  • GPU workloads: g4dn or p3 (GPU instances — not available in Fargate)
  • ARM/Graviton: m6g, c6g — ECS-optimized AMI available for arm64, often 20% cheaper than x86 equivalents

Mixed instance types in one ASG work but complicate capacity planning. Capacity providers with separate ASGs per instance family give cleaner control.
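
A per-ASG capacity provider can be sketched like this (names are illustrative; managed scaling sizes the ASG to pending tasks):

```hcl
# Sketch: one capacity provider per instance-family ASG.
resource "aws_ecs_capacity_provider" "general" {
  name = "general-m6i"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.general.arn

    managed_scaling {
      status          = "ENABLED"
      target_capacity = 90   # keep ~10% spare instance capacity as headroom
    }
  }
}

resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name       = aws_ecs_cluster.main.name
  capacity_providers = [aws_ecs_capacity_provider.general.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.general.name
    weight            = 1
  }
}
```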

EC2 vs Fargate: the actual tradeoff

| Factor | EC2 | Fargate |
|---|---|---|
| Instance management | You patch, update AMIs | AWS handles |
| Cost (steady-state) | Lower (RI/Savings Plans) | Higher per vCPU/hour |
| Instance types | Any EC2 type, incl. GPU | Limited set |
| Startup time | Fast (task on warm instance) | Slower (~10s Fargate cold start) |
| Network modes | bridge, host, awsvpc | awsvpc only |
| Disk | Instance storage available | 20GB ephemeral only |
| Cluster capacity | You manage via ASG/capacity providers | AWS manages |

EC2 launch type is typically 20–40% cheaper for steady-state workloads when Reserved Instances or Savings Plans are applied. That cost advantage disappears at small scale (few instances) where operational overhead isn't amortized.

An ECS EC2 cluster has 5 instances (c5.xlarge, 4 vCPU / 8GB each). A service runs 20 tasks, each requiring 0.8 vCPU and 1.5GB. New deployments fail with 'unable to place a task'. The cluster shows available capacity. What is likely happening?

The service uses a rolling deployment strategy. The task definition is unchanged. Cluster metrics show instances have free CPU and memory. The deployment has been stuck for 10 minutes.

  • A. The ECS agent is not running on some instances
    Incorrect. If the agent were down, ECS would mark those instances as unhealthy and not schedule tasks there — but it wouldn't explain available capacity metrics being reported.
  • B. The rolling deployment needs to place new tasks before stopping old ones, but the cluster has no capacity to run both the old and new tasks simultaneously
    Correct! ECS rolling deployments start new tasks before stopping old ones (the maximumPercent setting, default 200%, means up to 40 tasks can exist simultaneously). 20 existing tasks × 0.8 vCPU = 16 vCPU consumed. 5 instances × 4 vCPU = 20 vCPU total (minus agent reservation ≈ 18 usable). New tasks need CPU that isn't available because the old tasks are still running. Fix: use capacity providers with managed scaling to add temporary capacity during deployment, reduce maximumPercent, or add spare instances to the cluster.
  • C. The c5.xlarge instances don't have enough memory for 0.8 vCPU tasks
    Incorrect. A c5.xlarge has 8GB. At 1.5GB per task, it can fit ~5 tasks per instance, giving 25 total capacity across 5 instances. Memory is not the binding constraint here.
  • D. The task definition has a port conflict
    Incorrect. Port conflicts prevent task placement on specific instances but wouldn't cause cluster-wide placement failures when available capacity is reported.

Hint: Rolling deployment starts new tasks before stopping old ones. What does the cluster's total capacity look like during the transition?