SSM Agent: Session Manager, Run Command, and Replacing Bastions


SSM Agent replaces SSH and bastion hosts for EC2 access. It runs as root, which matters for Run Command blast radius. Session Manager requires no inbound ports — just the agent, an instance profile, and SSM VPC endpoints for private instances. Patch Manager and State Manager extend it beyond interactive access.


What SSM Agent does

SSM Agent is a daemon that runs on EC2 instances (and on-premises servers). It polls SSM endpoints to receive instructions and report back status. This polling model means no inbound network ports are required — the instance initiates all connections outbound to AWS.
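Because registration happens over this outbound polling channel, the quickest health check is to ask SSM which instances it can see. A minimal sketch, assuming the AWS CLI is configured with permission to call DescribeInstanceInformation (no test output shown — it depends on your account):

```shell
# List managed instances with their agent ping status.
# PingStatus "Online" means the agent is successfully polling SSM.
aws ssm describe-instance-information \
  --query 'InstanceInformationList[].{Id:InstanceId,Ping:PingStatus,Agent:AgentVersion}' \
  --output table
```

An instance missing from this list has an agent, IAM, or connectivity problem — the same triage order as the prerequisites below.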

Four core capabilities:

| Capability | What it does | Replaces |
|---|---|---|
| Session Manager | Interactive shell access over SSM | SSH + bastion host |
| Run Command | Execute scripts/commands across fleets | SSH + config management for one-offs |
| Patch Manager | Scan and remediate missing OS patches | Third-party patch tools |
| State Manager | Enforce configuration state (association documents) | Ad-hoc configuration drift |

Run Command runs as root — the blast radius is the entire instance


SSM Run Command executes via the SSM Agent process, which runs as root on Linux (SYSTEM on Windows). A Run Command document that runs on 50 instances runs as root on all 50 simultaneously. There is no built-in privilege separation — the same IAM permission that lets you run a patch scan also lets you run rm -rf /.

Prerequisites

  • IAM policies
  • EC2 instance profiles
  • Systems Manager documents

Key Points

  • Run Command inherits the SSM Agent process user — root on Linux, SYSTEM on Windows.
  • IAM controls who can send commands, not what privilege level the command runs at.
  • Use resource tags and IAM Condition keys (ssm:resourceTag) to scope which instances a principal can target.
  • Audit Run Command execution via CloudTrail (SendCommand) and the command history in the Systems Manager console.
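The tag-scoping point can be expressed directly in IAM. A sketch in the document's Terraform style (the policy name and tag value are illustrative): SendCommand is allowed only with the AWS-RunShellScript document, and only against instances tagged Environment=staging. The two statements are both required because SendCommand authorizes against the document ARN and the instance ARNs separately.

```hcl
# Sketch: scope Run Command to one document and one instance tag.
# Policy name and tag value are illustrative.
resource "aws_iam_policy" "run_command_staging_only" {
  name = "run-command-staging-only"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Allow sending only this document...
        Effect   = "Allow"
        Action   = "ssm:SendCommand"
        Resource = "arn:aws:ssm:*:*:document/AWS-RunShellScript"
      },
      {
        # ...and only against instances carrying the staging tag
        Effect   = "Allow"
        Action   = "ssm:SendCommand"
        Resource = "arn:aws:ec2:*:*:instance/*"
        Condition = {
          StringEquals = { "ssm:resourceTag/Environment" = "staging" }
        }
      }
    ]
  })
}
```

Remember this only narrows *where* a command can run, not *as whom* — anything permitted still executes as root on the targeted instances.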

Session Manager: replacing bastions

A bastion host requires: a publicly accessible EC2 instance, security group rules allowing inbound SSH, key pairs distributed to engineers, ongoing patching of the bastion itself, and VPN or IP allowlisting for access control. Session Manager eliminates all of this.

Requirements for Session Manager:

  1. SSM Agent installed and running (pre-installed on Amazon Linux 2/2023 and recent Ubuntu AMIs)
  2. Instance profile with AmazonSSMManagedInstanceCore managed policy (or equivalent)
  3. For private instances: SSM, SSM Messages, and EC2 Messages VPC endpoints
# Instance profile for Session Manager access
resource "aws_iam_role" "ssm_instance" {
  name = "ssm-instance-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ssm_core" {
  role       = aws_iam_role.ssm_instance.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_instance_profile" "ssm" {
  name = "ssm-instance-profile"
  role = aws_iam_role.ssm_instance.name
}

Starting a session via CLI:

# Start an interactive shell session
aws ssm start-session --target i-1234567890abcdef0

# Port forwarding: forward RDS port to local machine (no direct network access needed)
aws ssm start-session \
  --target i-1234567890abcdef0 \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters '{"host":["mydb.cluster-abc123.us-east-1.rds.amazonaws.com"],"portNumber":["5432"],"localPortNumber":["5432"]}'

# SSH over Session Manager (requires the Session Manager plugin for the AWS CLI)
ssh -o ProxyCommand='aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters portNumber=%p' ec2-user@i-1234567890abcdef0

The port forwarding pattern is particularly useful: engineers access RDS, Redis, or internal services through an EC2 instance without those services having public endpoints or the instance having inbound ports open.
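To make the SSH-over-SSM pattern transparent, the ProxyCommand can live in ~/.ssh/config so a plain `ssh ec2-user@i-…` works for every instance ID. A minimal sketch, assuming the Session Manager plugin is installed:

```
# ~/.ssh/config — route any instance-ID host through Session Manager
Host i-* mi-*
    ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
```

The `mi-*` pattern covers on-premises managed instances, which SSM registers with that prefix.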

VPC endpoints for private instances

Instances in private subnets (no internet gateway, no NAT) can't reach SSM endpoints over the public internet. Three VPC endpoints are required:

variable "vpc_id" {}
variable "subnet_ids" { type = list(string) }
variable "region" { default = "us-east-1" }

locals {
  ssm_endpoints = [
    "com.amazonaws.${var.region}.ssm",
    "com.amazonaws.${var.region}.ssmmessages",
    "com.amazonaws.${var.region}.ec2messages",
  ]
}

resource "aws_vpc_endpoint" "ssm" {
  for_each = toset(local.ssm_endpoints)

  vpc_id              = var.vpc_id
  service_name        = each.value
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.subnet_ids
  security_group_ids  = [aws_security_group.ssm_endpoint.id]
  private_dns_enabled = true
}

resource "aws_security_group" "ssm_endpoint" {
  name   = "ssm-vpc-endpoints"
  vpc_id = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]  # replace with your VPC CIDR
  }
}

With private_dns_enabled = true, the SSM Agent on private instances automatically resolves ssm.us-east-1.amazonaws.com to the VPC endpoint ENI — no agent configuration changes required.

Run Command: fleet-wide execution with targeting

Run Command executes SSM documents against instance fleets. Targeting by tag avoids maintaining explicit instance ID lists:

# Run shell script on all instances tagged Environment=production
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=tag:Environment,Values=production" \
  --parameters 'commands=["systemctl restart nginx", "systemctl status nginx"]' \
  --output-s3-bucket-name my-ssm-output-bucket \
  --output-s3-key-prefix run-command-logs

# Check command status
aws ssm list-command-invocations \
  --command-id <command-id> \
  --details

Output goes to S3 when specified. Without S3, output is available in the console/CLI for a limited time and truncated at 48,000 characters.
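To pull stdout and stderr for a single instance rather than the fleet summary, get-command-invocation returns the captured streams directly (the command ID and instance ID below are placeholders):

```shell
# Fetch per-instance output for one command invocation.
# IDs are placeholders — substitute your own.
aws ssm get-command-invocation \
  --command-id "0e0c6ba0-0000-0000-0000-example00000" \
  --instance-id i-1234567890abcdef0 \
  --query '{Status:Status,Stdout:StandardOutputContent,Stderr:StandardErrorContent}'
```

This is the handiest call when debugging a single failed invocation out of a large fleet run.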

Rate control parameters prevent simultaneous execution on too many instances:

  • --max-concurrency: number or percentage of instances to run on simultaneously (e.g., 10%)
  • --max-errors: stop after this many failures (e.g., 5 or 5%)
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=tag:Environment,Values=production" \
  --parameters 'commands=["yum update -y"]' \
  --max-concurrency "25%" \
  --max-errors "10%"

Patch Manager: automated patching without SSH

Patch Manager runs patch baselines against instance fleets on a schedule. The default baseline for Amazon Linux 2 approves security patches automatically after 7 days.

# Custom patch baseline — approve critical patches immediately, others after 7 days
resource "aws_ssm_patch_baseline" "main" {
  name             = "custom-amazon-linux-2"
  operating_system = "AMAZON_LINUX_2"

  approval_rule {
    approve_after_days = 0  # immediately

    patch_filter {
      key    = "CLASSIFICATION"
      values = ["Security"]
    }
    patch_filter {
      key    = "SEVERITY"
      values = ["Critical", "Important"]
    }
  }

  approval_rule {
    approve_after_days = 7

    patch_filter {
      key    = "CLASSIFICATION"
      values = ["Bugfix"]
    }
  }
}

# Associate baseline with a patch group
resource "aws_ssm_patch_group" "main" {
  baseline_id = aws_ssm_patch_baseline.main.id
  patch_group = "production"  # matches tag PatchGroup=production on instances
}

Tag instances with PatchGroup=production to associate them with this baseline. Maintenance windows define when patching runs.
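Wiring the pieces together, a maintenance window that patches this group could look like the following sketch in the document's Terraform style (the schedule, names, and concurrency values are assumptions; AWS-RunPatchBaseline is the standard patching document):

```hcl
# Sketch: patch the "production" patch group every Sunday at 02:00 UTC.
# Names, schedule, and rate-control values are illustrative.
resource "aws_ssm_maintenance_window" "patching" {
  name     = "production-patching"
  schedule = "cron(0 2 ? * SUN *)"
  duration = 3  # window length in hours
  cutoff   = 1  # stop starting new tasks 1 hour before the end
}

resource "aws_ssm_maintenance_window_target" "patch_group" {
  window_id     = aws_ssm_maintenance_window.patching.id
  resource_type = "INSTANCE"

  targets {
    key    = "tag:PatchGroup"
    values = ["production"]
  }
}

resource "aws_ssm_maintenance_window_task" "patch" {
  window_id       = aws_ssm_maintenance_window.patching.id
  task_type       = "RUN_COMMAND"
  task_arn        = "AWS-RunPatchBaseline"
  max_concurrency = "25%"
  max_errors      = "10%"

  targets {
    key    = "WindowTargetIds"
    values = [aws_ssm_maintenance_window_target.patch_group.id]
  }

  task_invocation_parameters {
    run_command_parameters {
      parameter {
        name   = "Operation"
        values = ["Install"]  # "Scan" reports missing patches without installing
      }
    }
  }
}
```

Because the task is just Run Command under the hood, the same rate controls (max_concurrency, max_errors) apply as in the fleet examples above.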

An EC2 instance in a private subnet has the AmazonSSMManagedInstanceCore policy attached to its instance profile, but aws ssm start-session fails with 'TargetNotConnected'. The instance has no internet access. What is the most likely cause?

Difficulty: easy

The instance was launched recently with Amazon Linux 2. The IAM permissions are correct. The instance can't reach any public endpoints — there's no NAT gateway or internet gateway in the subnet.

  • A. The SSM Agent is not installed on the instance
    Incorrect. Amazon Linux 2 ships with SSM Agent pre-installed and running. This is unlikely unless the AMI was customized to remove it.
  • B. The VPC endpoints for SSM, SSM Messages, and EC2 Messages are not configured — the agent can't reach the SSM service endpoints from the private subnet
    Correct! SSM Agent polls AWS SSM endpoints (ssm.region.amazonaws.com, ssmmessages.region.amazonaws.com, ec2messages.region.amazonaws.com) over HTTPS. Without internet access or VPC endpoints for these services, the agent can't register or receive sessions. The TargetNotConnected error means SSM has no active polling connection from this instance. Fix: create Interface VPC endpoints for all three services in the private subnet with private_dns_enabled=true.
  • C. The security group on the instance must allow inbound port 443 for Session Manager
    Incorrect. Session Manager requires no inbound ports. The agent initiates outbound connections to SSM endpoints. The security group on the instance only needs outbound 443 allowed (which is typically the default).
  • D. AmazonSSMManagedInstanceCore doesn't include Session Manager permissions — a separate policy is required
    Incorrect. AmazonSSMManagedInstanceCore includes all permissions the instance needs for Session Manager, Run Command, and Patch Manager. No separate policy is required.

Hint: The agent needs to reach SSM service endpoints. What connectivity mechanism allows instances in private subnets to reach AWS services?