Route 53 Routing Policies: Weighted, Latency, Failover, Geolocation, and Health Checks

Why routing policies exist

A DNS record normally returns a single answer. Route 53 routing policies allow the same record name to return different answers based on criteria: traffic weights, requester location, endpoint health, or network latency. This enables DNS-level traffic management without changing application code.

Health checks: the prerequisite for active routing policies


Failover and multi-value policies depend on health checks. Route 53 health checkers (distributed globally) periodically poll your endpoints. An unhealthy record is excluded from DNS responses. Without health checks attached, Route 53 returns all records regardless of endpoint status.

Prerequisites

DNS TTL and caching
HTTP health check basics
Route 53 hosted zones

Key Points

Route 53 health checkers are separate from your application — they test from AWS's global infrastructure.
Calculated health checks: combine multiple child health checks with a threshold. Requiring all children to be healthy gives AND logic; requiring at least one gives OR logic.
CloudWatch alarm health checks: an endpoint is healthy if a CloudWatch alarm is in OK state.
Health check evaluation is separate from DNS TTL: Route 53 can mark a record unhealthy before the TTL expires, but resolvers that already cached the answer keep using it until it does.
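As a rough illustration of the calculated and CloudWatch-based checks listed above, the following sketch uses the Terraform aws_route53_health_check resource. The hostnames (api-a.example.com, api-b.example.com), the resource names, and the alarm name api-5xx-rate are assumptions made up for this example, not values from a real setup.

```hcl
# Two hypothetical endpoint checks that act as children of a calculated check.
resource "aws_route53_health_check" "api_a" {
  fqdn              = "api-a.example.com" # assumed hostname
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_health_check" "api_b" {
  fqdn              = "api-b.example.com" # assumed hostname
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

# Calculated check: healthy while at least 1 of the 2 children is healthy (OR).
# Setting child_health_threshold = 2 would require both children (AND).
resource "aws_route53_health_check" "api_any" {
  type                   = "CALCULATED"
  child_health_threshold = 1
  child_healthchecks = [
    aws_route53_health_check.api_a.id,
    aws_route53_health_check.api_b.id,
  ]
}

# CloudWatch-based check: follows the state of an existing alarm.
resource "aws_route53_health_check" "api_alarm" {
  type                            = "CLOUDWATCH_METRIC"
  cloudwatch_alarm_name           = "api-5xx-rate" # assumed alarm name
  cloudwatch_alarm_region         = "us-east-1"
  insufficient_data_health_status = "Healthy" # treat missing data as healthy
}
```

Any of these checks can then be attached to a record through its health_check_id argument.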

Weighted routing: gradual traffic shifting

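Weighted routing distributes traffic in proportion to record weights. You create multiple records with the same name and type, each with its own set_identifier and weight; a record receives its weight divided by the sum of all weights: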

# 90% to production, 10% to canary
# A load balancer DNS name is a hostname, so a non-alias record pointing at it
# must be a CNAME; an A record's records list only accepts IPv4 addresses.
resource "aws_route53_record" "api_prod" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "CNAME"
  set_identifier = "production"
  ttl            = 60

  weighted_routing_policy {
    weight = 90
  }

  records = [aws_lb.production.dns_name]
}

resource "aws_route53_record" "api_canary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "CNAME"
  set_identifier = "canary"
  ttl            = 60

  weighted_routing_policy {
    weight = 10
  }

  records = [aws_lb.canary.dns_name]
}

Weight 0 sends no traffic to a record (effectively disables it without deleting it). If all records have weight 0, Route 53 returns all records equally.

For weighted routing to AWS resources (ALB, CloudFront), use Alias records with weighted policy:

resource "aws_route53_record" "api_prod" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "production"

  weighted_routing_policy {
    weight = 90
  }

  alias {
    name                   = aws_lb.production.dns_name
    zone_id                = aws_lb.production.zone_id
    evaluate_target_health = true
  }
}
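For gradual shifting, one option is to drive both weights from a single input so they always sum to 100. The sketch below is one possible arrangement, not a prescribed pattern: the variable name canary_weight and the local production_weight are assumptions introduced here, and the two records are variants of the weighted Alias record above.

```hcl
# Hypothetical input: percentage of traffic to send to the canary.
variable "canary_weight" {
  type        = number
  default     = 10
  description = "Share of traffic (0-100) sent to the canary."

  validation {
    condition = (
      var.canary_weight >= 0 &&
      var.canary_weight <= 100 &&
      floor(var.canary_weight) == var.canary_weight
    )
    error_message = "canary_weight must be a whole number between 0 and 100."
  }
}

locals {
  # The production weight is derived, so the two weights can never both be 0.
  production_weight = 100 - var.canary_weight
}

resource "aws_route53_record" "api_prod" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "production"

  weighted_routing_policy {
    weight = local.production_weight
  }

  alias {
    name                   = aws_lb.production.dns_name
    zone_id                = aws_lb.production.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_canary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "canary"

  weighted_routing_policy {
    weight = var.canary_weight
  }

  alias {
    name                   = aws_lb.canary.dns_name
    zone_id                = aws_lb.canary.zone_id
    evaluate_target_health = true
  }
}
```

Raising canary_weight in steps (for example 10, 25, 50, 100) and applying each time shifts traffic without touching the record definitions. Setting it to 0 takes the canary out of rotation without deleting its record.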

Latency routing: route to lowest-latency region

Route 53 routes each DNS query to the record in the AWS region with the lowest measured latency for the requester. Route 53 maintains a database of latency measurements between client locations and AWS regions:

resource "aws_route53_record" "api_us" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "us-east-1"

  latency_routing_policy {
    region = "us-east-1"
  }

  alias {
    name                   = aws_lb.us_east.dns_name
    zone_id                = aws_lb.us_east.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_eu" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "eu-west-1"

  latency_routing_policy {
    region = "eu-west-1"
  }

  alias {
    name                   = aws_lb.eu_west.dns_name
    zone_id                = aws_lb.eu_west.zone_id
    evaluate_target_health = true
  }
}

Latency routing is based on AWS's measured network latency data, not geographic proximity. A user in Germany might be routed to us-east-1 if Route 53's latency data shows lower latency there than eu-west-1. Combine with health checks so that if a region becomes unhealthy, traffic reroutes to the next-lowest-latency region.
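The records above rely on evaluate_target_health, which follows the load balancer's own target health. If you also want an application-level signal, a Route 53 health check can be attached to the latency record. This is a minimal sketch; the hostname api-us-east.example.com and the resource name us_east are assumptions for illustration.

```hcl
# Hypothetical endpoint check for the us-east-1 stack.
resource "aws_route53_health_check" "us_east" {
  fqdn              = "api-us-east.example.com" # assumed regional hostname
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "api_us" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "us-east-1"

  latency_routing_policy {
    region = "us-east-1"
  }

  # When this check fails, Route 53 stops answering with this record and
  # picks the healthy record with the next-lowest latency instead.
  health_check_id = aws_route53_health_check.us_east.id

  alias {
    name                   = aws_lb.us_east.dns_name
    zone_id                = aws_lb.us_east.zone_id
    evaluate_target_health = true
  }
}
```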

Failover routing: active-passive DR

Failover routing has one primary and one secondary record. Traffic goes to primary when healthy; Route 53 returns secondary when primary's health check fails:

resource "aws_route53_health_check" "primary" {
  fqdn              = "api-primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "api_primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  # Secondary doesn't need a health check — it's always returned when primary fails
  alias {
    name                   = aws_lb.dr.dns_name
    zone_id                = aws_lb.dr.zone_id
    evaluate_target_health = true
  }
}

Route 53 health checkers poll from multiple global locations. Each checker applies failure_threshold on its own, and Route 53 considers the endpoint unhealthy once 18% or fewer of the checkers report it healthy. With failure_threshold = 3 and request_interval = 30, the status changes roughly 90 seconds (3 × 30 s) after the endpoint starts failing.
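Detection time is roughly failure_threshold multiplied by request_interval, so a shorter interval and a lower threshold shorten it. The sketch below shows that arithmetic with a fast-interval check. The local names, the resource name primary_fast, and the output name are assumptions for this example, and the 10-second interval is typically billed as an optional feature.

```hcl
locals {
  failure_threshold = 2
  request_interval  = 10 # Route 53 accepts 10 (fast) or 30 (standard) seconds

  # Approximate time to mark the endpoint unhealthy: 2 * 10 = 20 seconds.
  detection_seconds = local.failure_threshold * local.request_interval
}

resource "aws_route53_health_check" "primary_fast" {
  fqdn              = "api-primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = local.failure_threshold
  request_interval  = local.request_interval
}

output "approx_detection_seconds" {
  value = local.detection_seconds
}
```

A lower threshold reacts faster but is also more likely to fail over on a brief blip, so the values here are a trade-off rather than a recommendation.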

Geolocation routing: continent, country, or US state

Geolocation routing routes based on the requester's geographic location (not network latency). Use it for content localization, data residency requirements, or regulatory compliance.

# EU users go to EU stack
resource "aws_route53_record" "api_eu" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "eu-users"

  geolocation_routing_policy {
    continent = "EU"
  }

  alias {
    name                   = aws_lb.eu.dns_name
    zone_id                = aws_lb.eu.zone_id
    evaluate_target_health = true
  }
}

# US users
resource "aws_route53_record" "api_us" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "us-users"

  geolocation_routing_policy {
    country = "US"
  }

  alias {
    name                   = aws_lb.us.dns_name
    zone_id                = aws_lb.us.zone_id
    evaluate_target_health = true
  }
}

# Default: all other locations
resource "aws_route53_record" "api_default" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "default"

  geolocation_routing_policy {
    country = "*"   # default — matches any location not covered by other records
  }

  alias { ... }
}

Always create a default record (country = "*"). Without a default, locations not covered by specific geolocation rules receive NODATA responses — effectively a DNS failure for those users.

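Geolocation works from the source IP address of the DNS query, which is usually the recursive resolver's address, or from the client subnet when the resolver supports the edns-client-subnet extension. VPN and proxy users are therefore located by the VPN/proxy (or the resolver it uses), not by their physical location.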
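The examples above cover continent, country, and default records. For the US-state case, the geolocation block also accepts a subdivision. The sketch below assumes a hypothetical west-coast load balancer named aws_lb.us_west and an illustrative set_identifier.

```hcl
# Users located in California go to a west-coast stack.
resource "aws_route53_record" "api_us_ca" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "us-california"

  geolocation_routing_policy {
    country     = "US"
    subdivision = "CA" # two-letter US state code
  }

  alias {
    name                   = aws_lb.us_west.dns_name # assumed load balancer
    zone_id                = aws_lb.us_west.zone_id
    evaluate_target_health = true
  }
}
```

Route 53 answers with the most specific match, so a California query would use this record, other US queries would use the country = "US" record, and everything else would fall through to the continent or default records.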

A Route 53 failover configuration has a PRIMARY record with a health check and a SECONDARY record. The primary endpoint has a 5-minute outage. During the outage, some clients successfully connect to the secondary, but others continue trying the primary and fail. The DNS TTL is 60 seconds. Why do some clients still try the primary?

Difficulty: medium

Route 53 health checkers detect the outage within 90 seconds (3 failures × 30s interval). Route 53 stops returning the primary record after that. But some clients experience failure for the full 5 minutes.

A. Route 53 health checks take 5 minutes to detect failures
Incorrect. With failure_threshold=3 and request_interval=30, Route 53 detects failure in ~90 seconds, not 5 minutes.
B. Recursive DNS resolvers and client-side DNS caches keep serving the cached primary answer until its TTL expires, and some caches hold it even longer
Correct! The TTL is the maximum time a resolver should cache a response. After Route 53 stops returning the primary, responses already cached by recursive resolvers and client operating systems stay in use until they expire. With a 60-second TTL, a resolver that cached the primary just before Route 53 marked it unhealthy keeps serving it for up to 60 more seconds, so the expected worst case is roughly 90 seconds of detection plus 60 seconds of caching. Some resolvers and applications also hold answers longer than the TTL, so a few clients may see the old response for minutes. For faster failover, lower the TTL ahead of time (to 30 seconds or less), or accept that DNS failover is not instantaneous.
C. Route 53 continues returning the primary during the grace period
Incorrect. There's no grace period. Once Route 53 marks the primary unhealthy, it stops returning it in DNS responses.
D. The SECONDARY record doesn't have a health check so Route 53 won't use it
Incorrect. Secondary records without health checks are returned whenever the primary is unhealthy. Health checks on the secondary are optional.

Hint: Route 53 stops serving the primary after detection, but what about DNS responses that were already cached before Route 53 marked it unhealthy?