Professional · 2024

Platform Engineering — GitOps on EKS

Production infrastructure platform for a growing SaaS product — Terraform and Terragrunt across seven AWS accounts applied through Atlantis, Kubernetes workload delivery via ArgoCD with Flagger-based canary analysis, and two custom operators handling namespace provisioning and progressive delivery policy, replacing a manual process that blocked engineering teams for days at a time.

Terraform · Terragrunt · Atlantis · ArgoCD · ApplicationSets · Flagger · Kubernetes · EKS · Karpenter · Helm · GitHub Actions · Go · controller-runtime · Kubebuilder · OPA Gatekeeper · Conftest · External Secrets Operator · Cert Manager · Prometheus Operator · AWS Load Balancer Controller · External DNS · Fluent Bit · Amazon ECR · AWS Secrets Manager · AWS Organizations · MSK · Aurora PostgreSQL · ElastiCache · OpenSearch · Transit Gateway · IRSA · OIDC

Impact

Peak throughput: 5K req/s
p99 latency: 28ms (↓ from 140ms)
Annual compute savings: ~$124K (54% Spot)
AWS accounts: 7 (single region)
Deploy frequency: 50+ / day
MTTR: 8 min (auto-rollback)
Terraform modules: 18 reusable
Static credentials: eliminated

The Problem

Engineering headcount grew from roughly 40 to 100 people in about 18 months, and the infrastructure tooling did not grow with it. All AWS resources lived in a single Terraform root module with no state isolation — a plan that touched the production VPC shared state with the dev account. Applies ran from whoever had the right IAM credentials on their laptop, with no review step and no audit trail. Kubernetes manifests lived in individual service repositories and were deployed by running kubectl apply directly from CI pipelines, which meant every pipeline held production cluster access credentials as long-lived secrets.

The day-to-day cost was measurable. Adding a new service required opening a platform ticket for namespace creation, RBAC setup, and resource quota — a request that typically took two to three days to fulfill. Staging and production drifted apart because changes applied during incidents were not always committed back to version control. Five platform engineers were the only people who could safely touch infrastructure, creating a dependency bottleneck across five product teams.

The immediate trigger for addressing this was a production incident caused by a developer running terraform destroy on the staging VPC while intending to target a sandbox account. There was no workspace isolation, no plan review, and no gate between a terminal command and an irreversible infrastructure change. The recovery took about 40 minutes. That incident forced a prioritization conversation that the day-to-day friction alone had not.

Architecture

IaC layer. The platform is split across two repositories. infra-cloud holds all AWS infrastructure as Terraform, composed with Terragrunt across seven accounts: management, security, shared-services, dev, staging, production, and sandbox. Eighteen reusable modules cover the full stack — vpc, transit-gateway, eks, karpenter, msk, aurora, elasticache, opensearch, iam-baseline, kms, acm, route53-zone, and others. Terragrunt's inputs inheritance and dependency declarations give each account its own state file and encode provisioning order as explicit code. Atlantis runs as a Kubernetes Deployment in the management cluster and handles all plan and apply operations triggered by pull requests. A Conftest step validates the Terraform plan JSON against OPA policies before any apply is permitted, blocking unencrypted resources, missing cost-allocation tags, and prohibited public resource types before they reach AWS.
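The policy gate operates on the JSON representation of the Terraform plan. The actual policies are Rego evaluated by Conftest; as a rough illustration of what the tag check expresses, the same condition written as plain Go amounts to walking planned resources and flagging any that lack a required cost-allocation tag (the struct below is a trimmed, hypothetical stand-in for the `resource_changes[].change.after` portion of the plan JSON):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plannedResource is a trimmed, illustrative stand-in for the subset of
// Terraform's plan JSON the tag policy inspects.
type plannedResource struct {
	Address string
	Tags    map[string]string
}

// missingTags returns one violation per resource/tag pair where a
// required cost-allocation tag is absent — the condition the Rego
// policy enforces before Atlantis permits an apply.
func missingTags(resources []plannedResource, required []string) []string {
	var violations []string
	for _, r := range resources {
		for _, tag := range required {
			if _, ok := r.Tags[tag]; !ok {
				violations = append(violations, fmt.Sprintf("%s: missing tag %q", r.Address, tag))
			}
		}
	}
	return violations
}

func main() {
	// A trimmed plan fragment; a real plan nests this far deeper.
	raw := `[{"Address":"aws_s3_bucket.logs","Tags":{"team":"platform"}}]`
	var resources []plannedResource
	if err := json.Unmarshal([]byte(raw), &resources); err != nil {
		panic(err)
	}
	for _, v := range missingTags(resources, []string{"team", "cost-center"}) {
		fmt.Println(v)
	}
}
```

A non-empty violation list fails the Conftest step, so the plan never reaches apply.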

Application delivery. infra-apps is the ArgoCD manifest repository. The root App of Apps is a Helm chart that renders child Application manifests for every platform component and service. Platform components — Karpenter, AWS Load Balancer Controller, External DNS, Cert Manager, External Secrets Operator, Prometheus Operator, Fluent Bit, OPA Gatekeeper, and Flagger — are pinned-version Applications with sync waves enforcing CRD-before-operator ordering. Service teams declare deployments as Helm values files; ApplicationSets with a cluster generator target the appropriate environment based on the values file path. A merged image tag propagates to dev immediately, reaches staging after a five-minute Prometheus health gate, and arrives in production behind an explicit Slack approval step.
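The promotion ladder in the last sentence can be sketched as a small decision function. Stage names and the signature are illustrative; the real gates are the five-minute Prometheus health check and the Slack approval step:

```go
package main

import "fmt"

// nextStage returns where a merged image tag propagates next: dev
// immediately, staging once the Prometheus health gate passes, and
// production only behind an explicit approval. Stage names are
// illustrative, not the real pipeline identifiers.
func nextStage(current string, healthGatePassed, approved bool) string {
	switch current {
	case "merged":
		return "dev" // no gate: dev tracks main immediately
	case "dev":
		if healthGatePassed {
			return "staging"
		}
	case "staging":
		if approved {
			return "production"
		}
	}
	return current // gate not yet satisfied: stay put
}

func main() {
	fmt.Println(nextStage("merged", false, false))
	fmt.Println(nextStage("dev", true, false))
	fmt.Println(nextStage("staging", false, false))
}
```

The useful property is that production is unreachable without the approval flag, regardless of how the health gate evaluates.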

Progressive delivery and operators. Flagger manages canary analysis for all production deployments using the AWS Load Balancer Controller ingress integration — no service mesh. A CanaryPolicy CRD, reconciled by a custom operator, translates each service's SLO declaration into a Flagger Canary object with platform-enforced defaults: minimum analysis window, a mandatory smoke test step before any traffic shift, and consistent Prometheus query templates that service teams cannot override. Traffic shifts in 10% increments; an SLO breach triggers an automatic full rollback. The TenantWorkspace operator handles namespace provisioning end-to-end — a single CR submission creates the namespace, RoleBindings from the spec's team field, ResourceQuota, LimitRange, NetworkPolicy with deny-external-ingress defaults, and an ArgoCD AppProject scoped to that team. Both operators are written in Go with Kubebuilder and controller-runtime and include admission webhooks for spec validation.
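The fan-out from one TenantWorkspace CR to its child objects can be sketched as a pure planning function. Field and object names below are illustrative; the real operator builds typed Kubernetes objects and applies them idempotently through controller-runtime's client:

```go
package main

import "fmt"

// TenantWorkspaceSpec is a trimmed, hypothetical version of the CRD spec:
// just the namespace to provision and the owning team.
type TenantWorkspaceSpec struct {
	Name string
	Team string
}

// childObjects lists the resources one reconcile of a TenantWorkspace
// creates. In the real operator each entry is a typed object; here the
// names simply document the fan-out.
func childObjects(spec TenantWorkspaceSpec) []string {
	return []string{
		"Namespace/" + spec.Name,
		fmt.Sprintf("RoleBinding/%s-admin (group: %s)", spec.Name, spec.Team),
		"ResourceQuota/" + spec.Name,
		"LimitRange/" + spec.Name,
		"NetworkPolicy/" + spec.Name + "-deny-external-ingress",
		fmt.Sprintf("AppProject/%s (ArgoCD, scoped to %s)", spec.Team, spec.Team),
	}
}

func main() {
	for _, obj := range childObjects(TenantWorkspaceSpec{Name: "payments", Team: "payments-team"}) {
		fmt.Println(obj)
	}
}
```

Because the function is deterministic in the spec, re-running the reconcile converges on the same set of objects, which is what makes the provisioning idempotent.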

Key Decisions

  1. Terragrunt over native Terraform workspaces: with seven accounts, workspaces produce either a single state blob with the combined blast radius of all accounts, or seven per-account root modules that diverge under maintenance. Terragrunt's inputs inheritance and dependency graph give each account its own state file and capture provisioning order (VPC before EKS, KMS before RDS) as explicit code rather than undocumented tribal knowledge.

  2. Atlantis over Terraform Cloud: Atlantis runs as a Deployment in the management cluster, so every change to its configuration is a PR, its availability is owned by the same team that owns the infrastructure, and there is no external SaaS in the critical path for provisioning changes. The platform team owns Atlantis uptime — at this team size, that tradeoff is acceptable for the cost savings and the ability to run custom workflow steps (specifically Conftest) before apply.

  3. Conftest in the Atlantis workflow over post-apply policy auditing: violations caught before apply cannot become infrastructure that needs to be cleaned up. Running Conftest against the Terraform plan JSON blocks a tagging violation or a publicly exposed S3 bucket before it exists in AWS. This was the direct operational response to the terraform destroy incident — workspace isolation prevents targeting the wrong account, and Conftest prevents resource misconfiguration within the correct one.

  4. Flagger with the AWS Load Balancer Controller ingress integration over Istio: at this team size the operational surface of a service mesh — sidecar injection, mTLS policy, control plane upgrades — would consume more platform engineering capacity than it would return in capability. Flagger's ALB-based traffic splitting delivers canary analysis without a mesh. The CanaryPolicy CRD schema can be extended to support more complex routing if requirements evolve, without changing the service team interface.

  5. CanaryPolicy CRD over per-service Flagger Canary configuration: giving each of 20 services its own Flagger Canary object directly produces 20 different analysis intervals, 20 different metric query styles, and no enforcement of a company-wide minimum analysis window. The operator translates a service-team-facing SLO declaration into a Flagger object with platform-enforced parameters. Service teams configure what matters to them; the platform enforces how the analysis runs.

  6. TenantWorkspace CRD over a self-service portal or Helm chart per namespace: a portal requires maintaining a separate service with its own API, deployment, and failure mode — disproportionate for a 100-person company. A Helm chart per namespace means platform engineers are manually running Helm installs. The operator keeps provisioning in the GitOps pipeline: the request is a PR, the execution is idempotent, and the state is inspectable via kubectl get tenantworkspace.

  7. IRSA over node IAM roles for pod-level AWS access: a node IAM role grants every pod on the node the same permissions, so a single compromised pod can reach every AWS service the node role permits. IRSA binds an IAM role to a specific Kubernetes service account via the cluster OIDC provider, scoping permissions to what that service actually needs. The implementation cost per service is one annotated service account and one IAM role.
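The enforcement described in decision 5 is, at its core, a merge where platform defaults win. A minimal sketch, with illustrative field names and an assumed 60-second floor on the analysis interval (the source states a minimum window exists but not its value):

```go
package main

import "fmt"

// SLODeclaration is what a service team writes in a CanaryPolicy.
// Fields are illustrative, not the real CRD schema.
type SLODeclaration struct {
	MaxErrorRatePercent float64
	MaxP99Millis        int
	IntervalSeconds     int // requested analysis interval
}

// CanaryAnalysis is the shape handed to Flagger. StepWeight, the smoke
// test, and the interval floor are platform-enforced: teams cannot
// override them.
type CanaryAnalysis struct {
	IntervalSeconds int
	StepWeight      int  // traffic shift increment, fixed at 10%
	SmokeTestFirst  bool // mandatory pre-traffic smoke test
	MaxErrorRate    float64
	MaxP99Millis    int
}

// minIntervalSeconds is an assumed value for the company-wide floor.
const minIntervalSeconds = 60

// translate applies platform-enforced defaults on top of the service
// team's SLO declaration, mirroring what the CanaryPolicy operator does
// when it emits a Flagger Canary object.
func translate(slo SLODeclaration) CanaryAnalysis {
	interval := slo.IntervalSeconds
	if interval < minIntervalSeconds {
		interval = minIntervalSeconds // enforce the minimum analysis window
	}
	return CanaryAnalysis{
		IntervalSeconds: interval,
		StepWeight:      10,
		SmokeTestFirst:  true,
		MaxErrorRate:    slo.MaxErrorRatePercent,
		MaxP99Millis:    slo.MaxP99Millis,
	}
}

func main() {
	// A team asking for a 15-second interval still gets the floor.
	out := translate(SLODeclaration{MaxErrorRatePercent: 1, MaxP99Millis: 200, IntervalSeconds: 15})
	fmt.Println(out.IntervalSeconds, out.StepWeight, out.SmokeTestFirst)
}
```

The SLO thresholds pass through untouched; only the mechanics of how the analysis runs are pinned by the platform.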

Outcomes

  • The terraform destroy incident that triggered the platform investment has not recurred. Workspace isolation prevents cross-account targeting; the Atlantis plan review gate means destructive changes require explicit approval before apply. Two subsequent attempts by engineers to reproduce the original conditions in sandbox were blocked at plan review.
  • Production deployments run through Flagger canary analysis for all services. Two automated rollbacks have occurred — one catching a latency regression from a slow database query, one catching an error rate increase from a misconfigured feature flag — before either reached full traffic. Neither required a human to initiate the rollback.
  • Peak production throughput of around 5K requests per second at p99 latency of 28ms, down from approximately 140ms before Karpenter-based right-sizing replaced the previously over-provisioned fixed node groups. The latency improvement came from better pod placement and instance type selection, not application changes.
  • Annual compute spend reduced by approximately $124K through Spot adoption. Karpenter manages 54% of production node-hours on Spot; measured Spot-caused disruptions — pods preempted and rescheduled — account for under 0.4% of total production requests.
  • Namespace provisioning via TenantWorkspace dropped from a two-to-three-day ticket response to under two minutes from CR submission to a running namespace with RBAC and quota. Over 30 namespaces have been provisioned across all environments since the operator launched, all without direct platform team involvement.
  • Five static IAM access keys that existed across GitHub secrets, developer laptops, and a shared secrets manager have been fully eliminated. OIDC federation for GitHub Actions and IRSA for EKS service accounts cover all credential needs. Key deletion was confirmed via a CloudTrail query across all seven accounts.
  • OPA Gatekeeper constraints have blocked twelve non-compliant resource submissions since deployment — unencrypted volumes, missing cost-allocation tags, and an overly permissive ClusterRoleBinding submitted during a rapid incident response. None required platform team involvement to catch.
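The credential scoping in the static-key outcome hinges on the subject condition in each IRSA role's trust policy: the role is assumable via sts:AssumeRoleWithWebIdentity only when the OIDC token's sub claim names one specific service account. The claim follows a fixed format; a small sketch with a hypothetical namespace and service account:

```go
package main

import "fmt"

// irsaSubject builds the OIDC subject claim an IRSA trust policy matches
// with a StringEquals condition, restricting role assumption to exactly
// one Kubernetes service account. The condition key pairs this value
// with the cluster's OIDC provider host ("<provider-host>:sub").
func irsaSubject(namespace, serviceAccount string) string {
	return fmt.Sprintf("system:serviceaccount:%s:%s", namespace, serviceAccount)
}

func main() {
	// Hypothetical workload: the payments-api service account in the
	// payments namespace gets its own role, nothing broader.
	fmt.Println(irsaSubject("payments", "payments-api"))
}
```

Because the condition names a single namespace and service account, a compromised pod elsewhere in the cluster cannot assume the role even though it runs on the same nodes.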

Lessons Learned

  • Terragrunt dependency graphs are only as useful as the person who wrote them understood the provisioning order. The initial graph was missing the dependency between the iam-baseline module and the EKS module, which produced a cluster with an incomplete IAM configuration on first apply. The fix was straightforward; finding it required tracing a node registration failure to a Terraform-level gap, which took longer than expected because the error surfaced at the node level rather than during plan.
  • Flagger's automated rollback is only as reliable as the Prometheus queries it evaluates. One of the two correct rollbacks was nearly a false positive — a p99 spike caused by a cold-start penalty on the first deployment of a newly provisioned pod. The fix was a warmup delay in the canary analysis configuration. Canary metric queries need the same review discipline as the application code they evaluate; boilerplate queries produce false rollbacks that erode trust in the automation.
  • The TenantWorkspace admission webhook catches invalid specs before they reach the reconcile loop, which is the right design. What it does not catch is a spec that is syntactically valid but semantically wrong — a team name referencing an IAM group that no longer exists, for example. The reconcile loop failed silently on those cases for several days before alerting was added for reconciliation errors. Operator health metrics and reconcile failure alerts are not optional.
  • IRSA at scale means one IAM role per service account, which multiplies the number of IAM resources under management quickly. At 20 services across three clusters with multiple service accounts each, IAM role management became an Atlantis plan bottleneck — changes that touched a shared IAM boundary required re-planning a large portion of the state. The solution was splitting IAM role management into its own Terragrunt module with its own state file, isolating those plans from VPC and EKS changes.
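The cold-start lesson above reduces to where the analysis window starts. A minimal sketch of the warmup fix, with invented sample values (the real implementation is a delay in the Flagger analysis configuration, not application code):

```go
package main

import "fmt"

// breachesSLO evaluates p99 latency samples against a threshold,
// ignoring the first warmupSamples readings. Skipping the warmup window
// is the fix applied after the cold-start near-false-positive; all
// values here are illustrative.
func breachesSLO(p99Millis []int, thresholdMillis, warmupSamples int) bool {
	for i, v := range p99Millis {
		if i < warmupSamples {
			continue // cold-start penalty: not a regression signal
		}
		if v > thresholdMillis {
			return true
		}
	}
	return false
}

func main() {
	samples := []int{480, 35, 31, 29} // first reading is a cold start
	fmt.Println(breachesSLO(samples, 100, 0))
	fmt.Println(breachesSLO(samples, 100, 1))
}
```

With no warmup the cold-start sample would have triggered a full rollback of a healthy release; with one warmup sample excluded, the same series passes.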
