EKS in Production: Operator Patterns and the Path to Writing Your Own

8 min read

Practical lessons from running EKS at scale — when to use existing Kubernetes operators, when to build custom ones, and the architecture patterns that keep clusters manageable as complexity grows.

infrastructure · kubernetes · eks · operators · platform-engineering


Running EKS in production teaches you things that no tutorial covers. You learn that the hard part isn't deploying a cluster — it's keeping it healthy, upgradeable, and understandable six months later when the engineer who set it up has moved teams. You learn that operators are the mechanism Kubernetes provides for encoding that operational knowledge, and that choosing the right operator strategy is one of the highest-leverage decisions a platform team makes.

This post covers the operator patterns I've used across multiple EKS production environments, when off-the-shelf operators are the right call, when you need a custom one, and how to build one without creating a maintenance nightmare.

EKS Foundation: Decisions That Compound

Before talking about operators, we need to talk about the EKS decisions that constrain everything downstream.

Managed node groups vs. Karpenter

This is the first fork in the road, and it determines your scaling architecture.

Managed node groups are the conservative choice. AWS handles the AMI lifecycle, drain logic, and ASG integration. They work well if your workloads are relatively homogeneous and you can tolerate some bin-packing inefficiency.

Karpenter is the right choice when you have heterogeneous workloads (GPU jobs alongside web services), when you need fast scale-up (managed node groups can take 3-5 minutes; Karpenter often provisions in under 60 seconds), or when you care about cost optimization through right-sizing.

In practice, most teams I've worked with end up on Karpenter once they're past 50 nodes. The operational overhead is lower than you'd expect — Karpenter's NodePool and EC2NodeClass CRDs are well-designed and the drift detection is solid.
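
To make this concrete, here is a sketch of a Karpenter NodePool for general-purpose workloads. Field names track the Karpenter v1 API and the values (pool name, CPU limit, the `default` EC2NodeClass) are illustrative; check them against the Karpenter version you run:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose        # illustrative pool name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # assumes an EC2NodeClass named "default" exists
  limits:
    cpu: "1000"                  # cap total provisioned CPU for this pool
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

The `disruption` block is what drives the cost optimization mentioned above: Karpenter consolidates underutilized nodes instead of leaving them running.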

Add-on management strategy

EKS add-ons (CoreDNS, kube-proxy, VPC CNI, EBS CSI) can be managed by AWS or self-managed. My recommendation:

  • AWS-managed add-ons for anything you don't need to customize. Let AWS handle the upgrade lifecycle.
  • Self-managed only when you need configuration that the managed add-on doesn't expose (e.g., custom VPC CNI settings for secondary CIDR ranges or prefix delegation).

The mistake I see teams make is self-managing everything "for control." That control comes with an upgrade burden that compounds every EKS version bump.
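
As a sketch of what "let AWS handle it" looks like in practice, here is an eksctl ClusterConfig fragment declaring managed add-ons. The cluster name and region are placeholders, and `version: latest` pins are a simplification; in production you would pin explicit versions:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod-cluster      # hypothetical cluster name
  region: us-east-1       # hypothetical region
addons:
  - name: vpc-cni
    version: latest
  - name: coredns
    version: latest
  - name: kube-proxy
    version: latest
  - name: aws-ebs-csi-driver
    version: latest
```

With this in place, `eksctl update addon` moves add-ons forward without you templating their manifests yourself.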

The Operator Landscape: What to Use and When

Kubernetes operators encode operational knowledge into software. Instead of a runbook that says "when X happens, do Y," an operator watches for condition X and performs action Y automatically. The operator pattern is a controller loop:

Observe (current state) → Compare (desired vs. actual) → Act (reconcile)
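
Stripped of Kubernetes machinery, one pass of that loop is a pure decision: compare desired state to observed state and pick an action. The function and type names below are illustrative, not from any real controller:

```go
package main

import "fmt"

// Action is what one pass of the control loop decides to do.
type Action string

const (
	ActionNone      Action = "none"       // desired == actual, nothing to reconcile
	ActionScaleUp   Action = "scale-up"   // actual < desired
	ActionScaleDown Action = "scale-down" // actual > desired
)

// reconcileReplicas is one observe -> compare -> act pass for a replica count.
func reconcileReplicas(desired, actual int) Action {
	switch {
	case actual < desired:
		return ActionScaleUp
	case actual > desired:
		return ActionScaleDown
	default:
		return ActionNone
	}
}

func main() {
	fmt.Println(reconcileReplicas(3, 1)) // observed 1, want 3: scale-up
}
```

Everything an operator does is a richer version of this comparison, re-run every time the watched state changes.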

Tier 1: Operators you almost certainly need

These are battle-tested, well-maintained, and solve problems that almost every EKS cluster hits.

cert-manager — TLS certificate lifecycle. Handles Let's Encrypt and ACM PCA issuance, renewal, and distribution. Without it, you're writing cron jobs to rotate certs — and you'll forget one.

external-dns — Syncs Kubernetes Service/Ingress resources to Route53 (or other DNS providers). The alternative is manual DNS record management, which is a guaranteed source of drift.
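
The hostname annotation is the usual integration point. A sketch, with a hypothetical service and domain:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api                      # hypothetical service
  annotations:
    # external-dns watches this annotation and reconciles the DNS record
    external-dns.alpha.kubernetes.io/hostname: api.example.com
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
    - port: 443
      targetPort: 8443
```

Delete the Service and external-dns removes the record too, which is exactly the drift protection manual DNS management lacks.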

external-secrets-operator — Syncs secrets from AWS Secrets Manager or SSM Parameter Store into Kubernetes Secrets. This keeps your secrets management centralized in AWS while making them available to pods transparently.
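
A sketch of an ExternalSecret pulling a database password from Secrets Manager. The store name, secret path, and key are placeholders, and the API version should be checked against your installed operator:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: orders-db-credentials      # hypothetical
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager      # assumes a ClusterSecretStore with this name
    kind: ClusterSecretStore
  target:
    name: orders-db-credentials    # the Kubernetes Secret the operator creates
  data:
    - secretKey: password
      remoteRef:
        key: prod/orders-db        # path in AWS Secrets Manager
        property: password
```

Pods mount the resulting Secret as usual; rotation in Secrets Manager propagates on the next refresh interval.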

AWS Load Balancer Controller — Required for ALB/NLB integration with EKS. Manages Ingress and Service resources to provision and configure AWS load balancers.

Tier 2: Operators that earn their place under specific conditions

Argo CD — GitOps-based continuous delivery. Worth the operational overhead if your team has more than 2-3 people deploying to the cluster. Provides audit trail, drift detection, and rollback. Overkill for a single-team cluster.

Prometheus Operator — If you're running Prometheus in-cluster (as opposed to AWS Managed Prometheus), this operator makes ServiceMonitor and PodMonitor CRDs available for declarative scrape configuration. Significantly better than maintaining a giant prometheus.yml.

Kyverno or OPA Gatekeeper — Policy enforcement. Choose Kyverno if your team prefers YAML-native policies. Choose Gatekeeper if you need Rego's expressiveness. Either way, you need something enforcing policies like "no pods running as root" and "all images from approved registries."
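
For the "no pods running as root" example, a Kyverno policy sketch looks like the following. The policy name is illustrative, and note that `validationFailureAction` casing has changed across Kyverno versions:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-run-as-nonroot   # hypothetical policy name
spec:
  validationFailureAction: Enforce   # reject violating pods at admission
  rules:
    - name: check-runasnonroot
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Pods must set securityContext.runAsNonRoot: true"
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
```

Start with `Audit` instead of `Enforce` on an existing cluster so you can see what would break before you start rejecting workloads.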

Tier 3: When off-the-shelf isn't enough

This is where custom operators enter the picture. You need a custom operator when:

  1. Your reconciliation logic is domain-specific. No community operator handles your internal deployment topology, database provisioning workflow, or multi-tenant resource quota system.

  2. You're gluing multiple systems together. You need a controller that watches a Kubernetes CRD and reconciles state in an external system (an internal API, a legacy database, a proprietary service mesh).

  3. You need tighter control loops than Helm/Kustomize provide. Helm can template resources, but it can't react to runtime state changes. If you need "when pod X is unhealthy for 5 minutes, automatically failover to standby Y," that's an operator.

Building a Custom Operator: Architecture and Decisions

Framework selection

There are three serious options:

Kubebuilder (Go) — The Kubernetes project's official scaffolding tool. Generates controller-runtime-based projects. Best if your team knows Go and you want the closest-to-upstream experience. The generated code is idiomatic and the ecosystem (controller-runtime, client-go) is where Kubernetes itself lives.

Operator SDK (Go/Ansible/Helm) — Built on top of Kubebuilder with additional scaffolding for OLM (Operator Lifecycle Manager) integration. Use it if you plan to distribute your operator via OLM. The Go path is essentially Kubebuilder with extras.

kopf (Python) — Kubernetes Operator Pythonic Framework. Viable if your team is Python-first and the operator's reconciliation logic calls Python-native libraries (ML pipelines, data processing). Be aware that Python operators have higher memory overhead and slower startup than Go operators.

My recommendation: Kubebuilder (Go) unless you have a strong reason to use something else. The Go ecosystem for Kubernetes is vastly more mature, and you'll find more examples, more debugging resources, and better library support.

CRD design principles

Your Custom Resource Definition is your operator's API. Treat it with the same care you'd treat a public API.

Spec vs. Status separation. The spec is the user's desired state. The status is the operator's observed state. Never write to spec from the operator. Never let users write to status.

apiVersion: platform.example.com/v1alpha1
kind: DatabaseCluster
metadata:
  name: orders-db
  namespace: production
spec:
  engine: postgres
  version: "15.4"
  replicas: 3
  storage:
    size: 100Gi
    storageClass: gp3-encrypted
  backup:
    schedule: "0 2 * * *"
    retention: 30d
status:
  phase: Running
  replicas:
    ready: 3
    total: 3
  lastBackup: "2026-03-09T02:00:00Z"
  conditions:
    - type: Available
      status: "True"
      lastTransitionTime: "2026-03-08T14:30:00Z"
    - type: BackupHealthy
      status: "True"
      lastTransitionTime: "2026-03-09T02:05:00Z"

Use conditions, not phase enums, for complex state. A single phase field (Pending/Running/Failed) can't represent "the cluster is running but backups are failing." Conditions let you represent multiple orthogonal state dimensions.

Version your API. Start with v1alpha1. Graduate to v1beta1 when the API is stable for internal use. Graduate to v1 when you're confident in backward compatibility. Kubernetes has built-in conversion webhook support for migrating between versions.

The reconciliation loop

The reconciler is the core of your operator. Here's the pattern I follow:

func (r *DatabaseClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    var cluster v1alpha1.DatabaseCluster
    if err := r.Get(ctx, req.NamespacedName, &cluster); err != nil {
        // Deleted between the watch event and this fetch: nothing to do.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    log.Info("reconciling", "generation", cluster.Generation)

    // Deletion in progress: run finalizer cleanup, skip normal reconciliation.
    if !cluster.DeletionTimestamp.IsZero() {
        return r.handleDeletion(ctx, &cluster)
    }

    if err := r.ensureFinalizer(ctx, &cluster); err != nil {
        return ctrl.Result{}, err
    }
    // Reconcile sub-resources in dependency order
    if err := r.reconcileStorage(ctx, &cluster); err != nil {
        return r.setConditionAndRequeue(ctx, &cluster, "StorageReady", false, err)
    }

    if err := r.reconcileStatefulSet(ctx, &cluster); err != nil {
        return r.setConditionAndRequeue(ctx, &cluster, "Available", false, err)
    }

    if err := r.reconcileBackupCronJob(ctx, &cluster); err != nil {
        return r.setConditionAndRequeue(ctx, &cluster, "BackupHealthy", false, err)
    }

    return r.setHealthyStatus(ctx, &cluster)
}

Key patterns in this code:

  1. Finalizers for cleanup. If your operator creates external resources (AWS resources, DNS records, etc.), you must use finalizers to clean them up on deletion. Without finalizers, deleting the CR leaves orphaned resources.

  2. Dependency-ordered reconciliation. Reconcile sub-resources in the order they depend on each other. Storage before StatefulSet. StatefulSet before backup jobs.

  3. Condition-based status reporting. Each reconciliation step updates a specific condition. This gives users and monitoring systems fine-grained visibility into what's working and what isn't.

  4. Requeue on transient failures. Return ctrl.Result{RequeueAfter: time.Minute} for transient errors. Return ctrl.Result{} with an error only for unexpected failures that should trigger exponential backoff.

Testing your operator

Operator testing is notoriously difficult because you're testing interactions with the Kubernetes API server. Here's the testing pyramid I use:

Unit tests — Test pure logic functions (e.g., "given this spec, generate this StatefulSet manifest"). No API server needed. Fast, reliable, and should cover edge cases.


Integration tests with envtest — controller-runtime provides envtest, which runs a real etcd and API server locally. Your reconciler runs against it. This catches issues like RBAC misconfigurations, status update conflicts, and watch event handling. Slower than unit tests but essential.

End-to-end tests in a real cluster — Use a Kind or k3d cluster in CI. Deploy the operator and create test CRs. Verify that sub-resources are created, status is updated, and deletion cleans up. These are slow and flaky — keep the suite small and focused on critical paths.

Operational Patterns for EKS + Operators

Upgrade strategy

EKS upgrades are the most anxiety-inducing maintenance task. Here's how to make them predictable:

  1. Test in a staging cluster first. Always. EKS version bumps can break operator compatibility.
  2. Upgrade add-ons and operators before upgrading the control plane. Most operators are backward-compatible with older API versions but may break with newer ones if they haven't been updated.
  3. Use PodDisruptionBudgets. Every stateful workload should have a PDB. Without one, node upgrades will drain all pods simultaneously.
  4. Pin operator versions. Don't auto-upgrade operators. Pin to specific versions in your GitOps manifests and upgrade deliberately.
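
A minimal PDB for the stateful workload in point 3, with illustrative names:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-db-pdb        # hypothetical
  namespace: production
spec:
  minAvailable: 2            # with 3 replicas, node drains evict one pod at a time
  selector:
    matchLabels:
      app: orders-db
```

During an EKS node upgrade, the drain respects this budget, so quorum-based workloads never lose a majority at once.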

Multi-tenant isolation

If your EKS cluster serves multiple teams:

  • Namespace-per-team with ResourceQuotas and LimitRanges.
  • NetworkPolicies to isolate namespace traffic (default-deny, then explicitly allow).
  • Kyverno/Gatekeeper policies to prevent privilege escalation, enforce image sources, and block host networking.
  • Hierarchical namespaces (via the HNC operator) if teams need sub-teams with inherited policies.
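
The default-deny starting point from the list above is a one-screen manifest. The namespace name is a placeholder:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a          # hypothetical team namespace
spec:
  podSelector: {}            # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Apply this first, then add explicit allow policies per service; anything you forget to allow fails loudly instead of leaking traffic silently.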

Observability for operators

Your custom operator needs the same observability as any production service:

  • Structured logging with reconciliation context (namespace, name, generation).
  • Prometheus metrics for reconciliation duration, error rate, and queue depth (controller-runtime exposes these by default — make sure you're scraping them).
  • Alerts on reconciliation failures that persist beyond the retry backoff (if a resource has been in a failed condition for 15 minutes, page someone).
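
If you run the Prometheus Operator, the persistent-failure alert from the last bullet can be sketched as a PrometheusRule. The rule and label names are illustrative; the metric name is the one controller-runtime exposes by default:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: operator-alerts          # hypothetical
  namespace: monitoring
spec:
  groups:
    - name: operator.rules
      rules:
        - alert: OperatorReconcileErrors
          expr: rate(controller_runtime_reconcile_errors_total[5m]) > 0
          for: 15m               # only page when failures persist past retries
          labels:
            severity: page
          annotations:
            summary: "Operator reconciliation has been failing for 15 minutes"
```

The `for: 15m` clause is what distinguishes a transient blip absorbed by backoff from a stuck reconciler that needs a human.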

Pitfalls from the Field

1. Reconciliation thundering herd. If your operator watches a CRD with 500 instances and you deploy a new version, all 500 reconciliations fire simultaneously. Use MaxConcurrentReconciles to limit parallelism and add jitter to requeue intervals.

2. Status update conflicts. The Kubernetes API uses resource versions for optimistic concurrency. If your reconciler reads a resource, reconciles for 30 seconds, then tries to update status, the resource version may have changed. Always re-fetch before status updates, or go through the status subresource client (r.Status().Update(ctx, &cluster)), which conflicts only on changes to the status subresource, and retry on conflict.

3. Finalizer deadlocks. If your finalizer depends on an external service that's down, the CR can't be deleted. Implement a timeout: if cleanup hasn't succeeded after N attempts, log an error, remove the finalizer, and let the CR be garbage-collected. Alert on this — it means orphaned resources.

4. CRD schema evolution. Adding a required field to your CRD breaks existing resources. Always add new fields as optional with sensible defaults. Use webhook defaulting to backfill values on existing resources.

5. RBAC scope creep. Operators need RBAC permissions, and it's tempting to grant cluster-admin. Don't. Follow least-privilege: the operator should only have permissions on the specific resources it manages. Review RBAC rules during code review.

Readiness Checklist

Before promoting a custom operator to production:

  • [ ] CRD has validation webhooks (reject invalid specs early, not during reconciliation)
  • [ ] Defaulting webhook sets sensible defaults for optional fields
  • [ ] Finalizers implemented for all external resource cleanup
  • [ ] PodDisruptionBudget on the operator deployment itself
  • [ ] Leader election enabled (only one operator instance reconciles at a time)
  • [ ] Metrics endpoint exposed and scraped by monitoring
  • [ ] Runbook written for: operator crash loop, reconciliation stuck, finalizer deadlock
  • [ ] Load tested with 10x expected CR count to verify reconciliation throughput
  • [ ] Upgrade path tested: v1alpha1 → v1beta1 conversion webhook works
  • [ ] RBAC reviewed: no unnecessary cluster-wide permissions

Where This Is Heading

The Kubernetes operator ecosystem is maturing fast. Crossplane is blurring the line between operators and infrastructure-as-code. Cluster API is making cluster lifecycle itself operator-managed. The pattern of "encode operational knowledge in a control loop" is winning because it's fundamentally how distributed systems maintain consistency.

For platform teams on EKS, the practical advice is: lean heavily on community operators for well-solved problems, invest in custom operators only for your domain-specific operational logic, and treat your operator codebase with the same rigor as any production service — because that's exactly what it is.