Blog
Practical writing on AI engineering, infrastructure, backend systems, and production lessons learned.
154 posts found
Archive
Browse the archive page by page for a faster, cleaner reading experience.
147 posts total
Kubernetes Deployments: Rolling Updates, Readiness Probes, and Why Rollbacks Fail Silently
2 min read • Cloud Infrastructure
A Deployment manages the lifecycle of stateless pods. The rolling update strategy, readiness probe integration, and PodDisruptionBudgets are the knobs that determine whether your deploy causes downtime, and whether a rollback actually works.
aws, eks, kubernetes, deployment

Kubernetes Workload Resources: When to Use Deployments, StatefulSets, DaemonSets, and Jobs
2 min read • Cloud Infrastructure
Kubernetes has six workload resource types. Each represents a different scheduling and lifecycle contract. Choosing the wrong one (a Deployment for a database, a StatefulSet for a stateless API) creates subtle operational problems that don't surface until scale or failure.
aws, eks, kubernetes

Kubernetes Pods: Resource Requests vs Limits, QoS Classes, and Sidecar Patterns
3 min read • Cloud Infrastructure
A pod is the unit of scheduling in Kubernetes. Resource requests tell the scheduler where to place a pod. Resource limits enforce runtime caps. The relationship between requests and limits determines the pod's Quality of Service class, which in turn determines which pods get killed first under memory pressure.
aws, eks, kubernetes

Pod Scheduling: Node Affinity, Tolerations, and Security Contexts
2 min read • Cloud Infrastructure
The pod spec controls more than containers and volumes. Node affinity constrains where pods are scheduled. Tolerations allow pods to land on tainted nodes. Security contexts define the privilege model. These settings live in the pod template and apply to every pod created by a workload resource.
aws, eks, kubernetes

ReplicaSets and Deployments: Why You Almost Never Create a ReplicaSet Directly
2 min read • Cloud Infrastructure
A ReplicaSet keeps N pod replicas running. A Deployment manages ReplicaSets and adds rolling updates, rollback history, and update strategies on top. You almost always create Deployments, not ReplicaSets, but understanding the relationship between them explains what happens during every deployment.
aws, eks, kubernetes

Kubernetes StatefulSets: Stable Identity, Persistent Storage, and When to Use Them
3 min read • Cloud Infrastructure
StatefulSets give each pod a stable name, a stable DNS record, and a persistent volume that follows it across reschedules. That stability is what databases and clustered applications need, and what makes StatefulSets harder to operate than Deployments.
aws, eks, kubernetes, stateful

FSx Storage Options: Lustre for HPC, Windows for SMB, and ONTAP for Enterprise NAS
3 min read • Cloud Infrastructure
Amazon FSx provides four managed file systems: Lustre (high-performance parallel I/O), Windows File Server (SMB/AD integration), NetApp ONTAP (enterprise NAS features), and OpenZFS. Choosing correctly depends on I/O pattern, client OS, and whether you need Windows-native features.
aws, fsx, storage, hpc

IAM Roles in Terraform: Trust Policies, Inline Policies, and Common Patterns
2 min read • Cloud Infrastructure
Creating IAM roles in Terraform requires understanding four separate resources: the trust policy (who can assume), the role itself, the permission policy (what it can do), and the attachment linking them. Getting the trust policy wrong means the role exists but can't be assumed.
aws, iam, terraform

IAM Policies: Trust Policy vs Permission Policy, Conditions, and When Managed Policies Break
3 min read • Cloud Infrastructure
IAM has two distinct policy types that serve completely different purposes. Trust policies control who can assume a role. Permission policies control what that role can do. Mixing them up, or failing to understand condition keys, produces access errors that are hard to debug.
aws, iam, security

AWS KMS Key Rotation: What Actually Happens and What Doesn't
3 min read • Cloud Infrastructure
KMS automatic key rotation creates new key material but doesn't re-encrypt existing data. Understanding how key versions work, and what rotation actually protects against, prevents misconfigured key management in production.
aws, kms, security, encryption

Lambda Execution Model: Handlers, Context, Cold Starts, and the Execution Environment
2 min read • Cloud Infrastructure
A Lambda function is a handler function inside an execution environment. The environment is reused across invocations when one is available (warm starts) but is not guaranteed to exist, in which case a new one must be initialized (cold starts). Understanding the execution environment lifecycle (init, invoke, shutdown) explains why you initialize clients outside the handler and why cold start latency varies by runtime and memory.
aws, lambda, serverless

Custom Domains for Lambda Function URLs: Why CloudFront Is Required
2 min read • Cloud Infrastructure
Lambda Function URLs come with an auto-generated domain that rejects requests with wrong Host headers. You can't point a CNAME directly to one. CloudFront is the only way to serve a Lambda Function URL under a custom domain: it rewrites the Host header to the function URL domain before forwarding.
aws, lambda, cloudfront, route53

Lambda Layers: Shared Dependencies Without Bloated Deployment Packages
2 min read • Cloud Infrastructure
Lambda layers let multiple functions share libraries, runtimes, and configuration without bundling duplicates into each deployment package. Understanding layer mounting, version pinning, and the 250 MB limit saves debugging time when layers behave unexpectedly.
aws, lambda, serverless

Application Load Balancer: Listener Rules, Target Groups, and Health Check Configuration
2 min read • Cloud Infrastructure
ALB routes HTTP/HTTPS traffic using content-based rules: path, hostname, headers, query strings. The listener evaluates rules in priority order and forwards to a target group. Health check configuration determines when targets are considered unhealthy and removed from rotation.
aws, load-balancer, alb

ALB Port Configuration: Listener Port, Target Group Port, and Per-Instance Port Overrides
2 min read • Cloud Infrastructure
An ALB has four port-related settings: the listener port, target group default port, per-instance port, and health check port. They're independent and serve different purposes. The target group port acts as a default that per-instance overrides replace, which is how ECS dynamic port mapping works.
aws, load-balancer, alb, ecs

ALB Sticky Sessions: Duration-Based vs Application-Based Cookies and When to Avoid Them
3 min read • Cloud Infrastructure
Sticky sessions bind a client to a specific target for the duration of their session. ALB offers two mechanisms: duration-based (ALB-managed cookie) and application-based (your app sets the cookie). Both work, but stickiness hides state management problems; stateless backends almost always serve you better.
aws, load-balancer, alb, sessions

Route 53 Record Types: A, AAAA, CNAME, Alias, MX, TXT, and When to Use Each
2 min read • Cloud Infrastructure
DNS record types serve different purposes. A/AAAA map names to IP addresses. CNAME maps one name to another. Route 53 Alias records extend A/AAAA to point at AWS resources without a CNAME restriction. Understanding when to use Alias vs CNAME, and why CNAME at zone apex is prohibited, prevents common DNS configuration mistakes.
aws, route53, dns

Route 53 Routing Policies: Weighted, Latency, Failover, Geolocation, and Health Checks
2 min read • Cloud Infrastructure
Route 53 routing policies control how DNS responses are selected when multiple records exist for the same name. Weighted enables gradual traffic shifting. Latency routes to the lowest-latency region. Failover routes to a secondary when health checks fail. Each policy pairs with health checks differently.
aws, route53, dns

Route 53 Subdomain Delegation: Cross-Account DNS and ACM Certificate Verification
2 min read • Cloud Infrastructure
Subdomain delegation hands off DNS authority for a subdomain to a separate hosted zone, in the same or a different AWS account. The child zone's name servers go into the parent zone as NS records. Certificate validation for the delegated subdomain happens in the child zone, not the parent.
aws, route53, dns, acm

S3 Bucket Security: Block Public Access, Bucket Policies, and Object Ownership
2 min read • Cloud Infrastructure
S3 has two separate access control systems: ACLs (legacy, object-level) and bucket policies (IAM-style, resource-based). Block Public Access settings act as an override that supersedes both. Understanding how the four Block Public Access flags interact prevents accidental public exposure and unexplained access denials.
aws, s3, security, iam

S3 Glacier Storage Classes: Retrieval Tradeoffs, Lifecycle Policies, and Vault Lock
3 min read • Cloud Infrastructure
S3 Glacier is not a single service; it's three storage classes with very different retrieval times and costs. Choosing between Instant Retrieval, Flexible Retrieval, and Deep Archive determines whether you wait milliseconds or 12 hours to access your data.
aws, s3, s3 glacier, storage

Secrets Manager vs SSM Parameter Store: Rotation, Retrieval, and When to Use Each
2 min read • Cloud Infrastructure
Secrets Manager and SSM Parameter Store both store secrets, but they solve different problems. Secrets Manager supports automatic rotation with Lambda. Parameter Store is cheaper for configuration values that don't rotate. Understanding retrieval patterns (SDK, environment injection, Terraform) prevents secrets from landing in plaintext where they shouldn't.
aws, secret manager, ssm, security

SSM Agent: Session Manager, Run Command, and Replacing Bastions
2 min read • Cloud Infrastructure
SSM Agent replaces SSH and bastion hosts for EC2 access. It runs as root, which matters for Run Command blast radius. Session Manager requires no inbound ports, just the agent, an instance profile, and SSM VPC endpoints for private instances. Patch Manager and State Manager extend it beyond interactive access.
aws, ssm, ec2, security

AWS Storage Gateway: File, Volume, and Tape Gateway for Hybrid Storage
2 min read • Cloud Infrastructure
Storage Gateway bridges on-premises environments and AWS storage. File Gateway exposes S3 as NFS/SMB shares. Volume Gateway provides iSCSI block storage backed by S3 snapshots. Tape Gateway replaces physical tape libraries with VTL. Each type has different caching behavior and failure modes when the WAN link degrades.
aws, storage gateway, s3, hybrid cloud