High-Concurrency System Design: The Tradeoffs Nobody Warns You About

A practical breakdown of the architectural tradeoffs in high-concurrency backend systems — covering connection management, backpressure, consistency models, and the decisions that determine whether your system degrades gracefully or falls over at scale.

backend · system-design · concurrency · architecture · distributed-systems

Every system handles 100 requests per second. The interesting engineering starts when you need to handle 100,000 — and the truly hard problems emerge when you need to handle 100,000 while maintaining correctness guarantees that your business depends on.

This post covers the concrete tradeoffs I've faced designing and operating high-concurrency backend systems. Not the whiteboard version — the version where you're staring at a dashboard at 2 AM watching p99 latency climb and trying to figure out which tradeoff to make right now.

The Fundamental Tension

High-concurrency system design is ultimately about managing a three-way tension:

        Throughput
           ▲
          / \
         /   \
        /     \
       /       \
Latency ─────── Correctness

You can optimize for any two, but the third will fight you. Most production systems need acceptable levels of all three, and the art is finding the right balance for your specific workload.

Let me make this concrete with the decisions you'll face at each layer.

Connection Management: The First Bottleneck

Before your application logic even runs, you need to accept and manage connections. This is where most systems hit their first concurrency ceiling.

Thread-per-connection vs. event loop

The thread-per-connection model (traditional Java servlet containers, Ruby on Rails with Puma) assigns one OS thread per client connection. Simple to reason about, but each thread costs 1-8 MB of stack memory. At 10,000 concurrent connections, you're consuming 10-80 GB just for thread stacks — before any application state.

The event loop model (Node.js, Rust's tokio) and the closely related lightweight-thread model (Go's goroutines) multiplex many connections onto a small pool of OS threads. A goroutine starts with a ~2 KB stack that grows on demand, so a single machine with a few gigabytes of memory can plausibly sustain a million concurrent connections.

The tradeoff isn't just memory. Thread-per-connection gives you preemptive multitasking — a slow handler can't starve other requests because the OS scheduler will preempt it. Event loops give you cooperative multitasking — a single blocking call in your handler (a synchronous database query, a CPU-intensive computation) blocks the entire event loop.

What I've seen work:

  • Go for services that mix I/O and CPU work. Goroutines are cheap, the scheduler is preemptive, and the runtime handles multiplexing.
  • Node.js/Rust+tokio for I/O-bound services (API gateways, proxy layers) where you can guarantee that all operations are non-blocking.
  • Java with virtual threads (Project Loom) if you're in a Java ecosystem. Virtual threads give you thread-per-connection ergonomics with event-loop efficiency.

Connection pooling: the math matters

Database connection pools are almost always the tightest bottleneck in a high-concurrency system. Here's the math most people get wrong:

If your average query takes 5ms and you have a pool of 20 connections, your theoretical maximum throughput is 20 connections × (1000ms / 5ms) = 4,000 queries/second. That's your ceiling, regardless of how many application threads you have.

Common mistake: making the pool too large. A pool of 200 connections doesn't give you 10x throughput — it gives you 200 connections competing for database CPU, disk I/O, and buffer pool memory. PostgreSQL performance typically degrades past 50-100 active connections due to lock contention and context switching overhead.

The formula I use as a starting point:

pool_size = (num_cpu_cores * 2) + effective_spindle_count

For a database on SSD, where seek time no longer dominates and the spindle term loses its original meaning, a common simplification is num_cpu_cores * 2 + 1. A 4-core database server gets a pool of ~9. Then measure and adjust.

If you need more throughput than your pool supports, the answer is usually read replicas, caching, or query optimization — not a bigger pool.
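Both calculations are worth encoding in a quick script so you can plug in your own numbers; here's the arithmetic as a Go sketch (the 4-core / 5 ms figures are just the examples from above):

```go
package main

import "fmt"

// theoreticalQPS is the throughput ceiling imposed by a connection pool
// when each query holds a connection for avgQueryMs milliseconds.
func theoreticalQPS(poolSize int, avgQueryMs float64) float64 {
	return float64(poolSize) * (1000.0 / avgQueryMs)
}

// poolSize is the (cores * 2) + effective_spindle_count starting point.
func poolSize(cpuCores, effectiveSpindles int) int {
	return cpuCores*2 + effectiveSpindles
}

func main() {
	fmt.Println(theoreticalQPS(20, 5)) // 20 connections x 200 qps each = 4000
	fmt.Println(poolSize(4, 1))        // 4-core DB on SSD -> pool of ~9
}
```

The useful habit is running the ceiling calculation *before* load testing — if your target throughput exceeds it, no amount of application tuning will help.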

Backpressure: the most important pattern you're probably not implementing

Backpressure is how a system signals that it's overloaded and callers should slow down. Without backpressure, overload cascades: the overloaded service accepts more work than it can handle, its response times increase, callers start timing out and retrying, which adds even more load, and the system collapses.

Concrete backpressure mechanisms:

1. Bounded queues with rejection. Your internal work queue should have a max size. When it's full, reject new requests with HTTP 503 immediately. A fast rejection is infinitely better than a slow timeout.

Incoming Request
       │
       ▼
 ┌───────────┐   Full?   ┌──────────┐
 │   Queue   │──────────→│   503    │
 │ (bounded) │           │  Reject  │
 └─────┬─────┘           └──────────┘
       │
       ▼
    Workers
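A bounded queue with immediate rejection is only a few lines in Go — a non-blocking send on a buffered channel stands in for the queue, and a false return is what an HTTP handler would translate into a 503 (a sketch, not a full server):

```go
package main

import "fmt"

// submit tries to enqueue a job without blocking. A full queue means
// the system is overloaded, so the caller gets an immediate rejection
// rather than waiting for a slot.
func submit(queue chan int, job int) bool {
	select {
	case queue <- job:
		return true
	default: // queue full: fast rejection instead of a slow timeout
		return false
	}
}

func main() {
	queue := make(chan int, 2) // the capacity is the backpressure limit
	fmt.Println(submit(queue, 1)) // true
	fmt.Println(submit(queue, 2)) // true
	fmt.Println(submit(queue, 3)) // false -> respond 503
}
```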

2. Adaptive concurrency limits. Instead of a static limit, dynamically adjust based on observed latency. Netflix's concurrency-limit library implements this: it starts with a low limit and increases it as long as latency stays within bounds. When latency spikes, it reduces the limit.

3. Rate limiting at the edge. Token bucket or sliding window rate limiters at the API gateway. These protect downstream services from individual clients that send bursts. Use HTTP 429 with Retry-After headers.

4. Load shedding. When overloaded, deliberately drop low-priority requests to preserve capacity for high-priority ones. This requires request classification — not all requests are equal.

Consistency Models: Choose Your Guarantees

High concurrency and strong consistency are fundamentally in tension. The stronger your consistency guarantees, the more coordination your system needs, and coordination is the enemy of throughput.

The spectrum in practice

Strong consistency (linearizability): Every read reflects the most recent write. Achieved with single-leader replication + synchronous reads from the leader. Throughput limited by the leader's capacity. Use when correctness is non-negotiable (financial transactions, inventory counts).

Eventual consistency: Reads may return stale data, but all replicas will converge. Achieved with asynchronous replication. High throughput and availability. Use when staleness is tolerable (social media timelines, analytics dashboards, product catalog browsing).

Causal consistency: If event A caused event B, everyone sees A before B. Weaker than strong consistency but stronger than eventual. Achieved with vector clocks or hybrid logical clocks. A practical middle ground for collaborative applications.
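The "A caused B" relation is exactly what vector clocks encode; a minimal comparison sketch (assuming equal-length clocks, one slot per process):

```go
package main

import "fmt"

// happensBefore reports whether vector clock a causally precedes b:
// every component of a is <= the corresponding one in b, and at least
// one is strictly less. If neither precedes the other, the events are
// concurrent and need conflict resolution.
func happensBefore(a, b []int) bool {
	strictly := false
	for i := range a {
		if a[i] > b[i] {
			return false
		}
		if a[i] < b[i] {
			strictly = true
		}
	}
	return strictly
}

func main() {
	// Three processes; b has seen everything a saw, plus one more event.
	fmt.Println(happensBefore([]int{1, 0, 0}, []int{1, 1, 0})) // true: a -> b
	fmt.Println(happensBefore([]int{2, 0, 0}, []int{1, 1, 0})) // false: concurrent
}
```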

The decision framework I use

Ask these questions about each data path:

  1. What happens if the user sees stale data? If the answer is "they see their friend's post 500ms late," eventual consistency is fine. If the answer is "they overdraw their bank account," you need strong consistency.

  2. What's the write-to-read ratio? High-read, low-write paths can use caching aggressively with eventual consistency. High-write paths need careful consideration of conflict resolution.

  3. What's the blast radius of inconsistency? If stale data affects one user's experience, that's tolerable. If stale data causes incorrect billing for all users, that's not.

CQRS: when you need both

Command Query Responsibility Segregation lets you use different consistency models for reads and writes:

                 ┌───────────────────────┐
                 │  Command (Write)      │
   Write ───────→│  Strong consistency   │
                 │  Single-leader DB     │
                 └──────────┬────────────┘
                            │ Async replication
                            ▼
                 ┌───────────────────────┐
   Read ────────→│  Query (Read)         │
                 │  Eventually consistent│
                 │  Read replicas / cache│
                 └───────────────────────┘

Writes go through a strongly-consistent path (single primary database). Reads are served from eventually-consistent read replicas or materialized views. The write path is your bottleneck, but most workloads are read-heavy (90-99% reads), so this buys you a lot.

The tradeoff: complexity. You now have two data paths, potential replication lag, and the need to handle stale reads gracefully in your application. Every user-facing flow needs to answer: "what if the read replica is 500ms behind?"

A common pattern is read-your-own-writes consistency: after a write, route that specific user's subsequent reads to the primary for a short window (e.g., 5 seconds), then fall back to replicas.

Caching: Not as Simple as "Add Redis"

Caching is the most effective throughput multiplier and the most dangerous source of consistency bugs.

Cache invalidation strategies

Write-through cache: Write to cache and database simultaneously. Simple, consistent, but doubles write latency and the cache fills with data that may never be read.

Write-behind (write-back) cache: Write to cache, asynchronously flush to database. Lower write latency, but data loss risk if the cache crashes before flushing. Use only with durable cache infrastructure and when you can tolerate the (small) data loss window.

Cache-aside (lazy loading): Read from cache; on miss, read from database and populate cache. Most common pattern. Simple, but the first request after a cache miss is slow, and there's a window where cache and database are inconsistent after a write.

What I default to: Cache-aside with explicit invalidation on writes. When a write occurs, delete the cache key (don't try to update it — updating introduces race conditions between concurrent writes).
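With maps standing in for Redis and the database, the whole pattern fits in a few lines (a single-threaded sketch; real code needs TTLs plus locking or atomic cache operations):

```go
package main

import "fmt"

// db stands in for the database, cache for Redis/memcached.
var (
	db    = map[string]string{"user:1": "alice"}
	cache = map[string]string{}
)

// get is cache-aside: read the cache, and on a miss fall through to
// the database and populate the cache for subsequent readers.
func get(key string) string {
	if v, ok := cache[key]; ok {
		return v // cache hit
	}
	v := db[key]
	cache[key] = v
	return v
}

// put writes the database and deletes the cache key rather than
// updating it, avoiding races between concurrent writers.
func put(key, value string) {
	db[key] = value
	delete(cache, key) // next read repopulates with the fresh value
}

func main() {
	fmt.Println(get("user:1")) // miss -> loads from db, populates cache
	put("user:1", "bob")       // write invalidates the cached entry
	fmt.Println(get("user:1")) // miss again -> sees the new value
}
```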

The thundering herd problem

When a popular cache key expires, hundreds of concurrent requests simultaneously miss the cache and hit the database. This can overload the database and cause a cascade failure.

Mitigations:

  • Request coalescing (singleflight): Only one goroutine/thread fetches the data; all others wait for that result. Go's singleflight package does this.
  • Stale-while-revalidate: Serve the stale cached value while one background request refreshes it. The user gets slightly stale data, but the database isn't hammered.
  • Probabilistic early expiration: Each request has a small probability of refreshing the cache before it expires. With enough traffic, the cache is refreshed before it ever actually expires.

Cache sizing: the counterintuitive truth

More cache memory is not always better. A cache that's too large:

  • Evicts less frequently, which means stale data lives longer
  • Uses memory that could serve application workloads
  • Gives you a false sense of safety — when it does evict or restart, the database gets hit with the full uncached load

Size your cache for your working set, not your entire dataset. If 80% of reads hit 5% of your data (common in most applications), cache that 5% and let the long tail hit the database.

Queue-Based Decoupling: Converting Synchronous Pressure to Async Work

The most effective pattern for handling traffic spikes is converting synchronous request processing into asynchronous work.

The pattern

Client → API → Validate → Enqueue → Return 202 Accepted
                                          │
         ┌────────────────────────────────┘
         │ (async)
         ▼
      Workers → Process → Write Result → Notify Client

Instead of processing the request synchronously (and making the client wait), validate the input, enqueue a work item, and return immediately. Workers process the queue at their own pace.

When this works: Write-heavy operations, batch processing, anything where the client doesn't need the result in the same HTTP response.

When this doesn't work: Read operations, anything requiring sub-second response times, operations where the client needs the result to proceed.
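The validate-enqueue-ack path reduces to a non-blocking channel send; this sketch returns bare status codes and omits the worker side, which would simply range over the channel:

```go
package main

import "fmt"

// handleSubmit sketches the enqueue-and-ack pattern: validate first,
// then enqueue without blocking. 202 = accepted for async processing,
// 400 = invalid input, 503 = queue full (backpressure, not buffering).
func handleSubmit(queue chan string, payload string) int {
	if payload == "" {
		return 400 // reject bad input before it consumes queue capacity
	}
	select {
	case queue <- payload:
		return 202 // accepted; a worker will process it later
	default:
		return 503 // bounded queue is full: shed load immediately
	}
}

func main() {
	queue := make(chan string, 2)
	fmt.Println(handleSubmit(queue, ""))      // 400
	fmt.Println(handleSubmit(queue, "job-1")) // 202
	fmt.Println(handleSubmit(queue, "job-2")) // 202
	fmt.Println(handleSubmit(queue, "job-3")) // 503
}
```

Note that the same bounded-queue backpressure from earlier shows up here: a 202 is a promise to do the work, so the queue must reject when that promise can't be kept.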

Queue selection tradeoffs

SQS — Fully managed, automatically scales, at-least-once delivery. The right default for most async workloads on AWS. Downside: no ordering guarantees in standard queues (FIFO queues have ordering but lower throughput).

Kafka — Log-based, ordered within partitions, replay capability. The right choice when you need ordering, event sourcing, or multiple consumers processing the same events. Downside: operational complexity, partition management, consumer group coordination.

Redis Streams — Lightweight, fast, good for high-throughput low-durability workloads. Downside: limited durability guarantees compared to SQS/Kafka; data loss risk on Redis restart without persistence.

Dead letter queues: your safety net

Every queue needs a dead letter queue (DLQ). After N failed processing attempts, move the message to the DLQ instead of retrying forever. Then:

  • Alert on DLQ depth (messages in the DLQ mean something is broken)
  • Build tooling to inspect, replay, or discard DLQ messages
  • Include enough context in each message to debug without the original request context

Pitfalls That Only Appear at Scale

1. Connection exhaustion cascading. Service A runs out of database connections. Requests to A start timing out. Service B, which calls A, holds its own connections open waiting for A's response. Service B exhausts its connection pool. The failure propagates through the entire dependency chain. Mitigation: Circuit breakers on all inter-service calls. Set aggressive timeouts (2-5 seconds for synchronous calls).

2. Retry amplification. If the client, the API gateway, and the internal service each make up to 3 attempts, a single failing request becomes up to 27 attempts (3 × 3 × 3); even two retrying layers turn it into 9. During an outage, retry amplification can 10x your load. Mitigation: Retry budgets. Each layer in the stack gets a retry budget (e.g., "retry at most 10% of requests"). Use exponential backoff with jitter. Never retry on 4xx errors.

3. GC pauses under load. In garbage-collected languages (Java, Go, C#), high allocation rates under load trigger frequent GC pauses. These pauses look like latency spikes in your p99. Mitigation: Profile allocation under load. Reduce allocations on the hot path (object pooling, pre-allocation, avoid short-lived closures in tight loops). In Go, watch runtime/metrics for GC pause distributions.

4. Hot partitions. Your sharded database or partitioned queue has one partition receiving 80% of traffic because your shard key has poor distribution (e.g., sharding by customer ID when one customer generates most of the traffic). Mitigation: Use high-cardinality shard keys. Add a random suffix for known-hot keys. Monitor per-partition metrics.

5. Slow consumer poisoning. One slow consumer in a consumer group causes partition lag, which causes the broker to reassign partitions, which causes further consumer disruption, which causes more lag. Mitigation: Independent consumer health monitoring. Automatic removal of consumers that fall behind by more than a configurable threshold.

The Monitoring Stack You Need

You cannot operate a high-concurrency system without these metrics:

Connection-level:

  • Active connections per pool (database, HTTP, gRPC)
  • Connection wait time (how long requests queue for a connection)
  • Connection error rate (refused, timeout, reset)

Request-level:

  • Request rate (segmented by endpoint and status code)
  • Latency histograms (p50, p90, p95, p99 — not averages)
  • Error rate (5xx rate is the primary reliability signal)
  • In-flight request count (current concurrency)

Queue-level:

  • Queue depth (messages waiting to be processed)
  • Processing rate (messages/second)
  • Age of oldest message (time-to-process, not just depth)
  • DLQ depth and growth rate

System-level:

  • CPU utilization (sustained >70% means you're running too hot for spikes)
  • Memory utilization (and GC metrics for managed languages)
  • File descriptor count (connection exhaustion shows up here first)
  • Network socket states (TIME_WAIT accumulation indicates connection churn)

Production Readiness Checklist

Before going live with a high-concurrency service:

  • [ ] Load tested at 2x expected peak with realistic traffic patterns
  • [ ] Connection pool sizes tuned and validated (database, HTTP clients, gRPC channels)
  • [ ] Backpressure mechanism in place (bounded queues, rate limiting, or adaptive concurrency)
  • [ ] Circuit breakers on all outbound calls with tested fallback behavior
  • [ ] Retry strategy defined: exponential backoff + jitter + budget cap
  • [ ] Cache warming strategy documented (what happens after a cold start or cache flush?)
  • [ ] Graceful shutdown implemented (drain connections, finish in-flight work, stop accepting new requests)
  • [ ] DLQ configured for all async processing with alerting on depth
  • [ ] Latency budgets defined per dependency (if the database takes >50ms, shed load rather than propagate slowness)
  • [ ] Runbook written for: connection pool exhaustion, cache failure, queue backlog, retry storm
  • [ ] Chaos tested: what happens when one database replica fails? When Redis is unavailable? When a downstream service returns 500s for 5 minutes?

Closing Thought

The hardest part of high-concurrency system design isn't any single technique — it's the interaction between techniques. Your caching strategy affects your consistency model. Your consistency model affects your database load. Your database load affects your connection pool sizing. Your connection pool sizing affects your backpressure behavior.

Every decision constrains every other decision. The engineers who build reliable high-concurrency systems aren't the ones who know the most patterns — they're the ones who understand how those patterns interact in their specific system, under their specific load profile, with their specific failure modes.

Start with the simplest architecture that could work. Measure under realistic load. Add complexity only where measurements show you need it. And always, always implement backpressure before you think you need it — because by the time you need it, it's too late to add it.