TCP: How Reliable Delivery Actually Works

TCP makes strong delivery guarantees over an unreliable network. Understanding the mechanisms behind those guarantees — handshake, flow control, congestion control, TIME_WAIT — helps diagnose real production problems.

What TCP promises and what it costs

TCP gives you ordered, reliable, error-checked byte delivery between two endpoints. Every byte you write arrives intact, in order, exactly once, or the connection fails explicitly. No silent loss, no duplication, no reordering.

Those guarantees are not free. TCP pays for them with: a three-way handshake before any data moves, sequence numbers and acknowledgments on every segment, flow control to avoid overwhelming the receiver, congestion control to avoid overwhelming the network, and a TIME_WAIT state after the connection closes.

Each of these mechanisms has operational consequences. Production systems hit them regularly: connection pool exhaustion from TIME_WAIT, throughput degradation from small congestion windows, latency spikes from slow start. Knowing how the protocol works turns those into diagnosable problems.

The TCP connection model

Concept: Transport Layer Protocol

TCP is connection-oriented. Both endpoints maintain state — sequence numbers, window sizes, retransmit timers — for the lifetime of the connection. The connection is identified by a 4-tuple: (source IP, source port, destination IP, destination port).

Prerequisites

  • IP addressing
  • client-server model
  • ports and sockets

Key Points

  • TCP delivers a byte stream, not messages. The application layer defines message boundaries.
  • Every byte has a sequence number. The receiver acknowledges up to the highest contiguous byte received.
  • The send window limits how much unacknowledged data can be in flight at once.
  • TCP connections are full-duplex: each direction has its own sequence number space.

The three-way handshake: why three

Before data moves, both sides agree on initial sequence numbers. This takes three messages:

Client → Server: SYN  (seq=x)
Server → Client: SYN-ACK (seq=y, ack=x+1)
Client → Server: ACK  (ack=y+1)

Two messages are not enough. If the server sent SYN-ACK and the client sent data immediately, the server would not know whether the client received the SYN-ACK. The server's SYN-ACK could have been lost and retransmitted; the client might be replying to a stale copy. The third message — the final ACK — confirms that the client received the server's sequence number.

Four messages would add nothing. Three is the minimum to confirm both directions.

The practical cost: every new TCP connection adds one round-trip of latency before the first byte of data. For high-frequency short connections — HTTP/1.1 without keep-alive, microservices making per-request connections — this latency compounds. Connection pooling and HTTP/2 (multiplexing) exist specifically to avoid paying this cost repeatedly.
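The handshake cost is visible from userspace: a dial call does not return until all three messages have been exchanged. A minimal, self-contained sketch using a loopback listener (the structure is illustrative, not a benchmark):

```go
package main

import (
	"fmt"
	"net"
)

// connectOnce dials a local listener. net.Dial returns only after the
// three-way handshake (SYN, SYN-ACK, ACK) has completed, so every fresh
// connection pays at least one round-trip before any data can move.
func connectOnce() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0") // ephemeral local port
	if err != nil {
		return err
	}
	defer ln.Close()
	go func() {
		if c, err := ln.Accept(); err == nil {
			c.Close()
		}
	}()
	conn, err := net.Dial("tcp", ln.Addr().String()) // blocks for the handshake
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	if err := connectOnce(); err != nil {
		panic(err)
	}
	fmt.Println("handshake complete")
}
```

On loopback the round-trip is microseconds; to a server 80 ms away, the same call costs 80 ms before the first byte, which is the latency that pooling amortizes.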

Sequence numbers, ACKs, and the sliding window

TCP sends segments. Each segment carries a sequence number (the byte offset of its first byte) and data. The receiver acknowledges by specifying the next byte it expects:

Sender sends: segment starting at byte 1000, length 500 → seq=1000
Receiver ACKs: ack=1500 (next expected byte)

The sliding window controls how many bytes the sender can have unacknowledged at once. The receiver advertises its receive window (rwnd) in every ACK — the amount of buffer space it has available. The sender cannot send beyond min(cwnd, rwnd) bytes past the last acknowledged byte.

|-- acknowledged --|-- in flight (unacked) --|-- can send --|-- must wait --|
                   ^                         ^              ^
                SND.UNA                   SND.NXT      SND.UNA + window

This is flow control: the receiver throttles the sender to match its processing rate. When a slow application reads the receive buffer slowly, rwnd shrinks. If it hits zero, the sender stops and probes periodically with 1-byte segments until the window reopens.
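The sender-side arithmetic above can be sketched as a small helper; the function and parameter names are illustrative, not kernel APIs:

```go
package main

import "fmt"

// sendable returns how many more bytes the sender may transmit right now:
// the effective window is min(cwnd, rwnd), measured forward from the last
// acknowledged byte, minus what is already in flight.
func sendable(cwnd, rwnd, inFlight int) int {
	window := cwnd
	if rwnd < window {
		window = rwnd
	}
	if inFlight >= window {
		return 0 // window full, or receiver advertised rwnd = 0
	}
	return window - inFlight
}

func main() {
	fmt.Println(sendable(60000, 20000, 15000)) // receiver-limited: 5000
	fmt.Println(sendable(60000, 0, 0))         // zero window: sender stalls
}
```

The zero-window case is exactly the stall described above: the sender can only probe until rwnd reopens.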

Congestion control: protecting the network

Flow control protects the receiver. Congestion control protects the network.

TCP infers network congestion from packet loss (a retransmit timeout or three duplicate ACKs). It maintains a second window — the congestion window (cwnd) — and limits transmission to min(cwnd, rwnd).

Slow start: at connection open, cwnd starts at 1–10 segments and doubles each RTT until a threshold is reached or loss occurs. This is why a new connection to a distant server feels slow for the first few round-trips even on a fast link: TCP is still building up the window.

Congestion avoidance: once cwnd exceeds the threshold, it grows linearly (one segment per RTT) instead of doubling.

On loss: cwnd drops sharply. Newer algorithms (CUBIC, BBR) differ in how aggressively they reduce and recover.
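A toy simulation makes the growth pattern concrete. The initial window and threshold below are illustrative values; real stacks track cwnd in bytes and differ by algorithm:

```go
package main

import "fmt"

// cwndAfter simulates loss-free cwnd growth (in segments) over n
// round-trips: exponential doubling during slow start, then one segment
// per RTT in congestion avoidance. Initial window and ssthresh are
// illustrative, not what any particular kernel uses.
func cwndAfter(rtts, initial, ssthresh int) int {
	cwnd := initial
	for i := 0; i < rtts; i++ {
		if cwnd < ssthresh {
			cwnd *= 2 // slow start: double each RTT
			if cwnd > ssthresh {
				cwnd = ssthresh
			}
		} else {
			cwnd++ // congestion avoidance: linear growth
		}
	}
	return cwnd
}

func main() {
	for _, r := range []int{0, 1, 2, 3, 4, 5} {
		fmt.Printf("after %d RTTs: cwnd = %d segments\n", r, cwndAfter(r, 10, 64))
	}
}
```

Starting from 10 segments, the window quadruples in two round-trips and then crawls: the first RTTs of a connection carry far less data than its steady state.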

💡Why slow start matters for latency-sensitive services

A fresh TCP connection to a CDN or API server starts with a small congestion window and grows it over the first few round-trips. If the response fits in the initial window (~14 KB for a 10-segment start), it arrives in one burst. If it does not, the second chunk waits for an RTT.

This is one reason why HTTP/2 multiplexing over a single persistent connection outperforms multiple HTTP/1.1 connections for small assets: the persistent connection's cwnd is already large, so each request benefits from the accumulated window.

TIME_WAIT: the state that surprises most engineers

After a connection closes, the endpoint that initiated the close (sent the first FIN) enters TIME_WAIT for 2 × MSL (Maximum Segment Lifetime; MSL is typically 30–60 seconds, so TIME_WAIT lasts 60–120 seconds). The 4-tuple is held and cannot be reused during this period.

Active closer (client):
FIN_WAIT_1 → FIN_WAIT_2 → TIME_WAIT (2 × MSL) → CLOSED

Passive closer (server):
CLOSE_WAIT → LAST_ACK → CLOSED

Why does TIME_WAIT exist? Two reasons:

  1. The final ACK might be lost. The passive closer retransmits its FIN. The active closer needs to be alive to re-send the ACK.
  2. Delayed packets from the old connection must drain from the network before the 4-tuple is reused. Without TIME_WAIT, a new connection with the same IP/port pair might receive stale segments from the previous one.

The production problem: a server handling many short-lived connections (or a client making many outbound connections) can accumulate tens of thousands of TIME_WAIT sockets. When the ephemeral port range is exhausted, new connections fail with EADDRNOTAVAIL.

# Count sockets in TIME_WAIT (first line of output is a column header)
ss -tan state time-wait | wc -l

# Tune the ephemeral port range (Linux)
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Enable port reuse for outbound connections
sysctl -w net.ipv4.tcp_tw_reuse=1

tcp_tw_reuse lets the kernel reuse a TIME_WAIT socket for new outbound connections if the timestamps indicate the old connection is truly gone. This is safe for clients. Do not use tcp_tw_recycle — it was removed in Linux 4.12 because it broke NAT environments.
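A back-of-envelope calculation shows why exhaustion arrives quickly. Assuming the default Linux ephemeral range (32768–60999) and 60 seconds spent in TIME_WAIT:

```go
package main

import "fmt"

// maxConnRate estimates the sustainable rate of short-lived outbound
// connections to a single destination: once every ephemeral port is
// parked in TIME_WAIT, new connections fail with EADDRNOTAVAIL.
func maxConnRate(ephemeralPorts int, timeWaitSeconds float64) float64 {
	return float64(ephemeralPorts) / timeWaitSeconds
}

func main() {
	ports := 60999 - 32768 + 1 // default Linux ephemeral range: 28232 ports
	rate := maxConnRate(ports, 60) // assuming 60s in TIME_WAIT
	fmt.Printf("~%.0f connections/second before port exhaustion\n", rate)
}
```

Roughly 470 connections per second per destination: well within reach of a busy service making one connection per request.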

📝The real fix for TIME_WAIT exhaustion

Tuning port ranges is a mitigation. The root fix is connection pooling. Reusing connections eliminates the close-per-request pattern that produces TIME_WAIT accumulation. HTTP/1.1 keep-alive, HTTP/2 multiplexing, and explicit connection pools in database clients all address this at the application layer.

TCP vs UDP: the actual tradeoff

The choice is not about reliability preference — it is about who should manage retransmission and ordering.

TCP
  • Ordered, reliable, flow-controlled byte stream
  • Head-of-line blocking: a lost packet stalls all subsequent data until retransmitted
  • Connection setup overhead (handshake + slow start)
  • Right for: HTTP, databases, file transfer, any protocol requiring accuracy over speed
UDP
  • Unreliable, unordered datagrams — no retransmission
  • No head-of-line blocking: later packets are not held for earlier lost ones
  • Minimal overhead, low latency
  • Right for: real-time video/audio, DNS, gaming, protocols that implement their own reliability (QUIC)
Verdict

TCP's head-of-line blocking is its biggest weakness for latency-sensitive multiplexed streams. QUIC (HTTP/3) runs over UDP and implements its own per-stream reliability to avoid this: a lost packet in one stream does not block other streams.

Nagle's algorithm and when to disable it

Nagle's algorithm buffers small writes: the sender holds data until it has a full segment or all outstanding data is acknowledged. This improves throughput for applications that write small chunks — interactive terminal sessions, for example — by reducing the number of tiny packets.

For latency-sensitive applications — database clients, interactive APIs — Nagle's buffering adds delay. A small write, a 4-byte query payload say, can sit in the send buffer until the previous segment is acknowledged.

Disable it with TCP_NODELAY:

// Go: set TCP_NODELAY on a net.Conn (disables Nagle's algorithm)
if tcpConn, ok := conn.(*net.TCPConn); ok {
    tcpConn.SetNoDelay(true)
}

Most connection libraries for databases and HTTP clients disable Nagle's by default. Check yours if you see unexplained latency on small requests.

A service opens thousands of short-lived outbound connections per minute and starts receiving EADDRNOTAVAIL errors. What is the most likely cause?

Difficulty: medium

The server is on Linux. Each request opens a new TCP connection to a downstream API and closes it after the response.

  • A. The downstream API is rejecting connections due to rate limiting
    Incorrect. Rate limiting would produce connection refused errors or 429 responses, not EADDRNOTAVAIL.
  • B. The ephemeral port range is exhausted because TIME_WAIT sockets are holding ports
    Correct! Each closed connection enters TIME_WAIT for up to 2 minutes. At high connection rates, TIME_WAIT sockets accumulate and exhaust the ~28,000 default ephemeral ports. The OS cannot allocate a local port for new connections.
  • C. The server's network interface has too many IP addresses assigned
    Incorrect. EADDRNOTAVAIL can occur from IP misconfiguration, but in the context of a high connection rate it almost always means port exhaustion.
  • D. The kernel TCP buffer is full
    Incorrect. A full receive buffer causes the advertised window to shrink to zero. The sender pauses but does not get EADDRNOTAVAIL.

Hint: EADDRNOTAVAIL means the OS cannot assign a local port. Think about what holds ports after a connection closes.