Sockets: The API Between Application Code and the Network Stack


Every network connection — TCP, UDP, Unix domain — goes through the socket API. Understanding the difference between blocking and non-blocking sockets, the role of the kernel's accept queue, and how event-driven servers avoid the thread-per-connection trap explains why high-concurrency servers are designed the way they are.


What a socket is

A socket is a file descriptor representing one endpoint of a network connection. At the OS level, it is an integer handle — the same abstraction Linux uses for files, pipes, and devices. read() and write() work on sockets the same way they work on files.

The application creates a socket, the kernel maintains the connection state (TCP sequence numbers, receive buffer, send buffer), and the socket API is the interface between them.

// The minimal server: create socket, bind, listen, accept, read, write
// (error checks omitted; response/response_len prepared elsewhere)
int sock = socket(AF_INET, SOCK_STREAM, 0);    // create TCP socket
struct sockaddr_in addr = { .sin_family = AF_INET,
                            .sin_port = htons(8080),
                            .sin_addr.s_addr = INADDR_ANY };
bind(sock, (struct sockaddr *)&addr, sizeof(addr));  // assign address
listen(sock, 128);                             // mark as passive, backlog=128
struct sockaddr_in client_addr;
socklen_t len = sizeof(client_addr);
int conn = accept(sock, (struct sockaddr *)&client_addr, &len);  // wait for connection
char buf[1024];
read(conn, buf, sizeof(buf));                  // receive data
write(conn, response, response_len);           // send response
close(conn);                                   // done

The kernel's accept queue and TCP handshake


When a TCP SYN arrives at a listening socket, the kernel completes the three-way handshake without the application being involved. The completed connection sits in the accept queue. The application retrieves it by calling accept().

Prerequisites

  • TCP three-way handshake
  • file descriptors
  • process basics

Key Points

  • The kernel manages TCP connection state independently of when the application calls accept().
  • The listen() backlog parameter sets the maximum connections waiting in the accept queue.
  • If the queue fills (application not calling accept() fast enough), the kernel drops incoming connection attempts — on Linux, typically by ignoring the SYN or the handshake's final ACK — and the client's TCP retransmits.
  • Each accepted connection is a new socket with its own file descriptor — the listening socket stays open for future connections.
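These points can be observed from user space. In this runnable Python sketch (the socket module wraps the same kernel calls as the C API), the client's connect() succeeds before the server ever calls accept() — the kernel has already completed the handshake and parked the connection in the accept queue — and the accepted connection comes back as a new file descriptor while the listening socket stays open:

```python
import socket

# Listening socket on an ephemeral localhost port, backlog of 8
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(8)
port = listener.getsockname()[1]

# connect() completes the three-way handshake against the kernel alone;
# the application has not called accept() yet.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))

# accept() merely dequeues the already-established connection,
# returning a brand-new descriptor; the listener stays open.
conn, _ = listener.accept()
is_new_fd = conn.fileno() != listener.fileno()

client.sendall(b"ping")
echoed = conn.recv(4)

conn.close(); client.close(); listener.close()
```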

Blocking vs non-blocking sockets

By default, socket operations block: read() waits until data arrives, accept() waits until a connection comes in, connect() waits until the TCP handshake completes.

For a server handling one connection at a time, this is fine. For a server handling thousands of simultaneous connections, blocking means you need a thread per connection — the thread-per-connection model that fails at scale (see C10K problem: 10,000 simultaneous threads at 1MB stack each = 10 GB of memory just for stacks).

Non-blocking sockets return immediately with EAGAIN or EWOULDBLOCK if the operation would block. Instead of busy-retrying, the application waits until the kernel signals that the socket is ready, then tries again.

// Set socket to non-blocking
int flags = fcntl(sock, F_GETFL, 0);
fcntl(sock, F_SETFL, flags | O_NONBLOCK);

// Read returns immediately — check the return value
ssize_t n = read(sock, buf, sizeof(buf));
if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    // No data available right now — come back later
}

Polling the socket in a loop wastes CPU. The solution is to let the kernel notify you when a socket is ready.
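The same behavior is easy to observe from Python, where setblocking(False) performs the fcntl O_NONBLOCK dance shown above and a would-block read surfaces as BlockingIOError (errno EAGAIN). A minimal sketch using a connected local socket pair:

```python
import errno
import socket

a, b = socket.socketpair()      # a pair of already-connected local sockets
a.setblocking(False)            # equivalent to setting O_NONBLOCK via fcntl

# Nothing has been sent yet, so a non-blocking read fails immediately
try:
    a.recv(1024)
    would_block = False
except BlockingIOError as e:
    would_block = (e.errno == errno.EAGAIN)

b.sendall(b"hello")
data = a.recv(1024)             # data is queued now, so the read succeeds

a.close(); b.close()
```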

epoll: efficient event notification

select() and poll() check readiness across a set of file descriptors. Both have O(n) cost proportional to the number of descriptors watched — they iterate over the entire set on each call — and select() additionally caps the watched set at FD_SETSIZE (typically 1024).

epoll (Linux-specific) uses a kernel-maintained interest list. You register file descriptors once with epoll_ctl(). The kernel adds descriptors to a ready list as events occur. Your application calls epoll_wait() and gets only the events that are ready.

// Create epoll instance
int epfd = epoll_create1(0);

// Register the listening socket
struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_sock };
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_sock, &ev);

// Event loop
while (1) {
    struct epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, -1);  // block until events ready

    for (int i = 0; i < n; i++) {
        if (events[i].data.fd == listen_sock) {
            // New connection ready to accept
            int conn = accept(listen_sock, NULL, NULL);
            // Register conn with epoll
            ev.data.fd = conn;
            epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
        } else {
            // Existing connection has data
            handle_request(events[i].data.fd);
        }
    }
}

This is the foundation of event-driven servers: nginx, Node.js, and libuv all use epoll (on Linux) or kqueue (on macOS/BSD) for this pattern. One thread handles thousands of connections by multiplexing I/O events.
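The same loop is portable in Python through the standard selectors module, which picks epoll on Linux and kqueue on macOS/BSD automatically. A runnable single-process sketch that multiplexes a listener and its connections on one thread (the client is driven from the same thread purely for the demo):

```python
import selectors
import socket

sel = selectors.DefaultSelector()   # epoll on Linux, kqueue on macOS/BSD

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(128)
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", listener.getsockname()[1]))
client.sendall(b"ping")

echoed = None
while echoed is None:
    for key, _ in sel.select(timeout=1):
        if key.fileobj is listener:
            conn, _ = listener.accept()    # new connection ready to accept
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            data = key.fileobj.recv(1024)  # existing connection has data
            key.fileobj.sendall(data)      # echo it back
            echoed = client.recv(1024)

sel.close(); client.close(); listener.close()
```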

Note: Edge-triggered vs level-triggered epoll

epoll supports two notification modes:

Level-triggered (default): epoll_wait returns a socket as ready as long as data is available. If you don't read all the data, the next epoll_wait call returns it again.

Edge-triggered (EPOLLET): epoll_wait returns a socket as ready only when its state changes (new data arrives). If you don't read all the data in one shot, you won't be notified again until more data arrives.

Edge-triggered requires reading in a loop until EAGAIN:

// Edge-triggered: must drain the socket
while ((n = read(conn, buf, sizeof(buf))) > 0) {
    process(buf, n);
}
// n == -1 && errno == EAGAIN → fully drained

Edge-triggered is more efficient (fewer epoll_wait syscalls) but requires careful coding — missing data because you didn't fully drain is a common bug.
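The difference is observable in a few lines of Python on Linux, where select.epoll wraps the same syscalls (the sketch skips itself on platforms without epoll). With data left unread in the socket, a level-triggered poll keeps reporting readiness, while an edge-triggered one reports it only once:

```python
import select
import socket

def poll_twice_without_reading(flags):
    """Register one end of a socket pair, send data, poll twice, never read."""
    a, b = socket.socketpair()
    ep = select.epoll()
    ep.register(a.fileno(), flags)
    b.sendall(b"x")                      # make `a` readable
    hits = sum(len(ep.poll(timeout=0.1)) for _ in range(2))
    ep.close(); a.close(); b.close()
    return hits

results = None
if hasattr(select, "epoll"):             # Linux only
    level_hits = poll_twice_without_reading(select.EPOLLIN)
    edge_hits = poll_twice_without_reading(select.EPOLLIN | select.EPOLLET)
    results = (level_hits, edge_hits)    # level reports twice, edge once
```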

The socket options that matter in production

SO_REUSEADDR: allows binding to a port that is in TIME_WAIT state. Without this, a server restart within 2 minutes of shutdown fails with "address already in use." Almost always enabled on server sockets.

int yes = 1;
setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

SO_REUSEPORT (Linux 3.9+): allows multiple sockets to bind to the same port. Multiple worker processes (or threads) each call accept() on their own socket — the kernel load-balances incoming connections across them. Eliminates the single accept bottleneck.

TCP_NODELAY: disables Nagle's algorithm. Send small packets immediately without waiting for a full segment or ACK. Essential for latency-sensitive protocols (database clients, interactive services).

SO_KEEPALIVE: the kernel sends periodic TCP keepalive probes on idle connections. If the peer is gone (crashed, network drop), the connection is detected and closed. Without keepalives, idle connections remain open indefinitely.
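Setting and verifying these options from Python uses the same constants and the same setsockopt call as the C API. SO_REUSEPORT and keepalive interval tuning are platform-dependent, so this sketch sets only the portable flags and reads them back:

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)   # survive TIME_WAIT
srv.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)   # probe idle peers
srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)   # disable Nagle

# getsockopt reads the flags back: nonzero means enabled
reuse = srv.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
keepalive = srv.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
nodelay = srv.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)

srv.close()
```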

Unix domain sockets for local IPC

When both ends of a connection are on the same machine (application to local database, nginx to upstream app server), Unix domain sockets skip the network stack entirely. They communicate through the kernel using file paths instead of IP:port.

# Server (Python)
import socket, os
path = '/tmp/myservice.sock'
if os.path.exists(path):
    os.unlink(path)          # remove a stale socket file from a previous run
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.bind(path)
sock.listen(5)

# Client
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(path)

Unix domain sockets are 2-5x faster than loopback TCP (127.0.0.1) for local IPC because they bypass TCP/IP processing, checksumming, and loopback routing. nginx to PHP-FPM, Redis (when client and server are co-located), and PostgreSQL all support Unix domain socket connections.
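The server and client snippets above are meant to run as separate processes, but the listen() backlog makes it possible to demo the full roundtrip in one process, since connect() succeeds before accept() is called. A runnable sketch (the socket filename is an arbitrary example, placed in a fresh temp directory):

```python
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myservice.sock")

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)                 # the address is a filesystem path, not IP:port
server.listen(5)

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(path)              # succeeds: connection waits in the accept queue

conn, _ = server.accept()
client.sendall(b"query")
received = conn.recv(1024)        # bytes travel through the kernel only

conn.close(); client.close(); server.close()
os.unlink(path)
```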

A server is restarted after a crash, but its first `bind()` call fails with 'Address already in use'. No other process is using the port. What is happening and how should it be fixed?

Difficulty: easy

Scenario: The server was recently killed (SIGKILL). The port is in TIME_WAIT because the old server initiated the connection closes, and the new server does not set any socket options before binding.

  • A: The kernel needs to be rebooted to clear the TIME_WAIT state
    Incorrect. TIME_WAIT clears on its own after 2 × MSL (typically 60-120 seconds); rebooting is unnecessary.
  • B: Set SO_REUSEADDR before bind() — this allows binding to a port in TIME_WAIT state
    Correct! SO_REUSEADDR explicitly allows a new socket to bind to a port that is in TIME_WAIT. Without it, bind() fails for up to two minutes after the previous server's connections drain. It should always be set on server listening sockets before bind() is called.
  • C: Use a different port number to avoid the conflict
    Incorrect. Changing the port is a workaround, not a fix. SO_REUSEADDR is the standard solution and lets you reuse the same port immediately.
  • D: The port is blocked by the firewall and must be reopened
    Incorrect. A firewall block produces 'connection refused' or a timeout on the client side. bind() failing with 'address already in use' is a kernel-level socket state issue, not a firewall issue.

Hint: TIME_WAIT holds the port for 60-120 seconds after connection close. What socket option allows immediate reuse?