Coroutines, Threads, and Goroutines: Concurrency Without the OS Overhead
Coroutines yield control voluntarily. Threads get preempted. That one distinction explains the performance characteristics, failure modes, and appropriate use cases for each.
The core distinction
A thread is preempted — the OS scheduler decides when to pause it and switch to another. A coroutine yields — it decides when to hand control back. That one difference in who controls scheduling determines almost everything about the performance, complexity, and failure modes of each model.
This is not a matter of one being better. They solve different problems. The question is which one fits the work.
Scheduling models: preemptive vs cooperative
The OS schedules threads based on time slices and priorities. Coroutines schedule themselves by explicitly yielding at defined suspension points. Both achieve concurrency, but through opposite mechanisms.
Prerequisites
- processes and threads
- OS scheduler basics
- I/O blocking
Key Points
- Preemptive: OS can pause a thread at any instruction. No code changes needed to be concurrent.
- Cooperative: coroutine runs until it explicitly yields (await, yield, channel send). Nothing pauses it otherwise.
- A hung preemptive thread wastes only its own time slices; the OS keeps scheduling other threads around it. A cooperative coroutine that never yields blocks its entire thread, and every other coroutine sharing that thread.
- Context switch cost: OS thread switch saves and restores kernel state (~microseconds). Coroutine switch is user-space only — often nanoseconds.
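The cooperative half of this can be sketched with plain Python generators: each `yield` is a voluntary suspension point, and a small round-robin scheduler (the `round_robin` helper here is illustrative, not a real asyncio API) interleaves the tasks.

```python
# A minimal sketch of cooperative scheduling. Each generator acts as a
# coroutine that suspends itself at every yield; the scheduler regains
# control only when a task volunteers to give it up.
from collections import deque

def task(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"  # voluntary suspension point

def round_robin(tasks):
    """Resume generators in turn until all are exhausted."""
    queue = deque(tasks)
    trace = []
    while queue:
        t = queue.popleft()
        try:
            trace.append(next(t))  # run until the next yield
            queue.append(t)        # it yielded cooperatively; re-enqueue
        except StopIteration:
            pass                   # task finished, drop it
    return trace

print(round_robin([task("a", 2), task("b", 2)]))
# steps interleave: ['a:0', 'b:0', 'a:1', 'b:1']
```

Nothing preempts a task here: if one generator looped without yielding, the others would never run. That is the entire cooperative contract in miniature.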
What a context switch actually costs
When the OS switches between threads, it saves the full execution context: registers, stack pointer, program counter. The kernel is involved, and the CPU caches that were warm for the previous thread go cold for the next one. On a modern CPU this costs roughly 1–10 microseconds, depending on the work done.
When a coroutine yields, it saves only what it needs to resume: typically the program counter and any local state the runtime tracks. No kernel involvement. The CPU cache may stay warm if coroutines run on the same thread. Cost is in the nanosecond range.
This is why systems like nginx or Node.js can handle tens of thousands of simultaneous connections without threads. Each connection is an event source. When a connection has no data, the coroutine/callback for it simply does not run.
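The gap can be sanity-checked from Python. This rough micro-benchmark assumes only that `await asyncio.sleep(0)` forces one full trip through the event loop; absolute numbers vary by machine, but the order of magnitude versus a microsecond-scale thread switch is the point.

```python
# Rough micro-benchmark: cost of suspending and resuming a coroutine.
# await asyncio.sleep(0) yields to the event loop and is resumed on the
# next pass, so each iteration is one user-space "context switch".
import asyncio
import time

async def switch_many(n):
    start = time.perf_counter()
    for _ in range(n):
        await asyncio.sleep(0)
    return (time.perf_counter() - start) / n

per_switch = asyncio.run(switch_many(100_000))
print(f"~{per_switch * 1e9:.0f} ns per event-loop round trip")
```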
Python asyncio: single-threaded concurrency
Python's asyncio runs all coroutines on a single thread. The event loop calls coroutines, they run until they hit an await, then the event loop picks up the next ready coroutine.
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)
asyncio.gather runs all fetches concurrently. While one is waiting for the HTTP response, others run. No threads, no locks.
The critical constraint: if any coroutine blocks without yielding, the entire event loop stalls. A CPU-intensive operation — image processing, SHA256 computation, a tight loop — that runs inside a coroutine will freeze every other connection until it completes.
# This will freeze all other connections for the duration of the computation
async def bad_handler():
    result = cpu_intensive_computation()  # never yields — blocks event loop
    return result

# Correct: run CPU work in a thread pool executor
async def good_handler():
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(None, cpu_intensive_computation)
    return result
asyncio is the right tool for I/O-bound workloads with many concurrent connections. It is the wrong tool for CPU-bound workloads. Use a process pool for those.
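A sketch of that advice, with a `sha256_many` stand-in for the CPU-bound work: `ProcessPoolExecutor` runs it in a separate process, so it bypasses the GIL and never stalls the event loop.

```python
# Offloading CPU-bound work to a process pool so the event loop stays
# responsive. sha256_many is a stand-in for any heavy computation.
import asyncio
import hashlib
from concurrent.futures import ProcessPoolExecutor

def sha256_many(data: bytes, rounds: int) -> str:
    digest = data
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # Runs in another process; the loop is free to serve other coroutines
        return await loop.run_in_executor(pool, sha256_many, b"payload", 10_000)

if __name__ == "__main__":
    print(asyncio.run(main()))  # a 64-character hex digest
```

Passing `None` as the executor (as in the earlier example) uses a thread pool, which keeps the loop responsive but still contends with the GIL; a process pool gives CPU work its own interpreter.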
Go goroutines: M:N scheduling
Go takes a different approach. Goroutines are not purely cooperative and not purely preemptive — they use an M:N scheduler: M goroutines multiplexed across N OS threads (where N = GOMAXPROCS, defaulting to CPU count).
The Go runtime scheduler preempts goroutines at safe points (function calls, channel operations). Since Go 1.14, it can also preempt goroutines in tight loops via signal-based preemption. Goroutines do not have to manually yield — the runtime handles it.
// Each of these runs concurrently. The runtime schedules them.
for _, url := range urls {
    go func(u string) {
        resp, err := http.Get(u)
        // ... handle response
    }(url)
}
Goroutines start with a 2–8KB stack (vs 1–8MB for OS threads) and grow dynamically. You can run a million goroutines on a single machine. OS threads at that count would exhaust memory.
💡 Why goroutines can block on I/O without stalling other goroutines
When a goroutine performs a blocking syscall (file read, network I/O), the Go runtime moves it off its OS thread. The OS thread handles the syscall. Another goroutine gets scheduled on a different OS thread. When the syscall completes, the blocked goroutine becomes runnable again and waits for an available thread.
This is the key difference from Python's event loop model: you can write blocking-style code in Go (resp, err := http.Get(url)) without stalling other goroutines, because the runtime transparently multiplexes.
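Python's closest analog to this transparency is `asyncio.to_thread` (3.9+), which ships a blocking call off to a worker thread and awaits the result. A sketch, with a hypothetical `blocking_fetch` standing in for a synchronous HTTP library:

```python
# Blocking-style code coexisting with the event loop: each call runs in a
# worker thread, so the two 0.1s "requests" overlap instead of serializing.
import asyncio
import time

def blocking_fetch(url: str) -> str:
    time.sleep(0.1)  # stand-in for a blocking HTTP library call
    return f"body of {url}"

async def main():
    start = time.perf_counter()
    bodies = await asyncio.gather(
        asyncio.to_thread(blocking_fetch, "https://a.example"),
        asyncio.to_thread(blocking_fetch, "https://b.example"),
    )
    return bodies, time.perf_counter() - start

bodies, elapsed = asyncio.run(main())
print(bodies, f"{elapsed:.2f}s")  # elapsed is ~0.1s, not 0.2s
```

The difference from Go is that you must opt in per call site; the Go runtime does the equivalent move for every blocking syscall automatically.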
Thread-per-connection vs coroutines: the numbers
A web server handling 10,000 concurrent connections:
- Thread-per-connection: 10,000 × 1MB stack = 10GB memory minimum. Context switching overhead at this scale is measurable. Linux has a default thread limit around 32,768.
- asyncio/Node.js event loop: 10,000 coroutines/callbacks on 1 thread. Memory footprint is small. CPU is fully utilized only if I/O dominates — any blocking code is catastrophic.
- Go goroutines: 10,000 × 4KB stack = 40MB. Runtime handles multiplexing. Blocking code works without special handling.
The right choice depends on the workload profile, not ideology.
Threads vs coroutines vs goroutines
These are not competing options — they model different tradeoffs between control, overhead, and scalability.
OS threads:
- Preemptive — OS decides when to switch
- 1–8MB stack per thread; ~30K limit on Linux
- True parallelism on multi-core CPUs
- Blocking I/O is fine — thread just waits
- Synchronization requires locks, which introduce contention

Coroutines and goroutines:
- Cooperative or M:N — runtime or explicit yield
- 2–64KB stack; millions possible
- Concurrency (interleaving) rather than parallelism per coroutine
- Python asyncio: blocking I/O freezes the event loop
- Go goroutines: blocking I/O handled transparently by runtime
Use OS threads when work is CPU-bound and parallelism matters. Use coroutines (asyncio, Node.js) for I/O-heavy servers where you control the async boundary. Use Go goroutines when you want simplicity — blocking-style code that scales without manual async management.
When cooperative scheduling fails
The cooperative model has one failure mode that preemptive threads do not: a coroutine that does not yield holds the thread.
# Infinite loop with no await — kills the event loop
async def broken():
    while True:
        process_item()  # no await, never yields
In Python, this is a bug that silently kills throughput. The event loop cannot recover. All other connections queue up behind it.
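One repair, sketched with hypothetical stand-ins for the work: insert an explicit suspension point inside the loop so other coroutines get a turn. The `heartbeat` coroutine below makes progress only because the loop actually yields.

```python
# Fixing a non-yielding loop by adding an explicit suspension point.
import asyncio

async def fixed(items, handled):
    for item in items:
        handled.append(item * 2)  # stand-in for process_item()
        await asyncio.sleep(0)    # yield so other coroutines can run

async def heartbeat(ticks):
    for _ in range(3):
        ticks.append("tick")      # progresses only if fixed() yields
        await asyncio.sleep(0)

async def main():
    handled, ticks = [], []
    await asyncio.gather(fixed(range(3), handled), heartbeat(ticks))
    return handled, ticks

print(asyncio.run(main()))
# the two coroutines interleave: ([0, 2, 4], ['tick', 'tick', 'tick'])
```

For genuinely CPU-heavy loop bodies, yielding every iteration is not enough; move the work to an executor as shown earlier.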
Go's runtime mitigates this with signal-based preemption (since 1.14) — even a tight CPU loop will eventually be preempted. But a goroutine calling a blocking C library function via cgo will still block its OS thread entirely.
When to reach for which model
Use goroutines (Go) when:
- You want concurrent I/O without async/await ceremony
- Workload is mixed (some CPU, some I/O)
- You want straightforward code that the runtime parallelizes automatically
Use async/await (Python, JavaScript) when:
- Workload is I/O-bound and latency-sensitive
- You are building a high-connection-count server
- CPU work can be isolated to thread/process pool executors
Use OS threads when:
- Workload is CPU-bound and needs true parallelism
- Work units are coarse-grained (threads are too heavy for thousands of fine-grained tasks)
- You are using a language or library without good coroutine support
A Python asyncio web server starts responding slowly under moderate load. CPU usage is low. All requests eventually complete, but p99 latency spikes to several seconds. What is the most likely cause?
(Difficulty: medium) The server handles REST API requests. Some endpoints call a third-party SDK that is not async-aware.

A. The event loop is processing too many coroutines simultaneously
Incorrect. The event loop handles thousands of coroutines efficiently — switching between them at await points is cheap. The number of concurrent coroutines is not the bottleneck.

B. A synchronous SDK call is blocking the event loop for the duration of the external call
Correct! Synchronous library calls inside an async handler run on the event loop thread and block it entirely for their duration. Every other coroutine queues up. One slow external call (e.g., a 500ms synchronous HTTP request) with 10 concurrent users means 5-second p99 latency for the last user in queue. Fix: wrap the sync call in loop.run_in_executor().

C. asyncio does not support concurrent requests without multiprocessing
Incorrect. asyncio handles concurrency well for I/O-bound workloads. The problem is blocking calls, not concurrency limits.

D. Python's GIL is preventing true concurrency
Incorrect. The GIL prevents parallel CPU execution across threads, but asyncio uses a single thread intentionally. The GIL is not the bottleneck here.

Hint: Think about what happens to the event loop when a coroutine calls a synchronous function that takes 500ms.