Coroutines, Threads, and Goroutines: Concurrency Without the OS Overhead
Coroutines yield control voluntarily. Threads get preempted. That one distinction explains the performance characteristics, failure modes, and appropriate use cases for each.
The core distinction
A thread is preempted — the OS scheduler decides when to pause it and switch to another. A coroutine yields — it decides when to hand control back. That one difference in who controls scheduling determines almost everything about the performance, complexity, and failure modes of each model.
This is not a matter of one being better. They solve different problems. The question is which one fits the work.
Scheduling models: preemptive vs cooperative
The OS schedules threads based on time slices and priorities. Coroutines schedule themselves by explicitly yielding at defined suspension points. Both achieve concurrency, but through opposite mechanisms.
Prerequisites
- processes and threads
- OS scheduler basics
- I/O blocking
Key Points
- Preemptive: OS can pause a thread at any instruction. No code changes needed to be concurrent.
- Cooperative: coroutine runs until it explicitly yields (await, yield, channel send). Nothing pauses it otherwise.
- A hung preemptive thread wastes only its own time slices; the OS keeps scheduling other threads around it. A cooperative coroutine that never yields blocks its entire thread, and every other coroutine sharing that thread.
- Context switch cost: OS thread switch saves and restores kernel state (~microseconds). Coroutine switch is user-space only — often nanoseconds.
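The cooperative half of this can be sketched with plain Python generators: each `yield` is a voluntary suspension point, and a small round-robin scheduler (the `round_robin` helper here is illustrative, not a real asyncio API) interleaves the tasks.

```python
# A minimal sketch of cooperative scheduling. Each generator acts as a
# coroutine that suspends itself at every yield; the scheduler regains
# control only when a task volunteers to give it up.
from collections import deque

def task(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"  # voluntary suspension point

def round_robin(tasks):
    """Resume generators in turn until all are exhausted."""
    queue = deque(tasks)
    trace = []
    while queue:
        t = queue.popleft()
        try:
            trace.append(next(t))  # run until the next yield
            queue.append(t)        # it yielded cooperatively; re-enqueue
        except StopIteration:
            pass                   # task finished, drop it
    return trace

print(round_robin([task("a", 2), task("b", 2)]))
# steps interleave: ['a:0', 'b:0', 'a:1', 'b:1']
```

Nothing preempts a task here: if one generator looped without yielding, the others would never run. That is the entire cooperative contract in miniature.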
What a context switch actually costs
When the OS switches between threads, it saves the full execution context: registers, stack pointer, program counter. The kernel is involved, and the CPU caches that were warm for the previous thread go cold for the next one. On a modern CPU this costs roughly 1–10 microseconds, depending on the work done.
When a coroutine yields, it saves only what it needs to resume: typically the program counter and any local state the runtime tracks. No kernel involvement. The CPU cache may stay warm if coroutines run on the same thread. Cost is in the nanosecond range.
This is why systems like nginx or Node.js can handle tens of thousands of simultaneous connections without threads. Each connection is an event source. When a connection has no data, the coroutine/callback for it simply does not run.
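The gap can be sanity-checked from Python. This rough micro-benchmark assumes only that `await asyncio.sleep(0)` forces one full trip through the event loop; absolute numbers vary by machine, but the order of magnitude versus a microsecond-scale thread switch is the point.

```python
# Rough micro-benchmark: cost of suspending and resuming a coroutine.
# await asyncio.sleep(0) yields to the event loop and is resumed on the
# next pass, so each iteration is one user-space "context switch".
import asyncio
import time

async def switch_many(n):
    start = time.perf_counter()
    for _ in range(n):
        await asyncio.sleep(0)
    return (time.perf_counter() - start) / n

per_switch = asyncio.run(switch_many(100_000))
print(f"~{per_switch * 1e9:.0f} ns per event-loop round trip")
```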
Python asyncio: single-threaded concurrency
Python's asyncio runs all coroutines on a single thread. The event loop calls coroutines, they run until they hit an await, then the event loop picks up the next ready coroutine.
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)
asyncio.gather runs all fetches concurrently. While one is waiting for the HTTP response, others run. No threads, no locks.
The critical constraint: if any coroutine blocks without yielding, the entire event loop stalls. A CPU-intensive operation — image processing, SHA256 computation, a tight loop — that runs inside a coroutine will freeze every other connection until it completes.
# This will freeze all other connections for the duration of the computation
async def bad_handler():
    result = cpu_intensive_computation()  # never yields — blocks event loop
    return result

# Correct: run CPU work in a thread pool executor
async def good_handler():
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(None, cpu_intensive_computation)
    return result
asyncio is the right tool for I/O-bound workloads with many concurrent connections. It is the wrong tool for CPU-bound workloads. Use a process pool for those.
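A sketch of that advice, with a `sha256_many` stand-in for the CPU-bound work: `ProcessPoolExecutor` runs it in a separate process, so it bypasses the GIL and never stalls the event loop.

```python
# Offloading CPU-bound work to a process pool so the event loop stays
# responsive. sha256_many is a stand-in for any heavy computation.
import asyncio
import hashlib
from concurrent.futures import ProcessPoolExecutor

def sha256_many(data: bytes, rounds: int) -> str:
    digest = data
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # Runs in another process; the loop is free to serve other coroutines
        return await loop.run_in_executor(pool, sha256_many, b"payload", 10_000)

if __name__ == "__main__":
    print(asyncio.run(main()))  # a 64-character hex digest
```

Passing `None` as the executor (as in the earlier example) uses a thread pool, which keeps the loop responsive but still contends with the GIL; a process pool gives CPU work its own interpreter.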
Go goroutines: M:N scheduling
Go takes a different approach. Goroutines are not purely cooperative and not purely preemptive — they use an M:N scheduler: M goroutines multiplexed across N OS threads (where N = GOMAXPROCS, defaulting to CPU count).
The Go runtime scheduler preempts goroutines at safe points (function calls, channel operations). Since Go 1.14, it can also preempt goroutines in tight loops via signal-based preemption. Goroutines do not have to manually yield — the runtime handles it.
// Each of these runs concurrently. The runtime schedules them.
for _, url := range urls {
    go func(u string) {
        resp, err := http.Get(u)
        // ... handle response
    }(url)
}
Goroutines start with a 2–8KB stack (vs 1–8MB for OS threads) and grow dynamically. You can run a million goroutines on a single machine. OS threads at that count would exhaust memory.
💡 Why goroutines can block on I/O without stalling other goroutines
When a goroutine performs a blocking syscall (file read, network I/O), the Go runtime moves it off its OS thread. The OS thread handles the syscall. Another goroutine gets scheduled on a different OS thread. When the syscall completes, the blocked goroutine becomes runnable again and waits for an available thread.
This is the key difference from Python's event loop model: you can write blocking-style code in Go (resp, err := http.Get(url)) without stalling other goroutines, because the runtime transparently multiplexes.
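Python's closest analog to this transparency is `asyncio.to_thread` (3.9+), which ships a blocking call off to a worker thread and awaits the result. A sketch, with a hypothetical `blocking_fetch` standing in for a synchronous HTTP library:

```python
# Blocking-style code coexisting with the event loop: each call runs in a
# worker thread, so the two 0.1s "requests" overlap instead of serializing.
import asyncio
import time

def blocking_fetch(url: str) -> str:
    time.sleep(0.1)  # stand-in for a blocking HTTP library call
    return f"body of {url}"

async def main():
    start = time.perf_counter()
    bodies = await asyncio.gather(
        asyncio.to_thread(blocking_fetch, "https://a.example"),
        asyncio.to_thread(blocking_fetch, "https://b.example"),
    )
    return bodies, time.perf_counter() - start

bodies, elapsed = asyncio.run(main())
print(bodies, f"{elapsed:.2f}s")  # elapsed is ~0.1s, not 0.2s
```

The difference from Go is that you must opt in per call site; the Go runtime does the equivalent move for every blocking syscall automatically.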
Thread-per-connection vs coroutines: the numbers
A web server handling 10,000 concurrent connections:
- Thread-per-connection: 10,000 × 1MB stack = 10GB memory minimum. Context switching overhead at this scale is measurable. Linux has a default thread limit around 32,768.
- asyncio/Node.js event loop: 10,000 coroutines/callbacks on 1 thread. Memory footprint is small. CPU is fully utilized only if I/O dominates — any blocking code is catastrophic.
- Go goroutines: 10,000 × 4KB stack = 40MB. Runtime handles multiplexing. Blocking code works without special handling.
The right choice depends on the workload profile, not ideology.
Threads vs coroutines vs goroutines
These are not competing options — they model different tradeoffs between control, overhead, and scalability.
OS threads:
- Preemptive — OS decides when to switch
- 1–8MB stack per thread; ~30K limit on Linux
- True parallelism on multi-core CPUs
- Blocking I/O is fine — thread just waits
- Synchronization requires locks, which introduce contention

Coroutines and goroutines:
- Cooperative or M:N — runtime or explicit yield
- 2–64KB stack; millions possible
- Concurrency (interleaving) rather than parallelism per coroutine
- Python asyncio: blocking I/O freezes the event loop
- Go goroutines: blocking I/O handled transparently by runtime
Use OS threads when work is CPU-bound and parallelism matters. Use coroutines (asyncio, Node.js) for I/O-heavy servers where you control the async boundary. Use Go goroutines when you want simplicity — blocking-style code that scales without manual async management.
When cooperative scheduling fails
The cooperative model has one failure mode that preemptive threads do not: a coroutine that does not yield holds the thread.
# Infinite loop with no await — kills the event loop
async def broken():
    while True:
        process_item()  # no await, never yields
In Python, this is a bug that silently kills throughput. The event loop cannot recover. All other connections queue up behind it.
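One repair, sketched with hypothetical stand-ins for the work: insert an explicit suspension point inside the loop so other coroutines get a turn. The `heartbeat` coroutine below makes progress only because the loop actually yields.

```python
# Fixing a non-yielding loop by adding an explicit suspension point.
import asyncio

async def fixed(items, handled):
    for item in items:
        handled.append(item * 2)  # stand-in for process_item()
        await asyncio.sleep(0)    # yield so other coroutines can run

async def heartbeat(ticks):
    for _ in range(3):
        ticks.append("tick")      # progresses only if fixed() yields
        await asyncio.sleep(0)

async def main():
    handled, ticks = [], []
    await asyncio.gather(fixed(range(3), handled), heartbeat(ticks))
    return handled, ticks

print(asyncio.run(main()))
# the two coroutines interleave: ([0, 2, 4], ['tick', 'tick', 'tick'])
```

For genuinely CPU-heavy loop bodies, yielding every iteration is not enough; move the work to an executor as shown earlier.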
Go's runtime mitigates this with signal-based preemption (since 1.14) — even a tight CPU loop will eventually be preempted. But a goroutine calling a blocking C library function via cgo will still block its OS thread entirely.
When to reach for which model
Use goroutines (Go) when:
- You want concurrent I/O without async/await ceremony
- Workload is mixed (some CPU, some I/O)
- You want straightforward code that the runtime parallelizes automatically
Use async/await (Python, JavaScript) when:
- Workload is I/O-bound and latency-sensitive
- You are building a high-connection-count server
- CPU work can be isolated to thread/process pool executors
Use OS threads when:
- Workload is CPU-bound and needs true parallelism
- Work units are coarse-grained (threads are too heavy for thousands of fine-grained tasks)
- You are using a language or library without good coroutine support
A Python asyncio web server starts responding slowly under moderate load. CPU usage is low. All requests eventually complete, but p99 latency spikes to several seconds. What is the most likely cause?
(Difficulty: medium) The server handles REST API requests. Some endpoints call a third-party SDK that is not async-aware.

A. The event loop is processing too many coroutines simultaneously
Incorrect. The event loop handles thousands of coroutines efficiently — switching between them at await points is cheap. The number of concurrent coroutines is not the bottleneck.

B. A synchronous SDK call is blocking the event loop for the duration of the external call
Correct! Synchronous library calls inside an async handler run on the event loop thread and block it entirely for their duration. Every other coroutine queues up. One slow external call (e.g., a 500ms synchronous HTTP request) with 10 concurrent users means 5-second p99 latency for the last user in queue. Fix: wrap the sync call in loop.run_in_executor().

C. asyncio does not support concurrent requests without multiprocessing
Incorrect. asyncio handles concurrency well for I/O-bound workloads. The problem is blocking calls, not concurrency limits.

D. Python's GIL is preventing true concurrency
Incorrect. The GIL prevents parallel CPU execution across threads, but asyncio uses a single thread intentionally. The GIL is not the bottleneck here.

Hint: Think about what happens to the event loop when a coroutine calls a synchronous function that takes 500ms.