OS Timer Internals: Clock Ticks, Min-Heaps, and Timer Coalescing — Blog

How OS timers work

A hardware timer chip (HPET, local APIC) generates periodic interrupts at a configurable frequency. On Linux, the default is HZ=250 (250 interrupts/second, 4ms granularity) on servers. Each interrupt is a "tick" — the kernel updates internal time, decrements timer countdowns, and fires expired timers.

Timer data structure: min-heap (or timing wheel on some systems) keyed by expiration time.

Min-heap of pending timers:
     [5ms]
    /     \
[12ms]   [20ms]
  /
[35ms]

On each tick:

Increment jiffies (kernel time counter)
Peek at heap minimum
If min.expiry <= now: fire callback, pop from heap, repeat
If min.expiry > now: done — no other timers have expired (heap property)

O(1) per tick regardless of how many timers are pending. O(log n) to insert or remove a timer.

Timer granularity and the `sleep(1ms)` problem

import time
start = time.monotonic()
time.sleep(0.001)  # Request: sleep 1ms
elapsed = time.monotonic() - start
print(f"Actual sleep: {elapsed*1000:.2f}ms")
# Output on Linux: ~1.0-4.0ms depending on HZ and scheduling

sleep(1ms) asks the OS to wake the process after at least 1ms. With HZ=250 (4ms ticks), the earliest wakeup is 1-4 ticks (4-16ms). With HZ=1000 (1ms ticks), wakeups are accurate to ~1ms. High-frequency game loops and real-time audio require CLOCK_REALTIME with nanosleep() and often busy-wait for the final sub-millisecond.

Timer coalescing

Modern kernels coalesce timers expiring within a slack interval to reduce wakeups:

Without coalescing: 3 wakeups
  t=100ms: timer A fires
  t=102ms: timer B fires
  t=105ms: timer C fires

With coalescing (5ms slack):
  t=105ms: all three fire together — 1 wakeup

Linux uses a timer slack per-process (default 50µs; prctl(PR_SET_TIMERSLACK, ...)). Android uses aggressive coalescing for power savings. The tradeoff: timers may fire up to slack late.

Language runtime timers add a layer on top of OS timers — they may batch or approximate OS timer calls

ConceptOperating Systems

Go's time.Sleep doesn't call nanosleep() once per goroutine. The Go runtime maintains its own min-heap of sleeping goroutines and sets a single OS timer for the soonest expiry. When that timer fires, the runtime wakes all goroutines due to run and sets the next OS timer. This batching means thousands of Go goroutines sleeping for different durations require very few OS timer syscalls. Java's ScheduledExecutorService uses a similar approach — one dedicated thread with a priority queue manages all scheduled tasks.

Prerequisites

Min-heap data structure
OS scheduling
System calls

Key Points

Linux HZ=250 (default server config): 4ms tick, minimum timer resolution ~4ms.
HZ=1000 (CONFIG_HZ_1000): 1ms tick, more responsive but higher interrupt overhead.
Tickless kernels (NO_HZ): skip ticks when the CPU is idle, reducing power usage.
CLOCK_MONOTONIC for elapsed time measurement; CLOCK_REALTIME for wall clock (subject to NTP adjustments).

Measuring time correctly

// Go: elapsed time measurement
start := time.Now()
doWork()
elapsed := time.Since(start)  // uses CLOCK_MONOTONIC internally

// Wrong for elapsed time: wall clock can jump backward (NTP, leap seconds)
// time.Now() is fine in Go because it falls back to monotonic where available

# Python: use time.monotonic() for elapsed time
import time
start = time.monotonic()  # monotonic clock — never goes backward
doWork()
elapsed = time.monotonic() - start

# time.time() uses wall clock — can jump during NTP sync

# Linux: check timer resolution
cat /proc/timer_list | head -20
# See which timers are registered, their slack, expiry time

You run time.Sleep(time.Millisecond) in Go 1000 times in a loop and measure the total time. You expect ~1 second. You observe ~4 seconds. Why?

medium

Linux kernel HZ=250 (4ms ticks). Each Sleep(1ms) is requesting a 1ms sleep. The kernel's minimum timer resolution is 4ms.

AGo's runtime sleeps for longer than requested to batch goroutine wakeups
Incorrect.Go's runtime tries to wake goroutines on time, not later. The issue is the OS timer granularity, not Go batching wakeups for efficiency.
BThe kernel rounds Sleep(1ms) up to the nearest tick boundary. With HZ=250 (4ms ticks), each 1ms sleep waits for up to 4ms. 1000 × 4ms = 4 seconds.
Correct!When Go calls nanosleep(1ms), the kernel sets a timer to expire at the next tick at or after 1ms. With 4ms ticks, a 1ms request expires at the next 4ms tick — on average, ~2ms late (up to 4ms late). Over 1000 iterations, this accumulates to ~2-4× longer than expected. To get accurate sub-4ms sleeps, use HZ=1000 kernel or busy-wait. For benchmarks, this is why you measure elapsed time externally rather than trusting that N × sleep(X) = N*X total time.
CGo runtime adds 3ms overhead per goroutine context switch
Incorrect.Go's scheduler overhead per context switch is nanoseconds to microseconds, not milliseconds.
DThe loop runs in a single goroutine and goroutines can't be scheduled that fast
Incorrect.A single goroutine in a loop sleeps fine. The issue is OS timer resolution, not goroutine scheduling throughput.

Hint:What is the minimum timer resolution with Linux HZ=250? What happens when you request a sleep shorter than one tick?