LRU Cache: HashMap + Doubly Linked List, Cache Thrashing, and Production Issues


LRU evicts the least recently used item when the cache is full. The O(1) implementation combines a hashmap (O(1) lookup) with a doubly linked list (O(1) reorder). Cache thrashing occurs when the working set exceeds cache size — every access is a miss. Sequential scan workloads poison LRU by filling it with data that's never reaccessed.


The O(1) LRU implementation

LRU requires two operations in O(1): look up a key and move the accessed item to the front of the usage order. A doubly linked list maintains order (O(1) node removal and insertion); a hashmap provides O(1) key lookup with direct pointers to list nodes.

class Node:
    def __init__(self, key: int, value: int):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = {}          # key → Node
        self.head = Node(0, 0)   # dummy head (most recent)
        self.tail = Node(0, 0)   # dummy tail (least recent)
        self.head.next = self.tail
        self.tail.prev = self.head

    def _remove(self, node: Node):
        node.prev.next = node.next
        node.next.prev = node.prev

    def _add(self, node: Node):
        """Add right after head (most recently used position)."""
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key: int) -> int:
        if key in self.cache:
            node = self.cache[key]
            self._remove(node)
            self._add(node)
            return node.value
        return -1

    def put(self, key: int, value: int) -> None:
        if key in self.cache:
            node = self.cache[key]
            node.value = value
            self._remove(node)
            self._add(node)
        else:
            if len(self.cache) >= self.capacity:
                lru = self.tail.prev   # least recently used is before dummy tail
                self._remove(lru)
                del self.cache[lru.key]
            new_node = Node(key, value)
            self.cache[key] = new_node
            self._add(new_node)

The dummy head and tail nodes eliminate edge cases (empty list, single element). Eviction always removes tail.prev. Insertion always adds after head.
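The benefit of the sentinels is easiest to see in isolation. In this minimal sketch (separate from the class above), every real node always has non-None `prev` and `next` pointers, so removal and insertion never branch on empty-list or end-of-list cases:

```python
class Node:
    def __init__(self, key=0, value=0):
        self.key, self.value = key, value
        self.prev = self.next = None

# Sentinel head/tail: the list starts "empty", yet every real node
# will always have non-None neighbors — no edge-case checks needed.
head, tail = Node(), Node()
head.next, tail.prev = tail, head

def add_after_head(node):
    node.prev, node.next = head, head.next
    head.next.prev = node
    head.next = node

def remove(node):
    node.prev.next = node.next
    node.next.prev = node.prev

a, b = Node(1, 10), Node(2, 20)
add_after_head(a)       # head <-> a <-> tail
add_after_head(b)       # head <-> b <-> a <-> tail
assert tail.prev is a   # LRU position: just before the dummy tail
remove(a)               # same code path even when a is the last real node
assert tail.prev is b
```

The same two helpers handle a full cache, a single-element cache, and an empty one, which is exactly why `_remove` and `_add` in the class above contain no conditionals.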

Python shortcut: OrderedDict

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key: int) -> int:
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)   # mark as most recently used
        return self.cache[key]

    def put(self, key: int, value: int) -> None:
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict LRU (first item)

OrderedDict.move_to_end() is O(1). popitem(last=False) removes the oldest item. This is cleaner for most use cases — use the manual implementation when you need more control (e.g., custom eviction hooks).
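A quick trace of the eviction order with capacity 2 (the class is repeated here so the snippet runs standalone):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key: int) -> int:
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key: int, value: int) -> None:
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

cache = LRUCache(2)
cache.put(1, 1)
cache.put(2, 2)
assert cache.get(1) == 1   # touching key 1 makes it most recently used
cache.put(3, 3)            # evicts key 2 (the LRU), not key 1
assert cache.get(2) == -1  # key 2 is gone
assert cache.get(3) == 3
```

Note that the `get(1)` call is what saves key 1: without it, key 1 would have been the eviction victim.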

Sequential scan access patterns defeat LRU — each scan fills the cache with data that is never reaccessed


LRU assumes recently accessed data will be reaccessed soon. Sequential scans violate this: a full table scan touches every page in order, filling the cache with the last pages scanned. Those pages are never requested again, but they evict hot pages that were actually in use. This is called sequential flooding. Databases counter it with scan-resistant policies such as LRU-K (e.g., LRU-2) or dedicated ring buffers for sequential-scan pages.
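The effect is easy to reproduce. This sketch wraps an `OrderedDict` as a 100-slot LRU, warms it with 80 hot keys, then runs a 200-key sequential scan (key names are illustrative):

```python
from collections import OrderedDict

def make_lru(capacity):
    """Return an access(key) -> hit closure over an OrderedDict LRU."""
    cache = OrderedDict()
    def access(key):
        hit = key in cache
        if hit:
            cache.move_to_end(key)
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
        return hit
    return access, cache

access, cache = make_lru(100)

hot = [f"hot:{i}" for i in range(80)]
for k in hot * 3:                  # warm up: all 80 hot keys cached
    access(k)
assert all(k in cache for k in hot)

for i in range(200):               # one sequential scan, no reaccess
    access(f"scan:{i}")

# The scan evicted every hot key; the cache now holds only
# the last 100 scan keys — cold data that will never be read again.
assert not any(k in cache for k in hot)
```

Every hot-key request after the scan is a miss until normal traffic re-caches them one by one — exactly the post-batch-job hit-rate collapse described later in this article.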

Prerequisites

  • Cache eviction policies
  • Database buffer pool management

Key Points

  • LRU works best when access patterns show temporal locality (recently used data is soon reused).
  • Sequential scans (ETL, analytics queries) destroy LRU performance by poisoning the cache.
  • PostgreSQL uses a clock-sweep algorithm (approximate LRU) resistant to sequential flooding.
  • MySQL InnoDB uses a modified LRU with a 'young' and 'old' sublist — sequential pages enter the old sublist and don't displace hot data.
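The midpoint-insertion idea can be sketched in a few lines. This is a simplified illustration loosely modeled on InnoDB's young/old sublists — the 5/8 split and the promote-on-second-access rule here are illustrative, not InnoDB's exact logic (InnoDB also applies a time window before promotion):

```python
from collections import OrderedDict

class MidpointLRU:
    """First access puts a page in the 'old' sublist; only a second
    access promotes it to 'young'. A one-pass scan churns 'old'
    without evicting anything hot from 'young'."""
    def __init__(self, capacity, young_fraction=0.625):
        self.young_cap = int(capacity * young_fraction)
        self.old_cap = capacity - self.young_cap
        self.young = OrderedDict()   # hot pages, accessed 2+ times
        self.old = OrderedDict()     # probationary pages

    def access(self, key):
        if key in self.young:
            self.young.move_to_end(key)
            return True
        if key in self.old:
            del self.old[key]        # second access: promote to young
            self.young[key] = True
            if len(self.young) > self.young_cap:
                demoted, _ = self.young.popitem(last=False)
                self.old[demoted] = True   # demote young LRU to old
                if len(self.old) > self.old_cap:
                    self.old.popitem(last=False)
            return True
        self.old[key] = True         # first access: enter old sublist
        if len(self.old) > self.old_cap:
            self.old.popitem(last=False)
        return False

cache = MidpointLRU(100)
for _ in range(2):                   # two passes promote hot keys to young
    for i in range(30):
        cache.access(f"hot:{i}")
for i in range(200):                 # scan keys are touched only once
    cache.access(f"scan:{i}")
assert all(f"hot:{i}" in cache.young for i in range(30))
```

The 200-key scan cycles entirely through the old sublist: each scan page is accessed once, never promoted, and evicted by the next scan pages, leaving the young sublist intact.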

Cache thrashing

Thrashing occurs when the working set (all data needed in a time window) exceeds the cache size. Every access misses:

Working set = {A, B, C, D, E}  (5 items)
Cache size  = 3

Access sequence: A B C D E A B C D E ...

Time 1: get A → miss, cache=[A]
Time 2: get B → miss, cache=[B,A]
Time 3: get C → miss, cache=[C,B,A]
Time 4: get D → miss, evict A, cache=[D,C,B]
Time 5: get E → miss, evict B, cache=[E,D,C]
Time 6: get A → miss, evict C, cache=[A,E,D]  (A was just evicted!)
...

Hit rate = 0%. Every item is evicted before it's reused. Solutions:

  • Increase cache size to cover the working set
  • Partition the cache (hot vs warm segments) using Segmented LRU (SLRU)
  • Switch to LFU for workloads where frequency matters more than recency
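The 0% hit rate in the trace above can be reproduced directly. This minimal sketch measures LRU hit rate for an access sequence, using an `OrderedDict` as the cache:

```python
from collections import OrderedDict

def hit_rate(capacity, accesses):
    """Simulate an LRU cache and return the fraction of hits."""
    cache, hits = OrderedDict(), 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)      # refresh recency on a hit
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the LRU item
    return hits / len(accesses)

# Working set of 5 cycled through a 3-slot cache: every access misses.
assert hit_rate(3, list("ABCDE") * 10) == 0.0

# Capacity >= working set: only the 5 cold-start misses remain.
assert hit_rate(5, list("ABCDE") * 10) == 0.9
```

The jump from 0% to 90% with just two extra slots shows why thrashing is a cliff, not a gradual degradation: LRU hit rate collapses the moment the working set no longer fits.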
Thread-safe LRU for concurrent access

The basic implementation above is not thread-safe. Multiple goroutines or threads reading and writing simultaneously corrupt the linked list and map.

In Python, a simple lock wrapper:

import threading

class ThreadSafeLRUCache:
    def __init__(self, capacity: int):
        self._cache = LRUCache(capacity)
        self._lock = threading.Lock()

    def get(self, key: int) -> int:
        with self._lock:
            return self._cache.get(key)

    def put(self, key: int, value: int) -> None:
        with self._lock:
            self._cache.put(key, value)

A coarse lock works for low-contention caches. For high-concurrency cases:

  • Shard the cache into N independent LRU caches based on hash(key) % N
  • Each shard has its own lock — reduces contention by N×
  • Redis uses a single-threaded event loop for O(1) atomic operations without locking
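A sketch of the sharding approach (the `ShardedLRUCache` name and 8-shard default are illustrative; each shard is an `OrderedDict`-based LRU so the snippet runs standalone):

```python
import threading
from collections import OrderedDict

class ShardedLRUCache:
    """N independent LRU shards, each with its own lock.
    Keys map to shards by hash, so unrelated keys rarely contend."""
    def __init__(self, capacity, num_shards=8):
        self.num_shards = num_shards
        self.per_shard = max(1, capacity // num_shards)
        self.shards = [OrderedDict() for _ in range(num_shards)]
        self.locks = [threading.Lock() for _ in range(num_shards)]

    def _index(self, key):
        return hash(key) % self.num_shards

    def get(self, key):
        i = self._index(key)
        with self.locks[i]:          # only this shard is blocked
            shard = self.shards[i]
            if key not in shard:
                return -1
            shard.move_to_end(key)
            return shard[key]

    def put(self, key, value):
        i = self._index(key)
        with self.locks[i]:
            shard = self.shards[i]
            if key in shard:
                shard.move_to_end(key)
            shard[key] = value
            if len(shard) > self.per_shard:
                shard.popitem(last=False)   # evict the shard-local LRU

cache = ShardedLRUCache(capacity=1024, num_shards=8)
cache.put("user:42", "alice")
assert cache.get("user:42") == "alice"
assert cache.get("missing") == -1
```

One trade-off worth noting: eviction becomes shard-local rather than globally LRU, so a shard that happens to receive more hot keys can evict items a global LRU would have kept. In practice the hash spreads keys evenly enough that this rarely matters.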

You have an LRU cache with capacity 100. Your application alternates between two access patterns: 80 hot keys accessed frequently, and a batch job that scans 200 different keys sequentially once per hour. After each batch job, cache hit rate drops from 95% to near 0% for several minutes. Why?

Difficulty: medium

The batch job runs sequentially through 200 unique keys. The 80 hot keys fit in the cache. The batch job's keys are never reaccessed after the scan.

  • A. LRU can't store 80 hot keys and still have space for batch keys simultaneously
    Incorrect. The cache holds 100 items. 80 hot keys leave room for 20 batch keys. The batch job problem isn't space — it's that 200 sequential batch keys completely replace all 80 hot keys.
  • B. The batch job scans 200 keys, which fills the 100-item cache twice over. Every hot key is evicted by the scan. After the batch job, the cache holds the last 100 batch keys — all cold data that won't be reaccessed.
    Correct! A 200-key sequential scan fills and refills a 100-item cache completely. By the time the scan finishes, all 80 hot keys have been evicted and the cache contains the last 100 batch keys — data that will never be requested again. Normal traffic then misses on every hot key until LRU repopulates them through natural access. Fix options: use a ring buffer for the batch scan (so batch pages don't evict hot pages), pin the 80 hot keys in a protected cache segment (SLRU), or simply run the batch job at low-traffic hours and accept the brief warmup period.
  • C. The LRU implementation has a bug — concurrent access from the batch job and normal traffic corrupts the cache
    Incorrect. The problem described is not corruption — it's eviction. Cache hit rate drops predictably after the scan and recovers as hot keys are re-cached. That's eviction behavior, not corruption.
  • D. 100 capacity is insufficient to hold 80 hot keys — you need a cache size of at least 200
    Incorrect. 80 hot keys fit in a 100-item cache with room to spare. The problem is 200 sequential batch keys overwhelming the cache, not the cache being too small for hot keys.

Hint: What happens to the 80 hot keys as the batch job scans through 200 unique keys with a 100-item cache?