Bloom Filter: Probabilistic Membership with Tunable False Positive Rate


A Bloom filter is a space-efficient probabilistic data structure that answers 'definitely not in set' or 'possibly in set'. It uses k hash functions to set bits in an m-bit array. False positives are possible; false negatives are not. The false positive rate is tunable via m and k. HBase, Cassandra, and Chrome's malicious URL checker all use Bloom filters.


How a Bloom filter works

A Bloom filter is an m-bit array initialized to all zeros, plus k independent hash functions. Each hash function maps an element to one of the m bit positions.

Insertion: hash the element with each of the k functions, set all k bits to 1.

Lookup: hash the element with each of the k functions. If any bit is 0 → element definitely not in set. If all bits are 1 → element probably in set (false positive possible).

Deletion: not supported in standard Bloom filters. Setting a bit back to 0 could invalidate other elements that share that bit.

import hashlib
import math

class BloomFilter:
    def __init__(self, capacity: int, false_positive_rate: float):
        # Optimal bit array size: m = -(n * ln(p)) / (ln(2)^2)
        self.m = math.ceil(-(capacity * math.log(false_positive_rate)) / (math.log(2) ** 2))
        # Optimal number of hash functions: k = (m/n) * ln(2)
        self.k = math.ceil((self.m / capacity) * math.log(2))
        self.bits = bytearray(math.ceil(self.m / 8))

    def _hash(self, item: str, seed: int) -> int:
        # md5 is used here for portability in a demo; production filters
        # typically use faster non-cryptographic hashes (e.g. MurmurHash)
        h = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
        return int(h, 16) % self.m

    def _get_bit(self, pos: int) -> bool:
        return bool(self.bits[pos // 8] & (1 << (pos % 8)))

    def _set_bit(self, pos: int):
        self.bits[pos // 8] |= (1 << (pos % 8))

    def add(self, item: str):
        for seed in range(self.k):
            self._set_bit(self._hash(item, seed))

    def might_contain(self, item: str) -> bool:
        return all(self._get_bit(self._hash(item, seed)) for seed in range(self.k))

Usage:

bf = BloomFilter(capacity=10_000, false_positive_rate=0.01)
bf.add("user:alice")
bf.add("user:bob")

print(bf.might_contain("user:alice"))   # True — added items always return True
print(bf.might_contain("user:charlie")) # False — a False answer is always definitive
# A key that was never added returns True with ~1% probability (false positive)

The Bloom filter false positive rate depends on the fraction of set bits — as the filter fills, all lookups eventually return 'maybe'.


A Bloom filter with m bits and k hash functions has false positive probability p ≈ (1 - e^(-kn/m))^k after n insertions. As n grows, the fraction of set bits, 1 - e^(-kn/m), approaches 1, so the probability that all k positions for a new lookup happen to be 1 rises with it. A completely full Bloom filter returns 'maybe' for everything. The filter must therefore be sized at construction time for the expected number of insertions.
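The saturation effect can be seen directly by plugging overshoot values of n into the formula. The m and k below are illustrative, computed for a filter designed for 10,000 items at 1% FPR:

```python
import math

# Filter sized for n=10_000 items at p=0.01:
# m = -n*ln(p)/ln(2)^2, k = (m/n)*ln(2)
m = 95_851   # bits
k = 7        # hash functions

# Watch the theoretical FPR climb as inserts overshoot the design capacity
for n in [10_000, 20_000, 50_000, 100_000]:
    fill = 1 - math.exp(-k * n / m)   # expected fraction of set bits
    p = fill ** k                     # probability all k lookup bits are set
    print(f"n={n:>7,}  fill={fill:.1%}  FPR={p:.2%}")
```

At the design capacity roughly half the bits are set and the FPR is the target 1%; at 10× the capacity nearly every bit is set and almost every lookup answers 'maybe'.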

Prerequisites

  • Hash functions
  • Bit arrays
  • Probability basics

Key Points

  • False negatives are impossible — if an element was added, all its bits are 1.
  • False positives occur when all k hash positions for an unlisted element are coincidentally set by other elements.
  • Optimal k for a given m and n is k = (m/n) × ln(2) ≈ 0.693 × m/n.
  • Adding bits pays off superlinearly: with optimal k, every extra ~4.8 bits per element cuts the FPR by 10×. For 1% FPR with n=1M elements, you need ~9.6 Mbits (~1.2 MB) — far smaller than a hash set holding the elements themselves.

False positive rate formula

p ≈ (1 - e^(-kn/m))^k

where:
  n = number of elements inserted
  m = number of bits in the array
  k = number of hash functions
  e = Euler's number (2.718...)

Practical sizing:

| Elements (n) | Target FPR | Bits needed (m) | Hash functions (k) | Memory |
|---|---|---|---|---|
| 1,000,000 | 1% | 9,585,058 | 7 | ~1.1 MB |
| 1,000,000 | 0.1% | 14,377,588 | 10 | ~1.7 MB |
| 10,000,000 | 1% | 95,850,584 | 7 | ~11.5 MB |
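The sizing rows above follow directly from the two formulas used in the constructor earlier. A short helper reproduces them (`bloom_size` is an illustrative name, not a library function):

```python
import math

def bloom_size(n: int, p: float) -> tuple[int, int]:
    """Return (bits m, hash count k) for n expected elements at target FPR p."""
    m = math.ceil(-(n * math.log(p)) / (math.log(2) ** 2))  # optimal bit count
    k = math.ceil((m / n) * math.log(2))                    # optimal hash count
    return m, k

for n, p in [(1_000_000, 0.01), (1_000_000, 0.001), (10_000_000, 0.01)]:
    m, k = bloom_size(n, p)
    print(f"n={n:>10,}  p={p:<6}  m={m:>11,} bits  k={k}  ~{m / 8 / 2**20:.1f} MiB")
```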

Production use cases

Database read amplification (HBase, Cassandra): Databases with LSM trees (log-structured merge trees) store data across multiple SSTables. A read for a key that doesn't exist must check every SSTable — expensive disk I/O. Each SSTable gets a Bloom filter. Before checking an SSTable, check its Bloom filter. A "definitely not here" result skips the disk read entirely. False positives cause unnecessary disk reads; false negatives would cause missing data (impossible with Bloom filters).
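The LSM read path described above might be sketched as follows. The `SSTable` class and `read_from_disk` are illustrative stand-ins, not HBase or Cassandra APIs, and a Python set stands in for the per-table Bloom filter (it honors the same no-false-negatives contract, minus the false positives):

```python
class SSTable:
    def __init__(self, data: dict):
        self._data = dict(data)        # stand-in for the on-disk file
        self.keys_filter = set(data)   # stand-in for a per-table Bloom filter

    def might_contain(self, key: str) -> bool:
        # A real Bloom filter may also answer True spuriously; it never
        # answers False for a key the table actually holds.
        return key in self.keys_filter

    def read_from_disk(self, key: str):
        return self._data.get(key)     # the expensive I/O the filter avoids

def lsm_get(sstables: list, key: str):
    for table in sstables:                # newest-to-oldest in a real LSM tree
        if not table.might_contain(key):  # "definitely not here" — skip the I/O
            continue
        value = table.read_from_disk(key)
        if value is not None:
            return value
    return None
```

The point of the pattern is the `continue`: for absent keys, most tables are skipped without touching disk.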

URL filtering (Chrome Safe Browsing): Chrome downloads a Bloom filter of known malicious URLs. When you navigate to a URL, Chrome checks the local filter first. A "definitely not malicious" result skips the network round-trip to Google's servers. A "possibly malicious" result (could be false positive) triggers a network verification call.

Cache pre-check: Before a cache miss triggers a DB query, check a Bloom filter of existing keys. Eliminates cache stampede attacks where attackers query keys known not to exist to bypass the cache and hammer the database.
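The cache pre-check might be sketched like this, reusing the hash scheme from the class above. `TinyBloom`, the `cache` dict, and the `db` dict are illustrative stand-ins, not any specific cache or database API:

```python
import hashlib

class TinyBloom:
    # Minimal Bloom filter, same scheme as the BloomFilter class above
    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _pos(self, item: str, seed: int) -> int:
        return int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16) % self.m

    def add(self, item: str):
        for s in range(self.k):
            p = self._pos(item, s)
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all((self.bits[self._pos(item, s) // 8] >> (self._pos(item, s) % 8)) & 1
                   for s in range(self.k))

bloom = TinyBloom()
cache = {}                               # stand-in for Redis/Memcached
db = {"user:alice": {"name": "Alice"}}   # stand-in for the database

for key in db:
    bloom.add(key)   # filter tracks keys that actually exist in the DB

def get(key: str):
    if not bloom.might_contain(key):
        return None                 # definitely absent: no cache miss, no DB hit
    if key in cache:
        return cache[key]
    value = db.get(key)             # only "maybe present" keys reach the DB
    if value is not None:
        cache[key] = value
    return value
```

Queries for nonexistent keys are rejected by the filter before they can bypass the cache and reach the database.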

Counting Bloom filters: enabling deletion

Standard Bloom filters don't support deletion because clearing a bit could break other elements that set the same bit. Counting Bloom filters replace each bit with a small counter (typically 4 bits):

  • Insertion: increment the k counters instead of setting bits
  • Deletion: decrement the k counters
  • Lookup: check if all k counters are > 0

The tradeoff: 4× memory overhead (4 bits per counter vs 1 bit). Counter overflow (a counter already at the 4-bit maximum of 15) corrupts the count, and decrementing a corrupted counter can introduce false negatives — use wider (e.g. 8-bit) counters for very high-cardinality sets, or saturate counters at their maximum instead of wrapping.
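A counting filter might be sketched as below. For simplicity it uses one byte per counter rather than the 4-bit packing described above, and saturates counters at 255 instead of letting them overflow; `m` and `k` are illustrative defaults:

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k = m, k
        self.counters = bytearray(m)   # one 8-bit counter per position

    def _pos(self, item: str, seed: int) -> int:
        h = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
        return int(h, 16) % self.m

    def add(self, item: str):
        for s in range(self.k):
            p = self._pos(item, s)
            if self.counters[p] < 255:   # saturate instead of overflowing
                self.counters[p] += 1

    def remove(self, item: str):
        # Caller must only remove items previously added,
        # otherwise counters become inconsistent.
        for s in range(self.k):
            p = self._pos(item, s)
            if self.counters[p] > 0:
                self.counters[p] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[self._pos(item, s)] > 0 for s in range(self.k))
```

Lookup is unchanged from the standard filter except that "bit is 1" becomes "counter is greater than 0".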

Scalable Bloom filters (SBF) address the fixed-capacity limitation by chaining multiple Bloom filters, each with a stricter FPR target, adding new filters when existing ones fill up.

You use a Bloom filter to skip SSTable lookups in a database. After 6 months, the database grows 10× beyond the original capacity estimate. What happens to query performance?


The Bloom filter was sized for 1M elements with 1% FPR. The actual element count is now 10M.

  • A. Nothing changes — Bloom filters don't have a capacity limit, they just use more memory
    Incorrect. Bloom filters have a fixed bit array size set at construction. Adding more elements than the design capacity doesn't expand the filter — it increases the occupancy of the bit array, which increases the false positive rate.
  • B. The false positive rate increases dramatically — more bits are set to 1, so more non-existent key lookups incorrectly appear in the filter and trigger unnecessary SSTable reads
    Correct! A Bloom filter sized for 1M elements at 1% FPR will have a false positive rate approaching 100% when filled with 10M elements. Almost all bit positions will be 1, so every lookup returns 'possibly in set'. The entire point of the Bloom filter — skipping unnecessary SSTable disk reads — is defeated. All reads now hit every SSTable. Fix: rebuild the Bloom filter with the new capacity, or use a scalable Bloom filter that chains multiple filters as capacity grows.
  • C. The false negative rate increases — the filter starts missing elements that were actually inserted
    Incorrect. Bloom filters cannot produce false negatives. Once a bit is set to 1, it stays 1. An inserted element's k bits are always 1 regardless of how many other elements are added.
  • D. Query performance improves because more elements are present in the SSTable, increasing cache hit rate
    Incorrect. Cache hit rate is independent of the Bloom filter's accuracy. The problem with an overloaded Bloom filter is unnecessary disk I/O from false positives, not cache misses.

Hint: What happens to the false positive rate as the filter's bit array approaches full occupancy?