Elasticsearch Full-Text Search: Inverted Index, BM25, and the Analyzer Pipeline


Full-text search in Elasticsearch works through three layers: the analyzer pipeline (tokenization, normalization, stemming), the inverted index (term → document mapping), and BM25 scoring (term frequency saturation + document length normalization). Understanding these layers explains why match queries behave differently than term queries and why field length matters for relevance.


The inverted index

An inverted index maps terms to the documents that contain them. When Elasticsearch indexes a document, the analyzer pipeline extracts terms; those terms become keys in the index pointing to posting lists.

Documents:
  Doc1: "PostgreSQL supports full-text search"
  Doc2: "Elasticsearch full-text search at scale"
  Doc3: "Full-text indexing strategies"

Inverted index (after analysis):
  "postgresql"  → [Doc1]
  "support"     → [Doc1]          ← stemmed from "supports"
  "full"        → [Doc1, Doc2, Doc3]
  "text"        → [Doc1, Doc2, Doc3]
  "search"      → [Doc1, Doc2]
  "elasticsearch" → [Doc2]
  "scale"       → [Doc2]
  "index"       → [Doc3]          ← stemmed from "indexing"
  "strategi"    → [Doc3]          ← stemmed from "strategies"

A query for "full text search" looks up full, text, search in this index and merges their posting lists. Doc1 and Doc2 appear in all three lists; Doc3 only in two. The merge result is then scored.
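This structure can be sketched in a few lines of Python: a toy model of how posting lists are built and merged. There is no stemming or stop-word removal here, so "supports" and "indexing" stay whole; the function names are illustrative, not an Elasticsearch API.

```python
from collections import defaultdict

def analyze(text):
    # Stand-in for the analyzer pipeline: lowercase and split on hyphens/spaces.
    return text.lower().replace("-", " ").split()

def build_inverted_index(docs):
    """Map each term to a sorted posting list of the doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    "Doc1": "PostgreSQL supports full-text search",
    "Doc2": "Elasticsearch full-text search at scale",
    "Doc3": "Full-text indexing strategies",
}
index = build_inverted_index(docs)

# Query "full text search": merge the three posting lists into one candidate set.
query_terms = ["full", "text", "search"]
candidates = sorted(set().union(*(index.get(t, []) for t in query_terms)))
```

Real posting lists also carry positions and frequencies for scoring; the merge here only produces the candidate set that scoring then ranks.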

The analyzer pipeline

Every analyzed field runs text through a pipeline at index time and again at query time:

Input: "PostgreSQL Supports Full-Text Search"
         ↓
1. Character filter: strip HTML, normalize unicode
         ↓
2. Tokenizer: split on whitespace/punctuation
   → ["PostgreSQL", "Supports", "Full", "Text", "Search"]
         ↓
3. Token filters:
   - lowercase:  ["postgresql", "supports", "full", "text", "search"]
   - stop words: ["postgresql", "supports", "full", "text", "search"]  (no stopwords here)
   - stemmer:    ["postgresql", "support",  "full", "text", "search"]
         ↓
Output terms stored in inverted index

The same pipeline runs on the query string. "Searching" in a query becomes "search" after stemming, which matches "search" in the index. This is why match queries handle morphological variants — the query and the indexed terms go through identical analysis.
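The pipeline above can be mimicked in a short Python sketch. This is a toy approximation: the suffix rules are far cruder than the Porter-style stemmers Elasticsearch actually ships, and the stop-word list is made up for illustration.

```python
import re

def analyze(text, stopwords=frozenset({"a", "at", "the"})):
    """Toy analyzer pipeline: tokenize, lowercase, drop stop words, stem."""
    # 2. Tokenizer: split on anything that isn't a letter or digit
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # 3a. Lowercase filter
    tokens = [t.lower() for t in tokens]
    # 3b. Stop-word filter
    tokens = [t for t in tokens if t not in stopwords]
    # 3c. Naive suffix stemmer (a crude stand-in for a real stemmer)
    stemmed = []
    for t in tokens:
        for suffix in ("ies", "ing", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: len(t) - len(suffix)] + ("i" if suffix == "ies" else "")
                break
        stemmed.append(t)
    return stemmed
```

Running the same function at index time and query time is what makes "Searching" in a query line up with "search" in the index.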

match queries analyze the input; term queries don't — mixing them up causes silent no-results bugs


match applies the field's analyzer to the query string before lookup. term skips analysis entirely and looks up the exact string in the index. If a field is analyzed (the default for text fields), its inverted index contains lowercase, stemmed tokens — not the original text. A term query for 'PostgreSQL' on an analyzed text field finds nothing because the index only contains 'postgresql'. This is one of the most common Elasticsearch debugging problems.
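The mismatch is easy to reproduce against a toy in-memory index. This is a simulation: in real Elasticsearch the analysis happens server-side using the field's mapped analyzer, but the lookup logic is the same in spirit.

```python
def analyze(text):
    # Stand-in for the field's analyzer: lowercase + whitespace split.
    return text.lower().split()

# Inverted index for one indexed document, already analyzed at index time:
index = {"postgresql": [1], "supports": [1], "full-text": [1], "search": [1]}

def term_query(index, value):
    """term: exact lookup of the raw string; no analysis is applied."""
    return index.get(value, [])

def match_query(index, text):
    """match: run the query string through the analyzer, then look up each token."""
    hits = set()
    for token in analyze(text):
        hits.update(index.get(token, []))
    return sorted(hits)
```

`term_query(index, "PostgreSQL")` returns nothing because the index only holds the lowercased token, while `match_query(index, "PostgreSQL")` lowercases the query first and finds the document.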

Prerequisites

  • Elasticsearch mappings
  • text vs keyword field types
  • Analyzer pipeline

Key Points

  • text fields are analyzed — use match for full-text search on text fields.
  • keyword fields are not analyzed — use term for exact matches on keyword fields (tags, IDs, status values).
  • A term query on a text field silently returns 0 results if the input doesn't match the analyzed token exactly.
  • Run GET /index/_analyze to inspect what tokens a field produces from a given input string.

BM25 scoring

Elasticsearch uses BM25 (Best Match 25) as the default relevance algorithm. BM25 extends TF-IDF with two fixes: term frequency saturation and document length normalization.

TF-IDF score (simplified):

score = TF(t,d) × IDF(t)
TF(t,d) = count of term t in document d
IDF(t)  = log(N / df(t))   where N=total docs, df(t)=docs containing t

BM25 score:

score(t,d) = IDF(t) × (TF(t,d) × (k1 + 1)) / (TF(t,d) + k1 × (1 - b + b × (|d| / avgdl)))

Where:

  • k1 (default 1.2): term frequency saturation. With k1=1.2 and neutral length normalization, a term appearing 3× scores only ~1.6× as much as a term appearing once, and no repetition count can push the multiplier past k1 + 1. This prevents highly repetitive documents from dominating.
  • b (default 0.75): document length normalization. Long documents get penalized since they're more likely to contain any given term just by length. b=0 disables length normalization; b=1 fully normalizes.
  • |d| / avgdl: ratio of document length to average document length across the index.

Practical consequence: a 50-word document containing "database" twice scores higher than a 2000-word document containing "database" ten times, because the short document's term density is higher.
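Both effects fall straight out of the formula. A sketch using the simplified IDF from above (Lucene's production IDF adds smoothing terms) and an assumed average document length of 1,000 terms:

```python
import math

def bm25_term_score(tf, doc_len, avgdl, n_docs, df, k1=1.2, b=0.75):
    """BM25 contribution of one term to one document, per the formula above."""
    idf = math.log(n_docs / df)           # simplified IDF from the TF-IDF sketch
    norm = 1 - b + b * (doc_len / avgdl)  # document length normalization
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

# Saturation: tripling the term frequency does not triple the score.
# (doc_len == avgdl, so length normalization is neutral)
once  = bm25_term_score(tf=1, doc_len=100, avgdl=100, n_docs=1000, df=10)
three = bm25_term_score(tf=3, doc_len=100, avgdl=100, n_docs=1000, df=10)

# Length: "database" x2 in 50 words beats "database" x10 in 2000 words.
short = bm25_term_score(tf=2,  doc_len=50,   avgdl=1000, n_docs=1000, df=10)
long_ = bm25_term_score(tf=10, doc_len=2000, avgdl=1000, n_docs=1000, df=10)
```

Here `three / once` comes out to about 1.57, and `short` edges out `long_` despite five times fewer occurrences of the term, matching the practical consequence above.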

📝Block-Max WAND: how Elasticsearch avoids scoring every matching document

A match query for "elasticsearch tutorial" could match millions of documents. Scoring all of them to find the top 10 is too slow. Elasticsearch uses Block-Max WAND (Weak AND) to skip non-competitive candidates:

  1. Threshold: Elasticsearch tracks the score of the 10th-best document seen so far.
  2. Upper bound: Each term has a precomputed maximum possible score contribution stored at the block level.
  3. Skip: If the sum of upper bounds for all query terms for a document block is below the current threshold, the entire block is skipped without scoring individual documents.

For a typical news search with millions of documents and a top-10 request, WAND often scores fewer than 1% of matching documents. This is why Elasticsearch can handle complex queries on large indexes with sub-100ms latency.

The tradeoff: WAND skips documents without scoring them, so the exact total hit count is lost. Setting track_total_hits: true forces Elasticsearch to score every match and return an exact count — at significant performance cost on large indexes. The default (track_total_hits: 10000) counts exactly up to 10,000 hits, reports "gte 10000" beyond that, and keeps the WAND optimization.
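The skip logic in steps 1–3 can be illustrated with a much-simplified sketch. The data shapes here are hypothetical: real Lucene walks per-term posting lists aligned by doc ID, whereas this version pre-groups already-scored documents into blocks with precomputed per-term upper bounds.

```python
import heapq

def block_max_wand_topk(blocks, k=10):
    """Sketch of Block-Max WAND: skip whole blocks that can't reach the top k.

    blocks: list of (per_term_upper_bounds, [(doc_id, score), ...]) pairs,
    where per_term_upper_bounds holds each query term's maximum possible
    score contribution inside that block. Shapes are illustrative only.
    """
    top = []        # min-heap holding the k best scores seen so far
    scored = 0      # how many documents we actually scored
    for bounds, postings in blocks:
        threshold = top[0] if len(top) == k else float("-inf")
        if sum(bounds) <= threshold:
            continue    # best case for this block can't beat the k-th score
        for doc_id, score in postings:
            scored += 1
            if len(top) < k:
                heapq.heappush(top, score)
            elif score > top[0]:
                heapq.heapreplace(top, score)
    return sorted(top, reverse=True), scored

blocks = [
    ([3.0, 2.0], [(1, 4.0), (2, 3.5)]),   # competitive block: scored
    ([0.5, 0.4], [(3, 0.8), (4, 0.7)]),   # upper bound 0.9 < threshold: skipped
    ([5.0, 1.0], [(5, 5.5)]),             # competitive block: scored
]
top_scores, scored = block_max_wand_topk(blocks, k=2)
```

The middle block's two documents are never scored because the sum of its per-term upper bounds can't beat the current 2nd-best score; on a real index that skipping is what keeps top-10 queries fast.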

You have a tweet index where the tweet_text field uses the standard analyzer. You search for '#Elasticsearch' using a term query on tweet_text and get 0 results, even though you can see documents with that hashtag. Why?


tweet_text is a text field with the standard analyzer. The index contains documents like 'Check out #Elasticsearch for search'. The term query uses the exact string '#Elasticsearch'.

  • A. The term query is case-sensitive — try '#elasticsearch' in lowercase
    Incorrect. Case is only part of the problem. The standard analyzer also strips the # character at index time, so the indexed token is 'elasticsearch' and a term query for '#elasticsearch' still finds nothing.
  • B. The standard analyzer strips the # character and lowercases 'Elasticsearch', so the indexed token is 'elasticsearch'. The term query '#Elasticsearch' finds nothing because the index contains 'elasticsearch', not '#Elasticsearch'.
    Correct! The standard analyzer tokenizes text by splitting on punctuation and whitespace, strips special characters like #, and lowercases all tokens; '#Elasticsearch' becomes the token 'elasticsearch' in the index. A term query for '#Elasticsearch' looks for that exact string in the index and finds nothing. Fix options: (1) use a keyword field (or a .keyword subfield) for exact hashtag matching, (2) use a match query, which applies the same analyzer to the query string and would search for 'elasticsearch', or (3) use a custom analyzer that preserves # characters.
  • C. term queries don't work on text fields — use match instead
    Incorrect. term queries are syntactically valid on text fields but behave unexpectedly because text fields are analyzed. The real problem is that the query string '#Elasticsearch' doesn't match the analyzed token 'elasticsearch', not that term queries are prohibited on text fields.
  • D. The document wasn't indexed yet — Elasticsearch has eventual consistency and the document may not be searchable immediately
    Incorrect. Elasticsearch does have a refresh interval (default 1 second) before new documents become searchable, but the question states you can already see the documents. The issue is a query-type and analysis mismatch, not indexing latency.

Hint: What does the standard analyzer do to '#Elasticsearch' at index time?