- Published on
Full-Text Search in Elasticsearch
- Authors
- Name
- Bowen Y
What is full-text search?
Full-text search refers specifically to searching for text within content in a way that understands the nuances of human language, like handling synonyms, stemming, and relevance scoring.
Inverted Index
- When documents are indexed, the search engine processes the text and creates a list of all the unique words (terms) present in the documents.
- For each term, the engine maintains a list of documents (often with specific positions within those documents) where that term appears. This list is known as a "posting list."
How the inverted index Works in Full-Text Search?
- Query Processing
- When a user submits a search query, the search engine processes the query terms in the same way as it processes document text during indexing.
- Each term in the query is looked up in the inverted index.
- Retrieving Relevant Documents
- The search engine retrieves the posting lists for each term in the query.
- If the query has multiple terms, the search engine combines these lists to find documents that contain all (or some) of the terms, depending on the type of search (e.g., boolean, phrase search).
- Ranking
- The search engine then ranks these documents based on various factors, including how often and where the terms appear in each document, the length of the documents, and the rarity of the terms across all documents.
- The ranked results are then presented to the user.
How the Term Frequency(TF) is used in full-text search?
Step 1: Finding Related Documents
- Inverted Index Lookup: When you perform a full-text search, Elasticsearch first uses the inverted index to quickly identify all the documents that contain the query terms. At this stage, the primary goal is to find all potentially relevant documents, without immediately considering their relative relevance.
- Term Presence: The focus here is on whether the terms are present in the documents, not how frequently they appear. This step is crucial for efficiency, as it narrows down the search to a subset of the entire dataset.
Step 2: Ranking the Documents
- Term Frequency Involvement: Once Elasticsearch has a list of all documents that match the query terms, it then uses term frequency (along with other factors) to rank these documents in order of relevance to the query.
- Relevance Scoring: This is where term frequency becomes crucial. Documents where the query terms appear more frequently might be considered more relevant. However, Elasticsearch's relevance scoring algorithms (like TF-IDF or BM25) also take into account other factors, such as the rarity of the terms (document frequency) and the overall length of the document.
- Contextual Factors: Other contextual factors may also influence ranking, such as the use of synonyms, the proximity of query terms within the document, and any custom ranking criteria specified in the query.
Inverse Document Frequency (IDF)
The IDF is a measure of how much information a word provides, i.e., how common or rare it is across all documents in the dataset. IDF is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term. Here's the formula for IDF:
Where:
- is the total number of documents in the dataset.
- is the number of documents that contain the term .
So, if a term is very common and appears in many documents, its IDF value decreases. Conversely, if a term is rare, its IDF value increases, indicating it provides more distinguishing power.
Missing terms in the query
An inverted index in Elasticsearch (or any full-text search system) maps terms to their locations within documents. When you perform a search query, the system looks up the terms in this index to find matching documents.
If a term from your query does not appear in any document, it essentially contributes nothing to the search results for that query. Elasticsearch will not find any matches for that term in the inverted index.
The overall search results will depend on how Elasticsearch handles the query. If your query has multiple terms, the search engine might still return documents that match the other terms in your query.
The impact of a missing term also depends on how the query is structured. For instance, in a boolean query, if the missing term is a "must" condition, then no results will be returned. However, if it's a "should" condition, other terms can still influence the search results.
In summary, the absence of a term in all documents means that this particular term does not contribute to the matching process, but the overall query can still return results based on other terms and how the query logic is structured.
Match
- the standard query for full-text search
- When a match query is executed, Elasticsearch first analyzes the query string using the same analyzer that was used for the field being searched. This process breaks the text into tokens (terms).
- The query then uses the inverted index to quickly find the documents where those tokens appear. Remember, the inverted index is essentially a mapping of terms to the documents that contain them.
- For each token derived from the query text, Elasticsearch looks up the inverted index to find and retrieve the documents containing that token.
- After gathering all the relevant documents for each token, Elasticsearch then scores these documents based on relevance. The scoring might include factors like term frequency (how often a term appears in a document) and document frequency (how often the term appears across all documents), among other criteria.
Example Scenario: Searching Tweets
Imagine you have an Elasticsearch index containing tweets, and you perform a match
query to find tweets about "sunny weather".
Step 1: Analyze the Query
- Query: "sunny weather"
- Analysis: The query is analyzed (tokenized and normalized) into terms: ["sunny", "weather"].
Step 2: Use the Inverted Index
Elasticsearch uses the inverted index to find documents containing these terms.
- Inverted Index Example:
- "sunny": [Doc1, Doc3, Doc5]
- "weather": [Doc2, Doc3, Doc6]
Step 3: Gathering Documents
- For "sunny": Retrieves Doc1, Doc3, Doc5
- For "weather": Retrieves Doc2, Doc3, Doc6
Step 4: Scoring and Combining Results
- Scoring: Each document is scored based on factors like term frequency, document frequency, etc.
- Combining Results: Documents containing both terms (like Doc3) are considered more relevant.
- Final Result: Documents are ranked by relevance. In this case, Doc3 might be ranked highest because it contains both "sunny" and "weather".
Example Explanation
- Doc1: Contains "sunny" but not "weather". Scores some points for "sunny".
- Doc2: Contains "weather" but not "sunny". Scores some points for "weather".
- Doc3: Contains both "sunny" and "weather". Scores higher points as it matches the entire query.
Conclusion
The match
query doesn't simply intersect lists of documents for each term. Instead, it gathers all documents that match any of the terms and then scores them based on their match relevance. Documents that contain more query terms (or terms that are rarer or more significant in the context) are scored higher.
Note
- This example is simplified. In reality, Elasticsearch's scoring algorithm is more complex, incorporating additional factors like the proximity of terms, the overall length of the document, etc.
- Elasticsearch can handle more complex queries and ranking logic, depending on the specific needs and configurations of the search application.
match
accelerate Relevance Scoring? Will that be time-consuming if elastic search score all the documents that contains one or more terms from a match
query?
How the Scoring Algorithms Optimization:
Example: Modern scoring algorithms, like BM25 (used by Elasticsearch), are designed for performance. They balance accuracy with computational efficiency.
BM25 and TF-IDF in Elasticsearch
- BM25: BM25 is a scoring algorithm used in Elasticsearch for relevance scoring. It is an evolution of the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. BM25 improves upon TF-IDF by considering the length of documents and mitigating the impact of term frequency saturation. It provides a more sophisticated way of scoring documents based on query terms.
- TF-IDF: While TF-IDF was traditionally used in Elasticsearch and other search engines, modern versions of Elasticsearch use BM25 by default. However, TF-IDF is still conceptually important and underlies many of the principles of BM25.
Early Termination:
Example: For a query that only needs the top 10 results, Elasticsearch can stop considering documents once it has identified the 10 best matches, rather than scoring all possible documents.
- Block-Max WAND (Weak AND): This is a technique used to efficiently skip non-competitive documents. The idea is to keep a threshold score and skip scoring documents that are unlikely to exceed this threshold. By processing documents in blocks and using precomputed maximum scores for terms in each block, Elasticsearch can quickly disregard blocks of documents that don't have the potential to meet the threshold.
- Tiered Processing: Sometimes, Elasticsearch may first process a subset of the data or use a simplified scoring model to estimate which documents are likely to be most relevant, and then perform detailed scoring on this narrowed set.
Document Frequency Thresholds:
Example: Elasticsearch may skip scoring documents for very common terms if they exceed a certain frequency threshold, as these terms might not be useful for distinguishing relevant documents.
- Document Frequency Thresholds: In some configurations, Elasticsearch can use document frequency thresholds as a way to optimize query performance. This means that terms which appear in an extremely high number of documents (i.e., very common terms) might be skipped or given less weight in the scoring process.
- Relation to IDF: The concept is related to IDF in the sense that both are concerned with the rarity of a term across documents. However, a strict document frequency threshold is a simpler concept — it's a cut-off point, beyond which a term may be considered too common to be useful for distinguishing relevant documents. In contrast, IDF is a continuously varying score that decreases as the document frequency of a term increases.
How the Relevance Scoring is calcualted?
Documents in the Index
- Document 1: "The best coffee shop in town."
- Document 2: "A new coffee shop has opened."
- Document 3: "I love visiting the local bakery and coffee house."
Search Query
- Query: "coffee shop"
Steps in Relevance Scoring
Query Analysis:
- The query "coffee shop" is analyzed and broken down into terms: ["coffee", "shop"].
Term Frequency (TF):
- Document 1: Both "coffee" and "shop" appear once.
- Document 2: Both "coffee" and "shop" appear once.
- Document 3: "coffee" appears once, but "shop" does not appear.
Inverse Document Frequency (IDF):
- If "coffee" and "shop" are common in the entire document set, their IDF is lower. If they are rare, their IDF is higher. Assume both terms have a moderate IDF value.
Field-Length Norm:
- Shorter fields are given more weight. Let's assume all documents have similar lengths for simplicity.
Calculating Relevance Score:
- Document 1 and 2: High relevance because they contain both "coffee" and "shop".
- Document 3: Lower relevance as it only contains "coffee".
Example Relevance Score Calculation
Score Calculation (simplified formula):
- Score = TF(term1) * IDF(term1) + TF(term2) * IDF(term2) + ...
For Document 1 and 2:
- Higher score because both terms match and their combined TF-IDF score is higher.
For Document 3:
- Lower score because only one of the terms ("coffee") matches.
Conclusion
- Documents containing both terms of the query ("coffee" and "shop") receive higher scores.
- The exact scoring involves more complex mathematical formulas and additional factors like term proximity, but this example provides a basic understanding.
Note
- This is a simplified explanation. The actual scoring algorithm in Elasticsearch is more complex and includes various other factors and optimizations.
- Elasticsearch's relevance scoring is highly configurable and can be adjusted to suit specific use case requirements.