Inverted_index

最后更新于:2022-04-01 00:39:27

# 反向索引 Elasticsearch uses a structure called an _inverted index_ which is designedto allow very fast full text searches. An inverted index consists of a listof all the unique words that appear in any document, and for each word, a listof the documents in which it appears. For example, let's say we have two documents, each with a `content` fieldcontaining: 1. ``The quick brown fox jumped over the lazy dog'' 1. ``Quick brown foxes leap over lazy dogs in summer'' To create an inverted index, we first split the `content` field of eachdocument into separate words (which we call _terms_ or _tokens_), create asorted list of all the unique terms, then list in which document each termappears. The result looks something like this: ~~~ Term Doc_1 Doc_2-------------------------Quick | | XThe | X |brown | X | Xdog | X |dogs | | Xfox | X |foxes | | Xin | | Xjumped | X |lazy | X | Xleap | | Xover | X | Xquick | X |summer | | Xthe | X |------------------------ ~~~ Now, if we want to search for `"quick brown"` we just need to find thedocuments in which each term appears: ~~~ Term Doc_1 Doc_2-------------------------brown | X | Xquick | X |------------------------Total | 2 | 1 ~~~ Both documents match, but the first document has more matches than the second.If we apply a naive _similarity algorithm_ which just counts the number ofmatching terms, then we can say that the first document is a better match --is _more relevant_ to our query -- than the second document. But there are a few problems with our current inverted index: 1. `"Quick"` and `"quick"` appear as separate terms, while the user probablythinks of them as the same word. 1. `"fox"` and `"foxes"` are pretty similar, as are `"dog"` and `"dogs"`-- they share the same root word. 1. `"jumped"` and `"leap"`, while not from the same root word, are similarin meaning -- they are synonyms. With the above index, a search for `"+Quick +fox"` wouldn't match anydocuments. (Remember, a preceding `+` means that the word must be present).Both the term `"Quick"` and the term `"fox"` have to be in the same documentin order to satisfy the query, but the first doc contains `"quick fox"` andthe second doc contains `"Quick foxes"`. Our user could reasonably expect both documents to match the query. We can dobetter. If we normalize the terms into a standard format, then we can find documentsthat contain terms that are not exactly the same as the user requested, butare similar enough to still be relevant. For instance: 1. `"Quick"` can be lowercased to become `"quick"`. 1. `"foxes"` can be _stemmed_ -- reduced to its root form -- tobecome `"fox"`. Similarly `"dogs"` could be stemmed to `"dog"`. 1. `"jumped"` and `"leap"` are synonyms and can be indexed as just thesingle term `"jump"`. Now the index looks like this: ~~~ Term Doc_1 Doc_2-------------------------brown | X | Xdog | X | Xfox | X | Xin | | Xjump | X | Xlazy | X | Xover | X | Xquick | X | Xsummer | | Xthe | X | X------------------------ ~~~ But we're not there yet. Our search for `"+Quick +fox"` would _still_ fail,because we no longer have the exact term `"Quick"` in our index. However, ifwe apply the same normalization rules that we used on the `content` field toour query string, it would become a query for `"+quick +fox"`, which wouldmatch both documents! IMPORTANT: This is very important. You can only find terms that actually exist in yourindex, so: _both the indexed text and and query string must be normalizedinto the same form_. This process of tokenization and normalization is called _analysis_, which wediscuss in the next section.
';