Inverted_index
最后更新于:2022-04-01 00:39:27
# 反向索引
Elasticsearch uses a structure called an _inverted index_ which is designedto allow very fast full text searches. An inverted index consists of a listof all the unique words that appear in any document, and for each word, a listof the documents in which it appears.
For example, let's say we have two documents, each with a `content` fieldcontaining:
1. ``The quick brown fox jumped over the lazy dog''
1. ``Quick brown foxes leap over lazy dogs in summer''
To create an inverted index, we first split the `content` field of eachdocument into separate words (which we call _terms_ or _tokens_), create asorted list of all the unique terms, then list in which document each termappears. The result looks something like this:
~~~
Term Doc_1 Doc_2-------------------------Quick | | XThe | X |brown | X | Xdog | X |dogs | | Xfox | X |foxes | | Xin | | Xjumped | X |lazy | X | Xleap | | Xover | X | Xquick | X |summer | | Xthe | X |------------------------
~~~
Now, if we want to search for `"quick brown"` we just need to find thedocuments in which each term appears:
~~~
Term Doc_1 Doc_2-------------------------brown | X | Xquick | X |------------------------Total | 2 | 1
~~~
Both documents match, but the first document has more matches than the second.If we apply a naive _similarity algorithm_ which just counts the number ofmatching terms, then we can say that the first document is a better match --is _more relevant_ to our query -- than the second document.
But there are a few problems with our current inverted index:
1.
`"Quick"` and `"quick"` appear as separate terms, while the user probablythinks of them as the same word.
1.
`"fox"` and `"foxes"` are pretty similar, as are `"dog"` and `"dogs"`-- they share the same root word.
1.
`"jumped"` and `"leap"`, while not from the same root word, are similarin meaning -- they are synonyms.
With the above index, a search for `"+Quick +fox"` wouldn't match anydocuments. (Remember, a preceding `+` means that the word must be present).Both the term `"Quick"` and the term `"fox"` have to be in the same documentin order to satisfy the query, but the first doc contains `"quick fox"` andthe second doc contains `"Quick foxes"`.
Our user could reasonably expect both documents to match the query. We can dobetter.
If we normalize the terms into a standard format, then we can find documentsthat contain terms that are not exactly the same as the user requested, butare similar enough to still be relevant. For instance:
1.
`"Quick"` can be lowercased to become `"quick"`.
1.
`"foxes"` can be _stemmed_ -- reduced to its root form -- tobecome `"fox"`. Similarly `"dogs"` could be stemmed to `"dog"`.
1.
`"jumped"` and `"leap"` are synonyms and can be indexed as just thesingle term `"jump"`.
Now the index looks like this:
~~~
Term Doc_1 Doc_2-------------------------brown | X | Xdog | X | Xfox | X | Xin | | Xjump | X | Xlazy | X | Xover | X | Xquick | X | Xsummer | | Xthe | X | X------------------------
~~~
But we're not there yet. Our search for `"+Quick +fox"` would _still_ fail,because we no longer have the exact term `"Quick"` in our index. However, ifwe apply the same normalization rules that we used on the `content` field toour query string, it would become a query for `"+quick +fox"`, which wouldmatch both documents!
IMPORTANT: This is very important. You can only find terms that actually exist in yourindex, so: _both the indexed text and and query string must be normalizedinto the same form_.
This process of tokenization and normalization is called _analysis_, which wediscuss in the next section.