Complex datatypes


# Complex core field types

Besides the simple scalar datatypes that we mentioned above, JSON also has `null` values, arrays and objects, all of which are supported by Elasticsearch.

### Multi-value fields

It is quite possible that we want our `tag` field to contain more than one tag. Instead of a single string, we could index an array of tags:

~~~
{ "tag": [ "search", "nosql" ]}
~~~

There is no special mapping required for arrays. Any field can contain zero, one or more values, in the same way as a full text field is analyzed to produce multiple terms.

By implication, this means that _all of the values of an array must be of the same datatype_. You can't mix dates with strings. If you create a new field by indexing an array, Elasticsearch will use the datatype of the first value in the array to determine the `type` of the new field.

The elements inside an array are not ordered. You cannot refer to "the first element" or "the last element". Rather, think of an array as a _bag of values_.

### Empty fields

Arrays can, of course, be empty. This is the equivalent of having zero values. In fact, there is no way of storing a `null` value in Lucene, so a field with a `null` value is also considered to be an empty field.

These four fields would all be considered to be empty, and would not be indexed:

~~~
"empty_string":          "",
"null_value":            null,
"empty_array":           [],
"array_with_null_value": [ null ]
~~~

### Multi-level objects

The last native JSON datatype that we need to discuss is the _object_ -- known in other languages as a hash, hashmap, dictionary or associative array.

_Inner objects_ are often used to embed one entity or object inside another. For instance, instead of having fields called `user_name` and `user_id` inside our `tweet` document, we could write it as:

~~~
{
    "tweet":            "Elasticsearch is very flexible",
    "user": {
        "id":           "@johnsmith",
        "gender":       "male",
        "age":          26,
        "name": {
            "full":     "John Smith",
            "first":    "John",
            "last":     "Smith"
        }
    }
}
~~~

### Mapping for inner objects

Elasticsearch will detect new object fields dynamically and map them as type `object`, with each inner field listed under `properties`:

~~~
{
    "gb": {
        "tweet": {                                 <1>
            "properties": {
                "tweet":            { "type": "string" },
                "user": {                          <2>
                    "type":             "object",
                    "properties": {
                        "id":           { "type": "string" },
                        "gender":       { "type": "string" },
                        "age":          { "type": "long"   },
                        "name": {                  <2>
                            "type":         "object",
                            "properties": {
                                "full":     { "type": "string" },
                                "first":    { "type": "string" },
                                "last":     { "type": "string" }
                            }
                        }
                    }
                }
            }
        }
    }
}
~~~

<1> Root object.
<2> Inner objects.

The mappings for the `user` and `name` fields have a similar structure to the mapping for the `tweet` type itself. In fact, the `type` mapping is just a special type of `object` mapping, which we refer to as the _root object_. It is just the same as any other object, except that it has some special top-level fields for document metadata, like `_source`, the `_all` field etc.

### How inner objects are indexed

Lucene doesn't understand inner objects. A Lucene document consists of a flat list of key-value pairs. In order to index inner objects usefully, Elasticsearch converts our document into something like this:

~~~
{
    "tweet":            [elasticsearch, flexible, very],
    "user.id":          [@johnsmith],
    "user.gender":      [male],
    "user.age":         [26],
    "user.name.full":   [john, smith],
    "user.name.first":  [john],
    "user.name.last":   [smith]
}
~~~

_Inner fields_ can be referred to by name, eg `"first"`. To distinguish between two fields that have the same name, we can use the full _path_, eg `"user.name.first"`, or even the `type` name plus the path: `"tweet.user.name.first"`.

NOTE: In the simple flattened document above, there is no field called `user` and no field called `user.name`. Lucene only indexes scalar or simple values, not complex datastructures.

### Arrays of inner objects

Finally, consider how an array containing inner objects would be indexed. Let's say we have a `followers` array which looks like this:

~~~
{
    "followers": [
        { "age": 35, "name": "Mary White"},
        { "age": 26, "name": "Alex Jones"},
        { "age": 19, "name": "Lisa Smith"}
    ]
}
~~~

This document will be flattened as we described above, but the result will look like this:

~~~
{
    "followers.age":    [19, 26, 35],
    "followers.name":   [alex, jones, lisa, smith, mary, white]
}
~~~

The correlation between `{age: 35}` and `{name: Mary White}` has been lost, as each multi-value field is just a bag of values, not an ordered array. This is sufficient for us to ask:

- _Is there a follower who is 26 years old?_

but we can't get an accurate answer to:

- _Is there a follower who is 26 years old **and who is called Alex Jones?**_

Correlated inner objects, which are able to answer queries like these, are called _nested_ objects, and we will discuss them later on in <>.
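Putting the flattened paths above to work: here is a quick sketch of our own (not from the original text) that queries an inner field by its full path, using the lightweight query string search covered earlier:

~~~
GET /gb/tweet/_search?q=user.name.first:john
~~~

Because `user.name.first` was analyzed to the term `john`, this should match the example tweet shown above.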

Mapping


# Mapping

As explained in <>, each document in an index has a _type_. Every type has its own _mapping_ or _schema definition_. A mapping defines the fields within a type, the datatype for each field, and how the field should be handled by Elasticsearch. A mapping is also used to configure metadata associated with the type.

We discuss mappings in detail in <>. In this section we're going to look at just enough to get you started.

### Core simple field types

Elasticsearch supports the following simple field types:

| Type | Field types |
|-----|-----|
| String | `string` |
| Whole number | `byte`, `short`, `integer`, `long` |
| Floating point | `float`, `double` |
| Boolean | `boolean` |
| Date | `date` |

When you index a document which contains a new field -- one previously not seen -- Elasticsearch will use <> to try to guess the field type from the basic datatypes available in JSON, using the following rules:

| JSON type | Field type |
|-----|-----|
| Boolean: `true` or `false` | `"boolean"` |
| Whole number: `123` | `"long"` |
| Floating point: `123.45` | `"double"` |
| String, valid date: `"2014-09-15"` | `"date"` |
| String: `"foo bar"` | `"string"` |

NOTE: This means that if you index a number in quotes -- `"123"` -- it will be mapped as type `"string"`, not type `"long"`. However, if the field is already mapped as type `"long"`, then Elasticsearch will try to convert the string into a long, and throw an exception if it can't.

### Viewing the mapping

We can view the mapping that Elasticsearch has for one or more types in one or more indices using the `/_mapping` endpoint. At the <> we already retrieved the mapping for type `tweet` in index `gb`:

~~~
GET /gb/_mapping/tweet
~~~

This shows us the mapping for the fields (called _properties_) that Elasticsearch generated dynamically from the documents that we indexed:

~~~
{
   "gb": {
      "mappings": {
         "tweet": {
            "properties": {
               "date":    { "type": "date", "format": "dateOptionalTime" },
               "name":    { "type": "string" },
               "tweet":   { "type": "string" },
               "user_id": { "type": "long" }
            }
         }
      }
   }
}
~~~

> ### TIP
>
> Incorrect mappings, such as having an `age` field mapped as type `string` instead of `integer`, can produce confusing results for your queries. Instead of assuming that your mapping is correct, check it!

### Customizing field mappings

The most important attribute of a field is the `type`. For fields other than `string` fields, you will seldom need to map anything other than `type`:

~~~
{
    "number_of_clicks": {
        "type": "integer"
    }
}
~~~

Fields of type `"string"` are, by default, considered to contain full text. That is, their value will be passed through an analyzer before being indexed, and a full text query on the field will pass the query string through an analyzer before searching.

The two most important mapping attributes for `string` fields are `index` and `analyzer`.

#### `index`

The `index` attribute controls how the string will be indexed. It can contain one of three values:

| Value | Meaning |
|-----|-----|
| `analyzed` | First analyze the string, then index it. In other words, index this field as full text. |
| `not_analyzed` | Index this field, so it is searchable, but index the value exactly as specified. Do not analyze it. |
| `no` | Don't index this field at all. This field will not be searchable. |

The default value of `index` for a `string` field is `analyzed`. If we want to map the field as an exact value, then we need to set it to `not_analyzed`:

~~~
{
    "tag": {
        "type":     "string",
        "index":    "not_analyzed"
    }
}
~~~

The other simple types -- `long`, `double`, `date` etc -- also accept the `index` parameter, but the only relevant values are `no` and `not_analyzed`, as their values are never analyzed.

#### `analyzer`

For `analyzed` string fields, use the `analyzer` attribute to specify which analyzer to apply both at search time and at index time. By default, Elasticsearch uses the `standard` analyzer, but you can change this by specifying one of the built-in analyzers, such as `whitespace`, `simple`, or `english`:

~~~
{
    "tweet": {
        "type":     "string",
        "analyzer": "english"
    }
}
~~~

In <> we will show you how to define and use custom analyzers as well.

### Updating a mapping

You can specify the mapping for a type when you first create an index. Alternatively, you can add the mapping for a new type (or update the mapping for an existing type) later, using the `/_mapping` endpoint.

> ### IMPORTANT
>
> While you can _add_ to an existing mapping, you can't _change_ it. If a field already exists in the mapping, then it probably means that data from that field has already been indexed. If you were to change the field mapping, then the already indexed data would be wrong and would not be properly searchable.

We can update a mapping to add a new field, but we can't change an existing field from `analyzed` to `not_analyzed`.

To demonstrate both ways of specifying mappings, let's first delete the `gb` index:

~~~
DELETE /gb
~~~

Then create a new index, specifying that the `tweet` field should use the `english` analyzer:

~~~
PUT /gb <1>
{
  "mappings": {
    "tweet" : {
      "properties" : {
        "tweet" : {
          "type" :    "string",
          "analyzer": "english"
        },
        "date" : {
          "type" :    "date"
        },
        "name" : {
          "type" :    "string"
        },
        "user_id" : {
          "type" :    "long"
        }
      }
    }
  }
}
~~~

<1> This creates the index with the `mappings` specified in the body.

Later on, we decide to add a new `not_analyzed` text field called `tag` to the `tweet` mapping, using the `_mapping` endpoint:

~~~
PUT /gb/_mapping/tweet
{
  "properties" : {
    "tag" : {
      "type" :    "string",
      "index":    "not_analyzed"
    }
  }
}
~~~

Note that we didn't need to list all of the existing fields again, as we can't change them anyway. Our new field has been merged into the existing mapping.

### Testing the mapping

You can use the `analyze` API to test the mapping for string fields by name. Compare the output of these two requests:

~~~
GET /gb/_analyze?field=tweet
Black-cats <1>

GET /gb/_analyze?field=tag
Black-cats <1>
~~~

<1> The text we want to analyze is passed in the body.

The `tweet` field produces the two terms `"black"` and `"cat"`, while the `tag` field produces the single term `"Black-cats"`. In other words, our mapping is working correctly.
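One more illustration of the "you can't change existing mappings" rule (this request is our own sketch, not from the original text): trying to switch the existing `tweet` field from the `english` analyzer to the `standard` analyzer should be rejected with a merge-mapping conflict, whereas adding the new `tag` field above succeeded:

~~~
PUT /gb/_mapping/tweet
{
  "properties" : {
    "tweet" : {
      "type" :    "string",
      "analyzer": "standard"
    }
  }
}
~~~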

Analysis


# Analysis and analyzers

_Analysis_ is the process of:

- first, tokenizing a block of text into individual _terms_ suitable for use in an inverted index,
- then normalizing these terms into a standard form to improve their "searchability" or _recall_.

This job is performed by _analyzers_. An _analyzer_ is really just a wrapper which combines three functions into a single package:

**Character filters**

First, the string is passed through any _character filters_ in turn. Their job is to tidy up the string before tokenization. A character filter could be used to strip out HTML, or to convert `"&"` characters to the word `"and"`.

**Tokenizer**

Next, the string is tokenized into individual terms by a _tokenizer_. A simple tokenizer might split the text up into terms whenever it encounters whitespace or punctuation.

**Token filters**

Last, each term is passed through any _token filters_ in turn, which can change terms (eg lowercasing `"Quick"`), remove terms (eg stopwords like `"a"`, `"and"`, `"the"` etc) or add terms (eg synonyms like `"jump"` and `"leap"`).

Elasticsearch provides many character filters, tokenizers and token filters out of the box. These can be combined to create custom analyzers suitable for different purposes. We will discuss these in detail in <>.

### Built-in analyzers

However, Elasticsearch also ships with a number of pre-packaged analyzers that you can use directly. We list the most important ones below and, to demonstrate the difference in behaviour, we show what terms each would produce from this string:

~~~
"Set the shape to semi-transparent by calling set_trans(5)"
~~~

**Standard analyzer**

The standard analyzer is the default analyzer that Elasticsearch uses. It is the best general choice for analyzing text which may be in any language. It splits the text on _word boundaries_, as defined by the [Unicode Consortium](http://www.unicode.org/reports/tr29/), and removes most punctuation. Finally, it lowercases all terms. It would produce:

~~~
set, the, shape, to, semi, transparent, by, calling, set_trans, 5
~~~

**Simple analyzer**

The simple analyzer splits the text on anything that isn't a letter, and lowercases the terms. It would produce:

~~~
set, the, shape, to, semi, transparent, by, calling, set, trans
~~~

**Whitespace analyzer**

The whitespace analyzer splits the text on whitespace. It doesn't lowercase. It would produce:

~~~
Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
~~~

**Language analyzers**

Language-specific analyzers are available for many languages. They are able to take the peculiarities of the specified language into account. For instance, the `english` analyzer comes with a set of English stopwords -- common words like `and` or `the` which don't have much impact on relevance -- which it removes, and it is able to _stem_ English words because it understands the rules of English grammar.

The `english` analyzer would produce the following:

~~~
set, shape, semi, transpar, call, set_tran, 5
~~~

Note how `"transparent"`, `"calling"`, and `"set_trans"` have been stemmed to their root form.

### When analyzers are used

When we _index_ a document, its full text fields are analyzed into terms which are used to create the inverted index. However, when we _search_ on a full text field, we need to pass the query string through the _same analysis process_, to ensure that we are searching for terms in the same form as those that exist in the index.

Full text queries, which we will discuss later, understand how each field is defined, and so they can do the right thing:

- When you query a _full text_ field, the query will apply the same analyzer to the query string to produce the correct list of terms to search for.
- When you query an _exact value_ field, the query will not analyze the query string, but instead will search for the exact value that you have specified.

Now you can understand why the queries that we demonstrated at the <> return what they do:

- The `date` field contains an exact value: the single term `"2014-09-15"`.
- The `_all` field is a full text field, so the analysis process has converted the date into the three terms `"2014"`, `"09"` and `"15"`.

When we query the `_all` field for `2014`, it matches all 12 tweets, because all of them contain the term `2014`:

~~~
GET /_search?q=2014              # 12 results
~~~

When we query the `_all` field for `2014-09-15`, it first analyzes the query string to produce a query which matches _any_ of the terms `2014`, `09` or `15`. This also matches all 12 tweets, because all of them contain the term `2014`:

~~~
GET /_search?q=2014-09-15        # 12 results !
~~~

When we query the `date` field for `2014-09-15`, it looks for that _exact_ date, and finds one tweet only:

~~~
GET /_search?q=date:2014-09-15   # 1  result
~~~

When we query the `date` field for `2014`, it finds no documents, because none contain that exact date:

~~~
GET /_search?q=date:2014         # 0  results !
~~~

### Testing analyzers

Especially when you are new to Elasticsearch, it is sometimes difficult to understand what is actually being tokenized and stored in your index. To better understand what is going on, you can use the `analyze` API to see how text is analyzed. Specify which analyzer to use in the query string parameters, and the text to analyze in the body:

~~~
GET /_analyze?analyzer=standard
Text to analyze
~~~

Each element in the result represents a single term:

~~~
{
   "tokens": [
      {
         "token":        "text",
         "start_offset": 0,
         "end_offset":   4,
         "type":         "<ALPHANUM>",
         "position":     1
      },
      {
         "token":        "to",
         "start_offset": 5,
         "end_offset":   7,
         "type":         "<ALPHANUM>",
         "position":     2
      },
      {
         "token":        "analyze",
         "start_offset": 8,
         "end_offset":   15,
         "type":         "<ALPHANUM>",
         "position":     3
      }
   ]
}
~~~

The `token` is the actual term that will be stored in the index. The `position` indicates the order in which the terms appeared in the original text. The `start_offset` and `end_offset` indicate the character positions that the original word occupied in the original string.

The `analyze` API is a really useful tool for understanding what is happening inside Elasticsearch indices, and we will talk more about it as we progress.

### Specifying analyzers

When Elasticsearch detects a new string field in your documents, it automatically configures it as a full text `string` field and analyzes it with the `standard` analyzer.

You don't always want this. Perhaps you want to apply a different analyzer which suits the language your data is in. And sometimes you want a string field to be just a string field -- to index the exact value that you pass in, without any analysis, such as a string user ID or an internal status field or tag.

In order to achieve this, we have to configure these fields manually by specifying the _mapping_.
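For instance, here is a minimal sketch of such a manual mapping (the `status` field is a hypothetical example of ours, in the spirit of the internal status field mentioned above); the mapping section covers this in depth:

~~~
{
    "status": {
        "type":  "string",
        "index": "not_analyzed"
    }
}
~~~

With `"index": "not_analyzed"`, the exact value is indexed as a single term, so only searches for that exact value will match.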

Inverted index


# Inverted index

Elasticsearch uses a structure called an _inverted index_ which is designed to allow very fast full text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.

For example, let's say we have two documents, each with a `content` field containing:

1. "The quick brown fox jumped over the lazy dog"
2. "Quick brown foxes leap over lazy dogs in summer"

To create an inverted index, we first split the `content` field of each document into separate words (which we call _terms_ or _tokens_), create a sorted list of all the unique terms, then list in which document each term appears. The result looks something like this:

~~~
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------
~~~

Now, if we want to search for `"quick brown"`, we just need to find the documents in which each term appears:

~~~
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1
~~~

Both documents match, but the first document has more matches than the second. If we apply a naive _similarity algorithm_ which just counts the number of matching terms, then we can say that the first document is a better match -- is _more relevant_ to our query -- than the second document.

But there are a few problems with our current inverted index:

1. `"Quick"` and `"quick"` appear as separate terms, while the user probably thinks of them as the same word.
2. `"fox"` and `"foxes"` are pretty similar, as are `"dog"` and `"dogs"` -- they share the same root word.
3. `"jumped"` and `"leap"`, while not from the same root word, are similar in meaning -- they are synonyms.

With the above index, a search for `"+Quick +fox"` wouldn't match any documents. (Remember, a preceding `+` means that the word must be present.) Both the term `"Quick"` and the term `"fox"` have to be in the same document in order to satisfy the query, but the first doc contains `"quick fox"` and the second doc contains `"Quick foxes"`.

Our user could reasonably expect both documents to match the query. We can do better.

If we normalize the terms into a standard format, then we can find documents that contain terms that are not exactly the same as the user requested, but are similar enough to still be relevant. For instance:

1. `"Quick"` can be lowercased to become `"quick"`.
2. `"foxes"` can be _stemmed_ -- reduced to its root form -- to become `"fox"`. Similarly, `"dogs"` could be stemmed to `"dog"`.
3. `"jumped"` and `"leap"` are synonyms and can be indexed as just the single term `"jump"`.

Now the index looks like this:

~~~
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |  X
------------------------
~~~

But we're not there yet. Our search for `"+Quick +fox"` would _still_ fail, because we no longer have the exact term `"Quick"` in our index. However, if we apply the same normalization rules that we used on the `content` field to our query string, it would become a query for `"+quick +fox"`, which would match both documents!

IMPORTANT: This is very important. You can only find terms that actually exist in your index, so _both the indexed text and the query string must be normalized into the same form_.

This process of tokenization and normalization is called _analysis_, which we discuss in the next section.
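As a small taste of that process, here is a sketch of our own using the `analyze` API (described in the analysis section) to apply the `english` analyzer's normalization rules; the exact output can vary by version, but it lowercases and stems roughly as described above:

~~~
GET /_analyze?analyzer=english
Quick brown foxes jumped
~~~

which should produce terms along the lines of `quick`, `brown`, `fox` and `jump`.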

Exact values vs full text


# Exact values vs full text

Data in Elasticsearch can be broadly divided into two types: _exact values_ and _full text_.

Exact values are exactly what they sound like. Examples would be a date or a user ID, but they can also include exact strings like a username or an email address. The exact value `"Foo"` is not the same as the exact value `"foo"`. The exact value `2014` is not the same as the exact value `2014-09-15`.

Full text, on the other hand, refers to textual data -- usually written in some human language -- like the text of a tweet or the body of an email.

Full text is often referred to as "unstructured data", which is a misnomer -- natural language is highly structured. The problem is that the rules of natural languages are complex, which makes them difficult for computers to parse correctly. For instance, consider this sentence:

~~~
May is fun but June bores me.
~~~

Does it refer to months or to people?

Exact values are easy to query. The decision is binary -- a value either matches the query, or it doesn't. This kind of query is easy to express with SQL:

~~~
WHERE name    = "John Smith"
  AND user_id = 2
  AND date    > "2014-09-15"
~~~

Querying full text data is much more subtle. We are not just asking "Does this document match the query?" but "How _well_ does this document match the query?" In other words, how _relevant_ is this document to the given query?

We seldom want to match the whole of a full text field exactly. Instead, we want to search _within_ text fields. Not only that, but we expect search to understand our _intent_:

- a search for `"UK"` should also return documents mentioning the `"United Kingdom"`
- a search for `"jump"` should also match `"jumped"`, `"jumps"`, `"jumping"` and perhaps even `"leap"`
- `"johnny walker"` should match `"Johnnie Walker"` and `"johnnie depp"` should match `"Johnny Depp"`
- `"fox news hunting"` should return stories about hunting on Fox News, while `"fox hunting news"` should return news stories about fox hunting.

In order to facilitate these types of queries on full text fields, Elasticsearch first _analyzes_ the text, then uses the results to build an _inverted index_. We will discuss the inverted index and the analysis process in the next two sections.

Mapping and analysis


# Mapping and analysis

While playing around with the data in our index, we notice something odd: something seems to be broken. We have 12 tweets in our index, and only one of them contains the date `2014-09-15`, but have a look at the total hits returned by the following queries:

~~~
GET /_search?q=2014              # 12 results
GET /_search?q=2014-09-15        # 12 results !
GET /_search?q=date:2014-09-15   # 1  result
GET /_search?q=date:2014         # 0  results !
~~~

Why does querying the `_all` field for just the year return all of the tweets, while querying the `date` field for the year returns nothing at all? Why do the two fields behave differently?

Presumably this is because our data has been indexed differently in the `_all` field and in the `date` field. Let's look at how Elasticsearch has interpreted our document structure by requesting the _mapping_ for the `tweet` type in the `gb` index:

~~~
GET /gb/_mapping/tweet
~~~

This gives us:

~~~
{
   "gb": {
      "mappings": {
         "tweet": {
            "properties": {
               "date":    { "type": "date", "format": "dateOptionalTime" },
               "name":    { "type": "string" },
               "tweet":   { "type": "string" },
               "user_id": { "type": "long" }
            }
         }
      }
   }
}
~~~

Elasticsearch has dynamically guessed the field types and generated a mapping for us. The response tells us that the `date` field has been recognized as a field of type `date`. The `_all` field isn't mentioned because it is a default field, but we know that the `_all` field is of type `string`.

So fields of type `date` and fields of type `string` are indexed differently, and can thus be searched differently. That's not entirely surprising. You might expect that each of the core data types -- strings, numbers, booleans and dates -- might be indexed slightly differently. And this is true: there are slight differences.

But by far the biggest difference is actually between fields that represent _exact values_ (which can include `string` fields) and fields that represent _full text_. This distinction is really important -- it's the thing that separates a search engine from all other databases.

Search lite


# Search _lite_

There are two forms of the search API: a "lite" _query string_ version that passes the whole query as a URL parameter, and a full _request body_ version that expresses rich queries as JSON, using a search language known as the query DSL.

The query string search is very handy for running point-and-shoot queries from the command line. For instance, this query finds all documents of type `tweet` whose `tweet` field contains `"elasticsearch"`:

~~~
GET /_all/tweet/_search?q=tweet:elasticsearch
~~~

The next query looks for documents whose `name` field contains `"john"` and whose `tweet` field contains `"mary"`. The actual query is:

~~~
+name:john +tweet:mary
~~~

but _percent encoding_ makes it look somewhat cryptic:

~~~
GET /_search?q=%2Bname%3Ajohn+%2Btweet%3Amary
~~~

The `+` prefix indicates a condition that **must** be satisfied for a document to match, while the `-` prefix indicates a condition that **must not** match (a `-` example is sketched at the end of this section). Conditions without a `+` or `-` are optional -- the more conditions that match, the more relevant the document.

### The `_all` field

This simple search returns all documents which contain the word `"mary"`:

~~~
GET /_search?q=mary
~~~

In the previous examples, we searched for words in the `tweet` or `name` fields. However, the results of this query mention `"mary"` in three different fields:

- a user whose name is "Mary"
- 6 tweets sent by "Mary"
- 1 tweet directed at "@mary"

How has Elasticsearch managed to find results in three different fields?

When we index a document, Elasticsearch concatenates the values of all of its fields into one big string, which it indexes as the special `_all` field:

~~~
{
    "tweet":    "However did I manage before Elasticsearch?",
    "date":     "2014-09-14",
    "name":     "Mary Jones",
    "user_id":  1
}
~~~

It is as if we had added an extra field called `_all` with this value:

~~~
"However did I manage before Elasticsearch? 2014-09-14 Mary Jones 1"
~~~

The query string search uses the `_all` field unless a field name has been specified.

TIP: The `_all` field is likely to be useful while you are getting a new application off the ground. Over time, you will probably find yourself specifying individual fields in your requests instead. When the `_all` field is no longer of any use to you, you can switch it off. We explain how in the later section on the `_all` field.

### More complicated queries

The next query searches for:

- the `name` field containing `"mary"` or `"john"`
- the `date` field greater than `2014-09-10`
- the `_all` field containing `"aggregations"` or `"geo"`

~~~
+name:(mary john) +date:>2014-09-10 +(aggregations geo)
~~~

The final percent-encoded query string may not be very readable:

~~~
?q=%2Bname%3A(mary+john)+%2Bdate%3A%3E2014-09-10+%2B(aggregations+geo)
~~~

As you can see, this _lite_ query string syntax is surprisingly powerful. It is explained in detail in the [query string syntax](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current//query-dsl-query-string-query.html#query-string-syntax) reference, and it can make us very productive during development.

However, you will also find that its terseness makes it hard to read and hard to debug, and it is fragile: a misplaced `-`, `:`, `/` or `"` in the query string will return an error instead of results.

Finally, the query string syntax allows any user to run potentially expensive queries against any field in the index, which might return private information, or might even bring your cluster to its knees with heavy computations!

> ### TIP
>
> For these reasons, we don't recommend exposing query string searches directly to your users, unless they are trusted users with permission to access the data and the cluster.

At the same time, query string searches are frequently used in practice. Before we learn more about search, let's first look at how they work.
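As promised above, a sketch of the `-` prefix (this particular query is our own illustration, not from the original text): to find documents whose `name` field contains `"mary"` but whose `tweet` field does not mention `"elasticsearch"`, the raw query `+name:mary -tweet:elasticsearch` would be percent-encoded as:

~~~
GET /_search?q=%2Bname%3Amary+-tweet%3Aelasticsearch
~~~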

Pagination


# Pagination

In the section on the empty search, the results told us that 14 documents in the cluster matched our (empty) query, but there were only 10 documents in the `hits` array. How can we see the others?

In the same way as SQL uses `LIMIT` to control the size of a single "page" of results, Elasticsearch uses the `from` and `size` parameters:

| Parameter | Meaning |
|-----|-----|
| `size` | How many results to return per request; defaults to `10` |
| `from` | How many of the initial results to skip; defaults to `0` |

If we wanted to show five results per page, then pages 1 to 3 could be requested as:

~~~
GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10
~~~

Beware of requesting too many results at once, or of paging too deep. Results are sorted before being returned. A request usually spans multiple shards, each of which generates its own sorted results, which then have to be collated centrally to ensure that the overall order is correct.

> ### Deep paging in distributed systems
>
> To understand why deep paging is problematic, imagine that we are searching in an index with five primary shards. When we ask for the first page of results, each shard produces its own top 10 and returns them to the _requesting node_, which then sorts all 50 results to select the overall top 10.
>
> Now imagine that we ask for page 1,000 -- results 10,001 to 10,010. Everything works in the same way, except that each shard now has to produce its own top 10,010 results. The requesting node then sorts through all 50,050 results and discards 50,040 of them!
>
> You can see that, in a distributed system, the cost of a request grows exponentially the deeper we page. There is a good reason that web search engines don't return more than 1,000 results for any query.

> ### TIP
>
> In the chapter on reindexing we explain how to retrieve large numbers of documents efficiently.
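As a general rule of thumb for the parameters above (our own worked example, not from the original text): to fetch page `n` with `size` results per page, set `from = (n - 1) * size`. Page 4 at five results per page is therefore:

~~~
GET /_search?size=5&from=15
~~~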

Multi-index, multi-type


# Multi-index, multi-type

Did you notice that the results in the section on the empty search contained documents of different types -- `user` and `tweet` -- from two different indices, `us` and `gb`?

When we don't specify a particular index or type, we search **all** documents in the cluster. Elasticsearch forwards the search request to a primary or replica of every shard in the cluster, gathers the results to select the overall top 10 by relevance, and returns them to us.

Usually, however, we only want to search within one or more specific indices, and perhaps one or more specific types. We can do this by specifying them in the URL:

| URL | Meaning |
|-----|-----|
| `/_search` | search all types in all indices |
| `/gb/_search` | search all types in the `gb` index |
| `/gb,us/_search` | search all types in the `gb` and `us` indices |
| `/g*,u*/_search` | search all types in any index whose name begins with `g` or `u` |
| `/gb/user/_search` | search type `user` in the `gb` index |
| `/gb,us/user,tweet/_search` | search types `user` and `tweet` in the `gb` and `us` indices |
| `/_all/user,tweet/_search` | search types `user` and `tweet` in all indices |

When you search within a single index, Elasticsearch forwards the search request to a primary or replica of every shard in that index, and then gathers the results from each shard. Searching within multiple indices works in exactly the same way -- there are just more shards involved.

> ### IMPORTANT
>
> Searching one index that has five primary shards is **exactly equivalent** to searching five indices that each have one primary shard.
>
> Later on, you will see how this simple fact makes it easy to scale flexibly as your requirements change.
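Combining these URLs with the lightweight query string search described earlier gives scoped searches. A sketch of our own (the names follow this book's sample dataset): find tweets that mention `elasticsearch` in both the `gb` and `us` indices, limited to the `tweet` type:

~~~
GET /gb,us/tweet/_search?q=tweet:elasticsearch
~~~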

The empty search


# The empty search

The most basic form of the search API is the _empty search_, which doesn't specify any query at all and simply returns documents from all indices in the cluster:

~~~
GET /_search
~~~

The response looks something like this (edited for brevity):

~~~
{
   "hits" : {
      "total" :       14,
      "hits" : [
        {
          "_index":   "us",
          "_type":    "tweet",
          "_id":      "7",
          "_score":   1,
          "_source": {
             "date":    "2014-09-17",
             "name":    "John Smith",
             "tweet":   "The Query DSL is really powerful and flexible",
             "user_id": 2
          }
        },
        ... 9 results omitted ...
      ],
      "max_score" :   1
   },
   "took" :           4,
   "_shards" : {
      "failed" :      0,
      "successful" :  10,
      "total" :       10
   },
   "timed_out" :      false
}
~~~

### `hits`

The most important part of the response is `hits`, which contains the `total` number of documents that matched our query, and a `hits` array containing the first 10 of those matching documents -- the search results.

Each result in the `hits` array contains the document's `_index`, `_type` and `_id`, plus the `_source` field. This means that the whole document is directly available from the search results. This is quite different from other search engines, which return just the document ID and require you to fetch the document itself in a separate step.

Each element also has a `_score` field. This is the _relevance score_, a measure of how well the document matches the query. By and large, results are returned with the most relevant documents first -- that is, sorted by `_score` in descending order. In this example, we didn't specify any query, so every `_score` is the neutral value `1`.

The `max_score` value is the highest `_score` of any matching document.

### `took`

The `took` value tells us how many milliseconds the search request took to execute.

### `shards`

The `_shards` element tells us the `total` number of shards that participated in the query, and how many of them were `successful` and how many `failed`. We wouldn't normally expect failures, but they can happen. If we were to suffer a major disaster in which both the primary and the replica copies of the same shard were lost, there would be no usable copy of that shard available to the query. In that case, Elasticsearch would report the shard as `failed`, but would still return the results from the remaining shards.

### `timeout`

The `timed_out` value tells us whether the query timed out. Normally, search requests do not time out. If a fast response matters more to you than complete results, you can specify a `timeout` value, such as `10`, `"10ms"` (10 milliseconds) or `"1s"` (1 second):

~~~
GET /_search?timeout=10ms
~~~

Elasticsearch will return whatever results it has managed to gather within the time you specified.

> ### Timeout is not the terminator
>
> It should be stressed that the `timeout` does not terminate the query; it merely returns whatever results have been gathered _so far_ and closes the connection. In the background, other shards may still be processing the query, even though the results have already been returned.
>
> Use the timeout because you need to honour your service guarantees, not because you want to abort your queries.

Search


### Search -- the basic tools

So far, we have learned to use Elasticsearch as a distributed NoSQL document store: we can throw JSON documents at Elasticsearch and retrieve each one directly by ID. But the real power of Elasticsearch lies in making sense out of chaos -- in turning big data into big information.

This is the reason we use structured JSON documents rather than amorphous data. Elasticsearch not only _stores_ documents, it also _indexes_ them so that they can be searched. **Every field in a document is indexed and can be queried.** And it's not just that: in a single query, Elasticsearch can use **all** of these indices and return results at astonishing speed -- something a traditional database could never match.

A _search_ can be:

- structured data such as `age`, `gender` or `join_date`, queried much as you would in SQL
- a full text search, which finds documents matching the search keywords and returns them sorted by _relevance_
- a combination of the two.

While many search operations work straight out of the box, to use Elasticsearch to its full potential you need to understand three subjects:

| Concept | Meaning |
|-----|-----|
| _Mapping_ | How the data in each field is interpreted |
| _Analysis_ | How full text is processed to make it searchable |
| _Query DSL_ | The flexible, powerful query language used by Elasticsearch |

Each of these is a big subject in its own right, and we will explore them in detail in the later "Search in depth" chapters. In this chapter we introduce just the basic concepts of all three -- enough to help you understand how search works.

We will start by introducing the simple ways to use the `search` API.

> ### Test data
>
> The documents used in this chapter can be found in this gist: [https://gist.github.com/clintongormley/8579281](https://gist.github.com/clintongormley/8579281)
>
> You can download the commands and import them into your shell to follow along.

Bulk format


### Why the funny format?

When we learned about bulk requests earlier in <>, you may have asked yourself: "Why does the `bulk` API require the funny format with the newline characters, instead of just sending the requests wrapped in a JSON array, like the `mget` API?"

To answer this, we need to explain a little background:

Each document referenced in a bulk request may belong to a different primary shard, each of which may be allocated to any of the nodes in the cluster. This means that every _action_ inside a `bulk` request needs to be forwarded to the correct shard on the correct node.

If the individual requests were wrapped up in a JSON array, that would mean that we would need to:

- parse the JSON into an array (including the document data, which can be very large)
- look at each request to determine which shard it should go to
- create an array of requests for each shard
- serialize these arrays into the internal transport format
- send the requests to each shard

It would work, but would need a lot of RAM to hold copies of essentially the same data, and would create many more data structures that the JVM would have to spend time garbage collecting.

Instead, Elasticsearch reaches up into the networking buffer, where the raw request has been received, and reads the data directly. It uses the newline characters to identify and parse just the small _action/metadata_ lines in order to decide which shard should handle each request.

These raw requests are forwarded directly to the correct shard. There is no redundant copying of data, no wasted data structures. The entire request process is handled in the smallest amount of memory possible.
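For reference, here is a sketch of the shape of the format under discussion (the index, type, IDs and fields are our own illustration): each _action/metadata_ line is followed, where applicable, by a document line, and every line -- including the last -- must end with a newline character:

~~~
POST /_bulk
{ "index":  { "_index": "us", "_type": "tweet", "_id": "15" }}
{ "tweet":  "Trying out the bulk API", "user_id": 2 }
{ "delete": { "_index": "us", "_type": "tweet", "_id": "14" }}
~~~

Because each action line is small and self-contained, the node can route it by reading just that line out of the buffer, exactly as described above.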

Bulk requests


### Multi-document patterns

The patterns for the `mget` and `bulk` APIs are similar to those for individual documents. The difference is that the requesting node knows in which shard each document lives. It breaks up the multi-document request into a multi-document request _per shard_, and forwards these in parallel to each participating node.

Once it receives answers from each node, it collates their responses into a single response, which it returns to the client.

Figure: Retrieving multiple documents with `mget` (images/04-05_mget.png)

Below we list the sequence of steps necessary to retrieve multiple documents with a single `mget` request, as depicted in <>:

1. The client sends an `mget` request to `Node 1`.
2. `Node 1` builds a multi-get request per shard, and forwards these requests in parallel to the nodes hosting each required primary or replica shard. Once all replies have been received, `Node 1` builds the response and returns it to the client.

A `routing` parameter can be set for each document in the `docs` array, and the `preference` parameter can be set for the top-level `mget` request.

Figure: Multiple document changes with `bulk` (images/04-06_bulk.png)

Below we list the sequence of steps necessary to execute multiple `create`, `index`, `delete` and `update` requests within a single `bulk` request, as depicted in <>:

1. The client sends a `bulk` request to `Node 1`.
2. `Node 1` builds a bulk request per shard, and forwards these requests in parallel to the nodes hosting each involved primary shard.
3. The primary shard executes each action serially, one after another. As each action succeeds, the primary forwards the new document (or deletion) to its replica shards in parallel, then moves on to the next action. Once all replica shards report success for all actions, the node reports success to the requesting node, which collates the responses and returns them to the client.

The `bulk` API also accepts the `replication` and `consistency` parameters at the top level for the whole `bulk` request, and the `routing` parameter in the metadata for each request.
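As a concrete illustration of a request that follows this pattern (the document IDs here are hypothetical), an `mget` can retrieve documents living in different indices -- and therefore potentially on different shards and nodes -- in a single round trip:

~~~
GET /_mget
{
   "docs" : [
      { "_index" : "us", "_type" : "tweet", "_id" : "1" },
      { "_index" : "gb", "_type" : "tweet", "_id" : "3" }
   ]
}
~~~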

Partial updates


### Partial updates to a document

The `update` API combines the read and write patterns explained above.

Figure: Partial updates to a document (images/04-04_update.png)

Below we list the sequence of steps used to perform a partial update on a document, as depicted in <>:

1. The client sends an update request to `Node 1`.
2. It forwards the request to `Node 3`, where the primary shard is allocated.
3. `Node 3` retrieves the document from the primary shard, changes the JSON in the `_source` field, and tries to reindex the document on the primary shard. If the document has already been changed by another process, it retries step 3 up to `retry_on_conflict` times, before giving up.
4. If `Node 3` has managed to update the document successfully, it forwards the new version of the document in parallel to the replica shards on `Node 1` and `Node 2` to be reindexed. Once all replica shards report success, `Node 3` reports success to the requesting node, which reports success to the client.

The `update` API also accepts the `routing`, `replication`, `consistency` and `timeout` parameters that are explained in <>.

> ### Document-based replication
>
> When a primary shard forwards changes to its replica shards, it doesn't forward the update request. Instead it forwards the new version of the full document. Remember that these changes are forwarded to the replica shards asynchronously, and there is no guarantee that they will arrive in the same order that they were sent. If Elasticsearch forwarded just the change, it is possible that changes would be applied in the wrong order, resulting in a corrupt document.
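A sketch of what such a request can look like (the index, type and field are our own illustration): a partial update that will retry up to five times if it hits the conflict described in step 3:

~~~
POST /website/blog/1/_update?retry_on_conflict=5
{
   "doc" : {
      "views" : 0
   }
}
~~~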

Retrieving a document


### Retrieving a document

A document can be retrieved from a primary shard or from any of its replicas.

Figure: Retrieving a single document (images/04-03_get.png)

Below we list the sequence of steps to retrieve a document from either a primary or replica shard, as depicted in <>:

1. The client sends a get request to `Node 1`.
2. The node uses the document's `_id` to determine that the document belongs to shard `0`. Copies of shard `0` exist on all three nodes. On this occasion, it forwards the request to `Node 2`.
3. `Node 2` returns the document to `Node 1`, which returns the document to the client.

For read requests, the requesting node will choose a different shard copy on every request in order to balance the load -- it round-robins through all shard copies.

It is possible that a document has been indexed on the primary shard but has not yet been copied to the replica shards. In this case a replica might report that the document doesn't exist, while the primary would have returned the document successfully.

Create, index and delete


### Creating, indexing and deleting a document

Create, index and delete requests are _write_ operations, which must be successfully completed on the primary shard before they can be copied to any associated replica shards.

Figure: Creating, indexing or deleting a single document (images/04-02_write.png)

Below we list the sequence of steps necessary to successfully create, index or delete a document on both the primary and any replica shards, as depicted in <>:

1. The client sends a create, index or delete request to `Node 1`.
2. The node uses the document's `_id` to determine that the document belongs to shard `0`. It forwards the request to `Node 3`, where the primary copy of shard `0` is currently allocated.
3. `Node 3` executes the request on the primary shard. If it is successful, it forwards the request in parallel to the replica shards on `Node 1` and `Node 2`. Once all of the replica shards report success, `Node 3` reports success to the requesting node, which reports success to the client.

By the time the client receives a successful response, the document change has been executed on the primary shard and on all replica shards. Your change is safe.

There are a number of optional request parameters which allow you to influence this process, possibly increasing performance at the cost of data security. These options are seldom used because Elasticsearch is already fast, but they are explained here for the sake of completeness.

#### `replication`

The default value for `replication` is `sync`. This causes the primary shard to wait for successful responses from the replica shards before returning.

If you set `replication` to `async`, then it will return success to the client as soon as the request has been executed on the primary shard. It will still forward the request to the replicas, but you will not know whether the replicas succeeded or not.

It is advisable to use the default `sync` replication, as it is possible to overload Elasticsearch by sending too many requests without waiting for their completion.

#### `consistency`

By default, the primary shard requires a _quorum_, or majority, of shard copies (where a shard copy can be a primary or a replica shard) to be available before even attempting a write operation. This is to prevent writing data to the "wrong side" of a network partition. A quorum is defined as:

~~~
int( (primary + number_of_replicas) / 2 ) + 1
~~~

The allowed values for `consistency` are `one` (just the primary shard), `all` (the primary and all replicas), or the default `quorum`, or majority, of shard copies.

Note that the `number_of_replicas` is the number of replicas _specified_ in the index settings, not the number of replicas that are currently active. If you have specified that an index should have 3 replicas, then a quorum would be:

~~~
int( (1 primary + 3 replicas) / 2 ) + 1 = 3
~~~

But if you only start 2 nodes, then there will be insufficient active shard copies to satisfy the quorum, and you will be unable to index or delete any documents.

#### `timeout`

What happens if insufficient shard copies are available? Elasticsearch waits, in the hope that more shards will appear. By default, it will wait up to one minute. If you need to, you can use the `timeout` parameter to make it abort sooner: `100` is 100 milliseconds, `30s` is 30 seconds.

> ### NOTE
>
> A new index has `1` replica by default, which means that two active shard copies _should_ be required in order to satisfy the need for a `quorum`. However, these default settings would prevent us from doing anything useful with a single-node cluster. To avoid this problem, the requirement for a quorum is only enforced when `number_of_replicas` is greater than `1`.
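A sketch of these parameters applied to a single request (the index and document are our own illustration, using the parameter names above): index a document, requiring only the primary shard to be available, and give up after five seconds if it isn't:

~~~
PUT /website/blog/1?consistency=one&timeout=5s
{
   "title": "My first blog entry"
}
~~~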

How primary and replica shards interact


### How primary and replica shards interact

For explanation purposes, let's imagine that we have a cluster consisting of 3 nodes. It contains one index called `blogs` which has two primary shards. Each primary shard has two replicas. Copies of the same shard are never allocated to the same node, so our cluster looks something like <>.

Figure: A cluster with three nodes and one index (images/04-01_index.png)

We can send our requests to any node in the cluster. Every node is fully capable of serving any request. Every node knows the location of every document in the cluster and so can forward requests directly to the required node. In the examples below, we will send all of our requests to `Node 1`, which we will refer to as the _requesting node_.

TIP: When sending requests, it is good practice to round-robin through all the nodes in the cluster, in order to spread the load.
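A sketch of how an index with that layout might be created (the `settings` keys shown are standard index settings; the example itself is ours):

~~~
PUT /blogs
{
   "settings" : {
      "number_of_shards" :   2,
      "number_of_replicas" : 2
   }
}
~~~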

Routing


### Routing a document to a shard

When you index a document, it is stored on a single primary shard. How does Elasticsearch know which shard a document belongs to? When we create a new document, how does it know whether it should store that document on shard 1 or shard 2?

The process can't be random, since we may need to retrieve the document in the future. In fact, it is determined by a very simple formula:

~~~
shard = hash(routing) % number_of_primary_shards
~~~

The `routing` value is an arbitrary string, which defaults to the document's `_id` but can also be set to a custom value. This `routing` string is passed through a hashing function to generate a number, which is divided by the number of primary shards in the index to return the _remainder_. The remainder will always be in the range `0` to `number_of_primary_shards - 1`, and gives us the number of the shard where a particular document lives.

This explains why the number of primary shards can only be set when an index is created and never changed: if the number of primary shards ever changed in the future, all previous routing values would be invalid and documents would never be found.

All document APIs (`get`, `index`, `delete`, `bulk`, `update` and `mget`) accept a `routing` parameter that can be used to customize the document-to-shard mapping. A custom routing value could be used to ensure that all related documents -- for instance, all the documents belonging to the same user -- are stored on the same shard. We discuss in detail why you may want to do this in <>.
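A sketch of custom routing in practice (the index, type and routing value are our own illustration). Note that the same routing value must be supplied again when retrieving the document, since it determines which shard is consulted:

~~~
PUT /forum/posts/1?routing=user123
{
   "title" : "Keeping all of one user's posts on the same shard"
}

GET /forum/posts/1?routing=user123
~~~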

Distributed document store


### Distributed document store

In the last chapter, we looked at all the ways to put data into your index and then retrieve it. But we glossed over many technical details surrounding how the data is distributed and fetched from the cluster. This separation is done on purpose -- you don't really need to know how data is distributed to work with Elasticsearch. It just works.

In this chapter, we are going to dive into those internal, technical details to help you understand how your data is stored in a distributed system.

> ### Content warning
>
> The information presented below is for your interest. You are not required to understand and remember all the detail in order to use Elasticsearch. The options that are discussed are for advanced users only.
>
> Read the section to gain a taste for how things work, and to know where the information is in case you need to refer to it in the future, but don't be overwhelmed by the detail.

Summary


# Summary

By now you should know how to use Elasticsearch as a distributed document store. You can store, update, retrieve and delete documents, and you know how to do all of this safely. That is already very useful, even though we have not yet touched on the most exciting part -- querying your documents. But first, let's explore the internal processes that Elasticsearch uses to manage your documents safely in a distributed environment.