Elasticsearch (Part 2): Structured Search
Contents
- Exact-value search
- Range search
- Exists search
- Prefix search
- Wildcard search
- Regular-expression search
- Fuzzy search
- ID search
- Full-text search
- Compound search
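Search basics
Basic components of a search request
- query: the most important part of a search request.
- size: the number of documents to return.
- from: used together with size for pagination. The from parameter is zero-based.
- _source: controls how the _source field is returned. By default the complete _source field is returned; configuring _source filters which fields come back.
- sort: by default, results are sorted by document score.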
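To make the relationship between from and size concrete, here is a minimal Python sketch that assembles a request body for a given page. The helper name build_search_body and its 1-based page parameter are assumptions made for illustration; they are not part of any Elasticsearch client.

```python
import json


def build_search_body(query, page, page_size, source_fields=None, sort=None):
    """Assemble a search request body (hypothetical helper).

    `page` is 1-based for the caller's convenience, while the `from`
    parameter sent to Elasticsearch is 0-based.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    body = {
        "query": query,
        "from": (page - 1) * page_size,  # number of hits to skip
        "size": page_size,               # number of hits to return
    }
    if source_fields is not None:
        body["_source"] = source_fields
    if sort is not None:
        body["sort"] = sort
    return body


body = build_search_body(
    {"match": {"title": "elasticsearch"}},
    page=2,
    page_size=10,
    source_fields=["title", "date"],
    sort=[{"date": "asc"}],
)
print(json.dumps(body, indent=2))
```

Page 2 with ten hits per page skips the first ten hits, so the body carries from = 10 and size = 10.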
URL-based search request
curl 'localhost:9200/get-together/_search?from=10&size=10&sort=date:asc&_source=title,date&q=elasticsearch'
Request-body search
curl 'localhost:9200/get-together/_search' -d
Wildcards in the returned _source fields
# Return the date field and every field whose name starts with "location.",
# but leave out location.geolocation.
curl 'localhost:9200/get-together/_search' -H 'Content-Type: application/json' -d '{
  "query": {
    "match_all": {}
  },
  "_source": {
    "includes": ["location.*", "date"],
    "excludes": ["location.geolocation"]
  }
}'
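The effect of the include and exclude patterns can be approximated in a few lines of Python. This is only a sketch under simplifying assumptions: field names are given as flat dotted paths, and fnmatch-style matching stands in for the wildcard rules Elasticsearch applies to nested objects. The function name filter_source_fields is hypothetical.

```python
from fnmatch import fnmatchcase


def filter_source_fields(field_paths, includes, excludes):
    """Keep paths matching an include pattern and no exclude pattern.

    An empty `includes` list is treated as "include everything".
    """
    kept = []
    for path in field_paths:
        included = not includes or any(fnmatchcase(path, p) for p in includes)
        excluded = any(fnmatchcase(path, p) for p in excludes)
        if included and not excluded:
            kept.append(path)
    return kept


fields = ["name", "date", "location.name", "location.geolocation"]
kept = filter_source_fields(
    fields,
    includes=["location.*", "date"],
    excludes=["location.geolocation"],
)
print(kept)  # ['date', 'location.name']
```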
Sorting the results
# Sort by creation date first (oldest to newest), then by group name in
# reverse alphabetical order, and finally by relevance score (_score).
curl 'localhost:9200/get-together/_search' -H 'Content-Type: application/json' -d '{
  "query": {
    "match_all": {}
  },
  "sort": [
    {"created_on": "asc"},
    {"name": "desc"},
    "_score"
  ]
}'
The returned result
{
  // Time the query took, in milliseconds
  "took": 27,
  // Whether any shard timed out, that is, whether only partial results were returned
  "timed_out": false,
  "_shards": {
    // Number of shards that answered the request successfully and number that failed
    "total": 15,
    "successful": 15,
    "skipped": 0,
    "failed": 0
  },
  // The response contains a hits key whose value holds the array of matching documents
  "hits": {
    "total": {
      // Total number of documents matching the request
      "value": 3,
      "relation": "eq"
    },
    // Highest score among the search results
    "max_score": 2.0000367,
    // Array of matching documents inside the hits element
    "hits": [
      {
        // Index of the result document
        "_index": "xxxxx",
        // Elasticsearch type of the result document
        "_type": "_doc",
        // ID of the result document
        "_id": "vZATIXYB5sSlphETLNfC",
        // Relevance score of the result
        "_score": 2.0000367,
        "_source": {}
      }
    ]
  }
}
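Elasticsearch search categories
Structured search
Exact-value search
Single exact value: term query
- If the field type is text, the value is treated as full text and passed through an analyzer. A value made up of Chinese text, or of words separated by spaces or tabs, is split into several terms to be searched. For example, "Query Name Space" may be analyzed into [Query, Name, Space] (the standard analyzer also lowercases each term).
- If the field type is keyword, the value is treated as an exact value. Exact values (numbers, dates, keywords) are stored and compared exactly as given.
{
// Java API: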
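The difference between text and keyword fields explains why a term query against a text field often finds nothing. The Python sketch below illustrates the idea with a deliberately crude stand-in for an analyzer (split on whitespace, lowercase); toy_analyze and term_matches are hypothetical names, and real analyzers do considerably more.

```python
def toy_analyze(text):
    """Crude stand-in for an analyzer: split on whitespace and lowercase."""
    return [token.lower() for token in text.split()]


def term_matches(indexed_terms, query_value):
    """A term query compares the query value to indexed terms as-is,
    without analyzing the query value first."""
    return query_value in indexed_terms


value = "Query Name Space"
text_terms = toy_analyze(value)  # roughly how a text field indexes the value
keyword_terms = [value]          # a keyword field indexes the value as one term

print(term_matches(text_terms, "Query Name Space"))     # False
print(term_matches(text_terms, "query"))                # True
print(term_matches(keyword_terms, "Query Name Space"))  # True
```

The whole string is not among the analyzed terms of the text field, so the exact-value lookup only succeeds against the keyword field.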
Boolean filter: bool filter
{
Multiple exact values: terms query
GET /kibana_sample_data_logs/_search
Terms set query
Returns documents that contain a minimum number of exact terms in a provided field.
The terms_set query is the same as the terms query, except you can define the number of matching terms required to return a document. For example:
- A field, programming_languages, contains a list of known programming languages, such as c++, java, or php for job candidates. You can use the terms_set query to return documents that match at least two of these languages.
- A field, permissions, contains a list of possible user permissions for an application. You can use the terms_set query to return documents that match a subset of these permissions.
PUT /job-candidates
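The matching rule of terms_set can be sketched in Python as a set intersection with a threshold. The candidate names and the plain integer threshold are assumptions for illustration; in Elasticsearch the required count comes from a field or a script rather than a literal argument.

```python
def terms_set_matches(field_values, query_terms, minimum_should_match):
    """Return True when at least `minimum_should_match` query terms
    appear among the field's values."""
    matched = set(field_values) & set(query_terms)
    return len(matched) >= minimum_should_match


# Hypothetical candidates and their programming_languages values.
candidates = {
    "candidate_a": ["c++", "java"],
    "candidate_b": ["java", "php"],
    "candidate_c": ["php"],
}

query_terms = ["c++", "java", "php"]
hits = [
    name
    for name, languages in candidates.items()
    if terms_set_matches(languages, query_terms, minimum_should_match=2)
]
print(hits)  # ['candidate_a', 'candidate_b']
```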
Range search: range query
// range query
Exists search: exists query
Returns documents that contain an indexed value for a field.
Prefix search: prefix query
// prefix query
Wildcard search: wildcard query
This parameter supports two wildcard operators:
Regular-expression search: regexp query
Operators supported in value:
.
Matches any character. For example:
ab. # matches ‘aba’, ‘abb’, ‘abz’, etc.
?
Repeat the preceding character zero or one times. Often used to make the preceding character optional. For example:
abc? # matches ‘ab’ and ‘abc’
+
Repeat the preceding character one or more times. For example:
ab+ # matches ‘ab’, ‘abb’, ‘abbb’, etc.
*
Repeat the preceding character zero or more times. For example:
ab* # matches ‘a’, ‘ab’, ‘abb’, ‘abbb’, etc.
{}
Minimum and maximum number of times the preceding character can repeat. For example:
a{2} # matches ‘aa’
a{2,4} # matches ‘aa’, ‘aaa’, and ‘aaaa’
a{2,} # matches ‘a’ repeated two or more times
|
OR operator. The match will succeed if the longest pattern on either the left side OR the right side matches. For example:
abc|xyz # matches ‘abc’ and ‘xyz’
( … )
Forms a group. You can use a group to treat part of the expression as a single character. For example:
abc(def)? # matches ‘abc’ and ‘abcdef’ but not ‘abcd’
[ … ]
Match one of the characters in the brackets. For example:
[abc] # matches ‘a’, ‘b’, ‘c’
Inside the brackets, - indicates a range unless - is the first character or escaped. For example:
[a-c] # matches ‘a’, ‘b’, or ‘c’
[-abc] # ‘-’ is first character. Matches ‘-’, ‘a’, ‘b’, or ‘c’
[abc\-] # Escapes ‘-’. Matches ‘a’, ‘b’, ‘c’, or ‘-’
A ^ before a character in the brackets negates the character or range. For example:
[^abc] # matches any character except ‘a’, ‘b’, or ‘c’
[^a-c] # matches any character except ‘a’, ‘b’, or ‘c’
[^-abc] # matches any character except ‘-’, ‘a’, ‘b’, or ‘c’
[^abc\-] # matches any character except ‘a’, ‘b’, ‘c’, or ‘-’
Values of the flags parameter:
ALL (Default)
Enables all optional operators.
COMPLEMENT
Enables the ~ operator. You can use ~ to negate the shortest following pattern. For example:
a~bc # matches ‘adc’ and ‘aec’ but not ‘abc’
INTERVAL
Enables the <> operators. You can use <> to match a numeric range. For example:
foo<1-100> # matches ‘foo1’, ‘foo2’ … ‘foo99’, ‘foo100’
foo<01-100> # matches ‘foo01’, ‘foo02’ … ‘foo99’, ‘foo100’
INTERSECTION
Enables the & operator, which acts as an AND operator. The match will succeed if patterns on both the left side AND the right side match. For example:
aaa.+&.+bbb # matches ‘aaabbb’
ANYSTRING
Enables the @ operator. You can use @ to match any entire string.
You can combine the @ operator with & and ~ operators to create an “everything except” logic. For example:
@&~(abc.+) # matches everything except terms beginning with ‘abc’
GET /kibana_sample_data_logs/_search
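Elasticsearch evaluates regexp patterns with Lucene's regular-expression engine, not with Python's. For the standard operators listed above the two behave alike, provided the pattern has to match the entire term, which re.fullmatch emulates. The sketch below checks a few of the documented examples under that assumption; the optional operators (~, <>, &, @) have no direct equivalent in Python's re module and are left out. The helper name matches_whole_term is hypothetical.

```python
import re


def matches_whole_term(pattern, term):
    """Lucene patterns are anchored, so the pattern must cover the whole term."""
    return re.fullmatch(pattern, term) is not None


# (pattern, term) pairs taken from the operator descriptions.
checks = [
    ("ab.", "abz"),
    ("abc?", "ab"),
    ("ab+", "abbb"),
    ("ab*", "a"),
    ("a{2,4}", "aaaaa"),
    ("abc|xyz", "xyz"),
    ("abc(def)?", "abcd"),
    ("[^a-c]", "d"),
    ("ab", "abc"),
]
results = {(p, t): matches_whole_term(p, t) for p, t in checks}
for (pattern, term), matched in results.items():
    print(f"{pattern!r} vs {term!r}: {matched}")
```

The last pair shows the anchoring: the pattern ab does not match the term abc, because the pattern has to consume the whole term.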
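Fuzzy search: fuzzy query
Parameter | Meaning |
---|---|
fuzziness | Maximum edit distance. Defaults to AUTO, which follows Elasticsearch's built-in rule. The other allowed values are 0, 1, and 2, so the edit distance can be at most 2. In AUTO mode, Elasticsearch picks the edit distance from the length of the query term. The length boundaries can be customized as AUTO:[low],[high]; when omitted they default to 3 and 6 (equivalent to AUTO:3,6), which gives: length 0-2, must match exactly; length 3-5, edit distance 1; length greater than 5, edit distance 2. |
prefix_length | Number of leading characters that are left unchanged, that is, not "fuzzified". The parameter rests on the assumption that users rarely make a mistake at the very start of a word. Setting it reduces the number of terms examined when collecting candidates within the allowed edit distance. Defaults to 0. |
max_expansions | Maximum number of terms the fuzzy query expands to. Defaults to 50. |
transpositions | Whether swapping two adjacent characters (for example ab -> ba) counts as a single edit when computing the edit distance. When set to true, the distance computed is effectively the Damerau-Levenshtein distance. Defaults to true. |
GET /kibana_sample_data_logs/_search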
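The AUTO rule and the effect of transpositions can be illustrated with a short Python sketch. It is not how Lucene implements fuzzy matching internally; it only reproduces the length thresholds described in the table and computes an edit distance in which an adjacent swap optionally costs one edit. The function names are hypothetical.

```python
def auto_fuzziness(term_length, low=3, high=6):
    """Edit distance allowed by AUTO:[low],[high] for a term of this length."""
    if term_length < low:
        return 0
    if term_length < high:
        return 1
    return 2


def edit_distance(a, b, transpositions=True):
    """Levenshtein distance; with transpositions=True an adjacent swap
    counts as one edit (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


print([auto_fuzziness(n) for n in (2, 3, 5, 6)])          # [0, 1, 1, 2]
print(edit_distance("ab", "ba", transpositions=True))     # 1
print(edit_distance("ab", "ba", transpositions=False))    # 2
```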
ID search: ids query
Returns documents based on their IDs. This query uses the document IDs stored in the _id field.
GET /kibana_sample_data_logs/_search
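Full-text search
Analyzed full-text search: match query
The match query supports fields of type text, numeric, and date.
// 1. match query; field values are matched case-insensitively by default
Phrase search: match_phrase query
- Phrase search requires the field to contain the given terms next to each other and in the given order, so precision is high but recall is low.
- It is similar to match, but stricter: the query text is still analyzed, and every resulting term must be present at consecutive positions. A match query with "operator": "and" also requires every term, but ignores their order.
GET /kibana_sample_data_logs/_search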
Interval search: intervals query
Returns documents based on the order and proximity of matching terms.
The intervals query uses matching rules, constructed from a small set of definitions. These rules are then applied to terms from a specified field.
The definitions produce sequences of minimal intervals that span terms in a body of text. These intervals can be further combined and filtered by parent sources.
<field>
(Required, rule object) Field you wish to search.
The value of this parameter is a rule object used to match documents based on matching terms, order, and proximity.
Valid rules include:
match
Parameter | Description |
---|---|
query | The text to search for |
max_gaps | Maximum number of positions between the matching terms in the text field; matches further apart are not returned. Defaults to -1 (no limit). If set to 0, the terms must appear directly next to each other |
ordered | Whether the terms must appear in the given order. Defaults to false, meaning order is ignored |
analyzer | Analyzer applied to the query text. Defaults to the search analyzer configured for the field in the mapping |
filter | An optional intervals filter for the query. It has its own syntax and is not the same as a Boolean filter |
prefix
wildcard
fuzzy
all_of
Parameter | Description |
---|---|
intervals | A set of interval rules; a document must satisfy all of them |
max_gaps | Maximum number of positions between the combined intervals in a document; matches further apart are not returned. Defaults to -1 (no limit). If set to 0, the intervals must appear directly next to each other |
ordered | Whether the intervals must appear in the given order. Defaults to false |
filter | An optional intervals filter for the query. It has its own syntax and is not the same as a Boolean filter |
any_of
Parameter | Description |
---|---|
intervals | A set of interval rules; a document only needs to satisfy one of them |
filter | An optional intervals filter for the query |
POST _search
Phrase-prefix search: match_phrase_prefix query
query
(Required, string) Text you wish to find in the provided <field>.
The match_phrase_prefix query analyzes any provided text into tokens before performing a search. The last term of this text is treated as a prefix, matching any words that begin with that term.
analyzer
(Optional, string) Analyzer used to convert text in the query value into tokens. Defaults to the index-time analyzer mapped for the <field>. If no analyzer is mapped, the index’s default analyzer is used.
max_expansions
(Optional, integer) Maximum number of terms to which the last provided term of the query value will expand. Defaults to 50.
slop
(Optional, integer) Maximum number of positions allowed between matching tokens. Defaults to 0. Transposed terms have a slop of 2.
zero_terms_query
(Optional, string) Indicates whether no documents are returned if the analyzer removes all tokens, such as when using a stop filter. Valid values are:
none (Default)
No documents are returned if the analyzer removes all tokens.
all
Returns all documents, similar to a match_all query.
// Similar to match_phrase, except that the last term of the input text is matched as a prefix
Multi-field search: multi_match query
The way the multi_match query is executed internally depends on the type parameter, which can be set to:
Parameter | Description |
---|---|
best_fields | (default) Finds documents which match any field, but uses the _score from the best field. Documents in which a single field matches more of the keywords rank first. |
most_fields | Finds documents which match any field and combines the _score from each field. Documents in which more fields match the keywords rank first. With most_fields, minimum_should_match is applied per field, so it does not trim the long tail of weakly matching documents effectively. |
cross_fields | Treats fields with the same analyzer as though they were one big field. Looks for each word in any field. |
phrase | Runs a match_phrase query on each field and uses the _score from the best field. |
phrase_prefix | Runs a match_phrase_prefix query on each field and uses the _score from the best field. |
bool_prefix | Creates a match_bool_prefix query on each field and combines the _score from each field. |
// multi_match is similar to match, except that the query text can be matched against several named fields
String search with AND/OR/NOT support: query_string
A query that uses a query parser to parse its content.
The query_string query offers a compact shorthand syntax for running multi_match queries, bool queries, boosting, fuzzy matching, wildcards, regexp, and range queries.
curl -XPOST 'http://localhost:9200/get-together/_search?pretty' -d
Simplified string search: simple_query_string
A query that uses the SimpleQueryParser to parse its content. Unlike the regular query_string query, the simple_query_string query never throws an exception; it discards the invalid parts of the query instead.
GET /kibana_sample_data_logs/_search
match_bool_prefix query
An important difference between the match_bool_prefix query and match_phrase_prefix is that the match_phrase_prefix query matches its terms as a phrase, but the match_bool_prefix query can match its terms in any position. The example match_bool_prefix query below could match a field containing quick brown fox, but it could also match brown fox quick. It could also match a field containing the term quick, the term brown and a term starting with f, appearing in any position.
GET /_search
{
  "query": {
    "match_bool_prefix": {
      "message": "quick brown f"
    }
  }
}
When analysis produces the terms quick, brown, and f, this is similar to the following bool query:
GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "message": "quick" }},
        { "term": { "message": "brown" }},
        { "prefix": { "message": "f" }}
      ]
    }
  }
}
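The rewrite from match_bool_prefix to a bool query can be sketched in Python. The analysis step here is a simplification (lowercase and split on whitespace), and the function name rewrite_match_bool_prefix is hypothetical; Elasticsearch performs this rewrite internally with the field's real analyzer.

```python
def rewrite_match_bool_prefix(field, text):
    """Build the bool query that a match_bool_prefix query resembles:
    every term but the last becomes a term clause, the last a prefix clause."""
    terms = text.lower().split()  # toy analysis, not a real analyzer
    if not terms:
        raise ValueError("query text produced no terms")
    should = [{"term": {field: term}} for term in terms[:-1]]
    should.append({"prefix": {field: terms[-1]}})
    return {"bool": {"should": should}}


rewritten = rewrite_match_bool_prefix("message", "quick brown f")
print(rewritten)
```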
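Compound search
Constant-score search: constant_score query
Wraps a filter query and returns every matching document with a relevance score equal to the boost parameter value.
GET /kibana_sample_data_logs/_search
bool compound search
The bool query lets you combine any number of queries in a single query. Its clauses state which parts must match, should match, or must_not match the data in the Elasticsearch index.
- If part of the bool query is a must clause, only documents matching those queries are returned.
- If part of the bool query is a should clause, only documents matching the required number of clauses are returned.
- If there is no must clause, a document has to match at least one should clause to be returned.
- A must_not clause removes the documents it matches from the result set.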
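The rules above can be condensed into a small Python predicate. It is a sketch of the matching logic only (no scoring, no filter clause), and it represents each clause by a name and each document by the set of clause names it matches; both conventions are assumptions for illustration.

```python
def bool_matches(matched, must=(), should=(), must_not=(),
                 minimum_should_match=None):
    """Decide whether a document passes a bool query.

    `matched` is the set of clause names the document matches.
    """
    if minimum_should_match is None:
        # Default described in the text: 1 when there are should clauses
        # but no must clause, otherwise 0.
        minimum_should_match = 1 if should and not must else 0
    if any(clause not in matched for clause in must):
        return False
    if any(clause in matched for clause in must_not):
        return False
    should_hits = sum(1 for clause in should if clause in matched)
    return should_hits >= minimum_should_match


# Only should clauses: at least one has to match.
print(bool_matches({"q2"}, should=["q1", "q2"]))               # True
print(bool_matches(set(), should=["q1", "q2"]))                # False
# With a must clause, should clauses become optional.
print(bool_matches({"q1"}, must=["q1"], should=["q2"]))        # True
# must_not removes matching documents.
print(bool_matches({"q1", "q3"}, must=["q1"], must_not=["q3"]))  # False
```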
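bool clause | Equivalent binary operation | Meaning |
---|---|---|
must | Combines clauses with the binary operation and (query1 AND query2 AND query3) | Every search in the must clause has to match the document. Lowercase and names the function; uppercase AND is the operator |
must_not | Combines clauses with the binary operation not (NOT query1 AND NOT query2 AND NOT query3) | No search in the must_not clause may match the document. Multiple clauses are combined with the binary operator not |
should | Combines clauses with the binary operation or (query1 OR query2 OR query3) | A search in the should clause may or may not match a document, but the number of matching clauses must reach the value of minimum_should_match (defaults to 1 when there is no must clause and to 0 when there is one). It behaves like the binary operator OR |
Score-modifying queries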
Disjunction max query: dis_max
Returns documents matching one or more wrapped queries, called query clauses or clauses.
If a returned document matches multiple query clauses, the dis_max query assigns the document the highest relevance score from any matching clause, plus a tie breaking increment for any additional matching subqueries.
You can use the dis_max to search for a term in fields mapped with different boost factors.
Disjunction means "or", as opposed to conjunction, which means "and". A disjunction max query returns every document that matches any of its queries, but uses only the score of the best-matching query as the document's score.
queries
(Required, array of query objects) Contains one or more query clauses. Returned documents must match one or more of these queries. If a document matches multiple queries, Elasticsearch uses the highest relevance score.
tie_breaker
(Optional, float) Floating point number between 0 and 1.0 used to increase the relevance scores of documents matching multiple query clauses. Defaults to 0.0.
You can use the tie_breaker value to assign higher relevance scores to documents that contain the same term in multiple fields than documents that contain this term in only the best of those multiple fields, without confusing this with the better case of two different terms in the multiple fields.
If a document matches multiple clauses, the dis_max query calculates the relevance score for the document as follows:
- Take the relevance score from a matching clause with the highest score.
- Multiply the score from any other matching clauses by the tie_breaker value.
- Add the highest score to the multiplied scores.
If the tie_breaker value is greater than 0.0, all matching clauses count, but the clause with the highest score counts most.
GET /kibana_sample_data_logs/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "term": { "url": "elastic" } },
        { "term": { "message": "elastic" } }
      ],
      "tie_breaker": 0.7
    }
  }
}
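The three scoring steps translate directly into a few lines of Python. This sketch takes the per-clause scores of one document as given (the numbers are made up) and only reproduces the combination rule; the function name dis_max_score is hypothetical.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Combine the scores of the matching clauses the way dis_max does:
    best score plus tie_breaker times the sum of the other scores."""
    if not clause_scores:
        raise ValueError("the document must match at least one clause")
    if not 0.0 <= tie_breaker <= 1.0:
        raise ValueError("tie_breaker must be between 0 and 1.0")
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie_breaker * others


scores = [2.0, 1.0, 0.5]  # made-up scores of three matching clauses
print(dis_max_score(scores))                   # 2.0
print(dis_max_score(scores, tie_breaker=0.5))  # 2.75
```

With the default tie_breaker of 0.0 only the best clause counts; with 0.5 the other two clauses add half of their combined score.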
function_score query
The function_score allows you to modify the score of documents that are retrieved by a query. This can be useful if, for example, a score function is computationally expensive and it is sufficient to compute the score on a filtered set of documents.
To use function_score, the user has to define a query and one or more functions, that compute a new score for each document returned by the query.
GET /_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "boost": "5",
      "random_score": {},
      "boost_mode": "multiply"
    }
  }
}
Boosting query: boosting
Returns documents matching a positive query while reducing the relevance score of documents that also match a negative query.
You can use the boosting query to demote certain documents without excluding them from the search results.
positive
(Required, query object) Query you wish to run. Any returned documents must match this query.
negative
(Required, query object) Query used to decrease the relevance score of matching documents.
If a returned document matches the positive query and this query, the boosting query calculates the final relevance score for the document as follows:
- Take the original relevance score from the positive query.
- Multiply the score by the negative_boost value.
negative_boost
(Required, float) Floating point number between 0 and 1.0 used to decrease the relevance scores of documents matching the negative query.
GET /kibana_sample_data_logs/_search
{
  "query": {
    "boosting": {
      "positive": {
        "terms": {
          "tags": ["error", "security", "warning"]
        }
      },
      "negative": {
        "terms": {
          "tags": ["success", "info"]
        }
      },
      "negative_boost": 0.5
    }
  },
  "_source": ["tags"]
}
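The score adjustment made by the boosting query is a single multiplication, sketched below in Python. The positive score is assumed to be known already, and the function name boosting_score is hypothetical.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Final score of a document that matched the positive query.

    Documents that also match the negative query keep their place in the
    results but have their score multiplied by negative_boost.
    """
    if not 0.0 <= negative_boost <= 1.0:
        raise ValueError("negative_boost must be between 0 and 1.0")
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


print(boosting_score(2.0, matches_negative=False, negative_boost=0.5))  # 2.0
print(boosting_score(2.0, matches_negative=True, negative_boost=0.5))   # 1.0
```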
Match all query, Match none query
The most simple query, which matches all documents, giving them all a _score of 1.0.
Span queries
- Span containing query
- Span field masking query
- Span first query
- Span multi-term query
- Span near query
- Span not query
- Span or query
- Span term query
- Span within query
Specialized queries
- Distance feature query
- More like this query
- Percolate query
- Rank feature query
- Script query
- Script score query
- Wrapper query
- Pinned Query
Joining queries
- Nested query
- Has child query
- Has parent query
- Parent ID query
Geo queries
- Geo-bounding box query
- Geo-distance query
- Geo-polygon query
- Geo-shape query