ElasticSearch Query Basics

Summary

This article focuses on using ElasticSearch at the data level: creating data, querying it, and aggregating it.

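Basic Concepts

Node and Cluster

ElasticSearch (ES) is a distributed database built on the open-source search engine Lucene. It can store and query massive amounts of data quickly.

A Node is a single running ES instance that stores data.
Multiple Nodes that share the same cluster.name form a Cluster. They work together, share data, and provide failover and scaling.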

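Shards

Shards are slices of the data in ES: the data is split into multiple shards that are distributed across multiple nodes. This is comparable to sharding databases and tables in MySQL.
By default, ES versions before 7.0 split an Index into 5 primary shards; since 7.0 the default is 1.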
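
To make the idea of spreading documents across shards concrete, here is a minimal sketch of routing. Conceptually, ES picks a shard by hashing a routing value (the document id by default) and taking the remainder modulo the number of primary shards. ES itself uses a Murmur3 hash; `zlib.crc32` below is only a stand-in so the example runs with the standard library, and `pick_shard` is a hypothetical helper name.

```python
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document id to a shard number (illustrative only)."""
    # Stand-in hash: Elasticsearch itself uses Murmur3, not CRC32.
    hashed = zlib.crc32(doc_id.encode("utf-8"))
    return hashed % num_primary_shards


if __name__ == "__main__":
    for doc_id in ["1", "2", "3"]:
        print(doc_id, "->", pick_shard(doc_id, 5))
```

Because the shard number depends on the number of primary shards, the same id always lands on the same shard as long as that number does not change.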

Index

As a noun, an Index in ES is a database, comparable to a database in a relational system.
As a verb, to index means to store a Document in an Index (noun) so that it can be searched.

Document

A single record in an Index is called a Document, and an Index consists of many Documents. In relational terms, a Document is one row of data.
A Document in ES is represented as JSON, and a Field in that JSON corresponds to a Column of a relational table.

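Type

Every Document in ES belongs to a Type, which is a virtual logical grouping comparable to a Table in a relational database.

In summary, at the storage level an ES cluster consists of multiple Nodes, and each Node holds multiple Shards. At the logical level, an ES cluster can contain multiple Indexes (databases), each Index can contain multiple Types (tables), each Type contains multiple Documents (rows), and each Document contains multiple Fields (columns).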

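Basic Operations

Inserting Data (Creating an Index, Type, and Document)

An index can be created simply by adding a document. The index is created with default settings, and new fields are added to the type mapping through dynamic mapping.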

request:
curl -L -X PUT -H "Content-Type:application/json" -d \
'{
  "first_name": "John",
  "last_name": "Smith",
  "age": 25,
  "about": "I love to go rock climbing",
  "interests": [ "sports", "music" ],
  "education": {
    "degree": "master",
    "major": "es"
  }
}' \
'http://localhost:9200/ssltd/employee/1'
 
response:
{
    "_index":"ssltd",
    "_type":"employee",
    "_id":"1",
    "_version":1,
    "result":"created",
    "_shards":{
        "total":2,
        "successful":1,
        "failed":0
    },
    "_seq_no":2,
    "_primary_term":1
}

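An HTTP status code of 201 means the document was created. On failure, ES returns 400 or another status code, and the body contains the corresponding error message.
The command above creates an index named ssltd, a type named employee, and a document with id 1, all in a single step. This is, of course, the crudest way to do it.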
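
The same request can be assembled from Python. The sketch below only builds the PUT request and does not send it, so it runs without a cluster. `build_index_request` is a hypothetical helper, and the base URL `http://localhost:9200` is an assumption taken from the other examples in this article.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, doc):
    """Build (but do not send) the PUT request that indexes one document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


employee = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
    "education": {"degree": "master", "major": "es"},
}

req = build_index_request("http://localhost:9200", "ssltd", "employee", 1, employee)
print(req.get_method(), req.full_url)

# Sending the request needs a running node:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, json.load(resp))
```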

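Retrieving a Document by id

Retrieving a document by id is simple: issue an HTTP GET request for the document's "address", which consists of its index, type, and id.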

request:
curl -X GET 'http://localhost:9200/ssltd/employee/1?pretty'

response:
{
  "_index" : "ssltd",
  "_type" : "employee",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "first_name" : "John",
    "last_name" : "Smith",
    "age" : 25,
    "about" : "I love to go rock climbing",
    "interests" : [
      "sports",
      "music"
    ],
    "education" : {
      "degree" : "master",
      "major" : "es"
    }
  }
}

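The response contains metadata about the document: _index is the index name, _type is the type name, _id is the document id, and _version is the version number, which increases by 1 each time the document is updated.
found indicates whether the document was found; false means no matching document exists. The document's actual content is in the _source field.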
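
As a small illustration of how a client might use these fields, the sketch below checks found before reading _source. `extract_source` is a hypothetical helper, and the two sample responses are trimmed-down stand-ins for what the GET request returns.

```python
import json


def extract_source(response_text):
    """Return the document body, or None when the id was not found."""
    resp = json.loads(response_text)
    if not resp.get("found", False):
        return None
    return resp["_source"]


# Trimmed sample responses (assumed shapes, based on the response shown above).
found_response = (
    '{"_index": "ssltd", "_type": "employee", "_id": "1", "_version": 1,'
    ' "found": true, "_source": {"first_name": "John", "last_name": "Smith"}}'
)
missing_response = '{"_index": "ssltd", "_type": "employee", "_id": "99", "found": false}'

print(extract_source(found_response))
print(extract_source(missing_response))
```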

Simple Queries

1. Querying all documents:
The request looks like a get-by-id request, except that the keyword _search replaces the document id at the end.

request:
curl -X GET 'http://localhost:9200/ssltd/employee/_search?pretty'

response:
{
  "took" : 66,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "ssltd",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ],
          "education" : {
            "degree" : "master",
            "major" : "es"
          }
        }
      },
      ...
    ]
  }
}

The results are in the hits array. By default, a search returns the first 10 results.
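
A client typically walks the hits array to get at the documents. The sketch below assumes the response shape shown above, where hits.total is an object with a value key; `summarize_hits` is a hypothetical helper and `sample_response` is a shortened stand-in for a real response.

```python
def summarize_hits(resp):
    """Return the total hit count and a list of (id, score, source) tuples."""
    hits = resp["hits"]
    total = hits["total"]["value"]
    docs = [(h["_id"], h["_score"], h["_source"]) for h in hits["hits"]]
    return total, docs


# Shortened sample response (assumed shape, based on the response shown above).
sample_response = {
    "took": 66,
    "timed_out": False,
    "hits": {
        "total": {"value": 3, "relation": "eq"},
        "max_score": 1.0,
        "hits": [
            {"_index": "ssltd", "_type": "employee", "_id": "1",
             "_score": 1.0, "_source": {"first_name": "John", "last_name": "Smith"}},
        ],
    },
}

total, docs = summarize_hits(sample_response)
for doc_id, score, source in docs:
    print(doc_id, score, source["last_name"])
```

Note that the total can be larger than the number of entries in hits, because only one page of results is returned.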

2. Querying by condition
Example: find documents whose last_name is Smith

request:
curl -X GET 'http://localhost:9200/ssltd/employee/_search?pretty&q=last_name:Smith'

response:
{
  "took" : 28,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.4700036,
    "hits" : [
      {
        "_index" : "ssltd",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.4700036,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ],
          "education" : {
            "degree" : "master",
            "major" : "es"
          }
        }
      },
      ...
    ]
  }
}
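
When the field or value comes from user input, the q parameter should be URL-encoded rather than pasted into the URL. The sketch below builds such a URL with the standard library; `search_url` is a hypothetical helper and the base URL is the one used in the request above.

```python
from urllib.parse import urlencode


def search_url(base_url, index, doc_type, field, value):
    """Build a URI-search URL with a properly encoded q parameter."""
    params = urlencode({"q": f"{field}:{value}"})
    return f"{base_url}/{index}/{doc_type}/_search?{params}"


url = search_url("http://localhost:9200", "ssltd", "employee", "last_name", "Smith")
print(url)
```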

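Querying with DSL

ES provides a rich and flexible query language called Query DSL (Domain Specific Language), which makes it possible to build more complex and powerful queries. A DSL query is sent as a JSON request body.
Example: find documents whose last_name is Smith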

request:
curl -L -X POST -H "Content-Type:application/json" -d \
'{
  "query": {
    "match": {
      "last_name": "Smith"
    }
  }
}' \
'http://localhost:9200/ssltd/employee/_search?pretty'
 
response:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.4700036,
    "hits" : [
      {
        "_index" : "ssltd",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.4700036,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ],
          "education" : {
            "degree" : "master",
            "major" : "es"
          }
        }
      },
      {
        "_index" : "ssltd",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.4700036,
        "_source" : {
          "first_name" : "lucy",
          "last_name" : "Smith",
          "age" : 24,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ],
          "education" : {
            "degree" : "bachelor",
            "major" : "cs"
          }
        }
      }
    ]
  }
}

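Empty Query

An empty query returns all documents in all indexes.

curl -X POST -H "Content-Type:application/json" -d '{}' 'http://localhost:9200/_search?pretty'

An empty query can also be restricted to specific indexes and types:

curl -X POST -H "Content-Type:application/json" -d '{}' 'http://localhost:9200/index_2014*/type1,type2/_search?pretty'

The from and size parameters can be used for pagination: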

curl -X POST -H "Content-Type:application/json" -d \
'{
    "from": 2,
    "size": 10
}' \
'http://localhost:9200/_search?pretty'
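
In practice from is usually derived from a page number: from is the number of hits to skip and size is the number to return. A minimal sketch, where `page_body` is a hypothetical helper and pages are assumed to be numbered from 1:

```python
def page_body(page, page_size=10):
    """Return the from/size request body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}


print(page_body(1))       # first page: skip nothing
print(page_body(3, 20))   # third page of 20: skip the first 40 hits
```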

Query DSL Syntax

To use a DSL query, pass the query parameter:

curl -X POST -H "Content-Type:application/json" -d \
'{
    "query": YOUR_QUERY_HERE
}' \
'http://localhost:9200/_search?pretty'

The DSL equivalent of the empty query {} is the match_all query clause, which matches all documents:

curl -X POST -H "Content-Type:application/json" -d \
'{
    "query": {
        "match_all": {}
    }
}' \
'http://localhost:9200/_search?pretty'

The structure of a single query clause (a simple clause):

{
    QUERY_NAME: {
        ARGUMENT: VALUE,
        ARGUMENT: VALUE,
        ...
    }
}

Or, when the clause targets a specific field:

{
    QUERY_NAME: {
        FIELD_NAME: {
            ARGUMENT: VALUE,
            ARGUMENT: VALUE,
            ...
        }
    }
}

Example: to find documents whose about field contains rock, use the following query

request:
curl -X POST -H "Content-Type:application/json" -d \
'{
    "query": {
        "match": {
            "about": "rock"
        }
    }
}' \
'http://localhost:9200/ssltd/employee/_search?pretty'

response:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.4589591,
    "hits" : [
      {
        "_index" : "ssltd",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.4589591,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ],
          "education" : {
            "degree" : "master",
            "major" : "es"
          }
        }
      },
      ...
    ]
  }
}

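Multiple simple clauses can be combined into a compound query clause, and compound clauses can be nested inside each other to express complex query logic.
Example: the bool keyword combines other clauses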

curl -L -X POST -H "Content-Type:application/json" -d \
'{
  "query": {
    "bool": {
        "must": {"match": {"last_name": "Smith"}},
        "must_not": {"match": {"about": "build"}},
        "should": {"match": {"education.degree": "master"}}
    }
  }
}' \
'http://localhost:9200/ssltd/employee/_search?pretty'
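
Because clauses are plain JSON objects, nesting them is easy to script. The sketch below composes the same bool query from small helper functions; `match` and `bool_query` are hypothetical helpers, not part of any ES client library. Each of must, must_not, and should is passed as a list of clauses here.

```python
import json


def match(field, value):
    """Build a simple match clause for one field."""
    return {"match": {field: value}}


def bool_query(must=None, must_not=None, should=None):
    """Wrap lists of clauses in a bool query, omitting empty sections."""
    clauses = {}
    if must:
        clauses["must"] = must
    if must_not:
        clauses["must_not"] = must_not
    if should:
        clauses["should"] = should
    return {"query": {"bool": clauses}}


body = bool_query(
    must=[match("last_name", "Smith")],
    must_not=[match("about", "build")],
    should=[match("education.degree", "master")],
)
print(json.dumps(body, indent=2))
```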

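Filter Keywords

Elasticsearch provides a rich set of query and filter clauses.

1. term filter
term is used for exact matching of values such as numbers, dates, booleans, or not_analyzed strings (text that has not been run through an analyzer).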

{
    "term": {"age": 32}
}

{
    "term": {"education.major": "cs"}
}

2. terms filter
terms is similar to term, but it accepts multiple values. A document matches if the field contains any of the specified values.

{
    "terms": {"interests": ["music", "sports"]}
}
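
The matching rule for term and terms can be illustrated with a small in-memory check. This is a simplified sketch that compares raw values exactly and ignores analysis entirely; `get_field` and `matches_terms` are hypothetical helpers, not ES APIs.

```python
def get_field(doc, path):
    """Resolve a dotted path such as 'education.major' in a nested dict."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def matches_terms(doc, field, wanted):
    """True if the field holds at least one of the wanted values."""
    value = get_field(doc, field)
    values = value if isinstance(value, list) else [value]
    return any(v in wanted for v in values)


john = {"age": 25, "interests": ["sports", "music"], "education": {"major": "es"}}
lucy = {"age": 24, "interests": ["music"], "education": {"major": "cs"}}

print(matches_terms(john, "interests", ["music", "sports"]))
print(matches_terms(lucy, "education.major", ["cs"]))
print(matches_terms(john, "age", [32]))
```

A term filter behaves like terms with a single value in the list.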

3. range filter
The range filter finds documents whose field value falls within a specified range.

{
    "range": {
        "age": {
            "gte": 20,
            "lt": 30
        }
    }
}

The range operators are: gt (greater than), gte (greater than or equal to), lt (less than), and lte (less than or equal to).
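
The four operators map directly onto comparison operators, and a value is in range only when every bound holds. A minimal sketch of that logic, where `in_range` is a hypothetical helper operating on a plain Python value rather than on an index:

```python
import operator

# Map each range keyword to the comparison it stands for.
_OPS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def in_range(value, bounds):
    """True if value satisfies every bound, e.g. {"gte": 20, "lt": 30}."""
    return all(_OPS[name](value, limit) for name, limit in bounds.items())


bounds = {"gte": 20, "lt": 30}
for age in (19, 20, 25, 30):
    print(age, in_range(age, bounds))
```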

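4. exists and missing filters
The exists filter finds documents in which a field has a value other than null or []. The missing filter finds documents in which a field is null or [], similar to the IS NULL condition in SQL (the missing filter was removed in ES 5.0; in later versions, put exists inside a bool must_not clause instead). Note: an empty string "" counts as a value, so exists evaluates to true for it.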

{
    "exists": {
        "field": "about"
    }
}

Example:

curl -L -X POST -H "Content-Type:application/json" -d \
'{
  "query":{
    "exists": {
      "field": "education"
    }
  }
}' \
'http://localhost:9200/ssltd/employee/_search'
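
The rule for what counts as existing can be written down in a few lines. This sketch only covers top-level fields of a plain dict and is not how ES evaluates the filter internally; `field_exists` is a hypothetical helper.

```python
def field_exists(doc, field):
    """True unless the field is absent, null (None), or an empty list."""
    value = doc.get(field)
    return value is not None and value != []


print(field_exists({"about": "I love to go rock climbing"}, "about"))
print(field_exists({"about": None}, "about"))
print(field_exists({"about": []}, "about"))
print(field_exists({"about": ""}, "about"))  # empty string still counts as a value
```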

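5. bool filter
The bool filter combines multiple filter conditions using the following keywords:

must: equivalent to and; all of the conditions must match
must_not: equivalent to not; none of the conditions may match
should: equivalent to or; at least one of the conditions must match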

Query Keywords

1. match_all query
match_all matches every document in the ES cluster. It is the default when no query is specified.

{
    "match_all": {}
}

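2. match query
match is the standard query, used for both full-text search and exact-value queries.
For full-text search, the query string is run through an analyzer before the search, producing the terms to match.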

{
    "match": {
        "about": "rock"
    }
}
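
To give a feel for what analysis does, the sketch below uses a deliberately crude analyzer: lowercase the text and split it into runs of letters and digits. This is only a stand-in for the analyzers ES ships with, and it assumes the default behavior where a document matches if any query term appears in the field. `analyze` and `match_text` are hypothetical helpers.

```python
import re


def analyze(text):
    """Crude stand-in analyzer: lowercase, then split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_text(doc, field, query):
    """True if any analyzed query term appears among the field's tokens."""
    field_tokens = set(analyze(str(doc.get(field, ""))))
    return any(token in field_tokens for token in analyze(query))


john = {"about": "I love to go rock climbing"}
print(analyze(john["about"]))
print(match_text(john, "about", "Rock"))
print(match_text(john, "about", "build"))
```

Because both the field and the query string go through the same analyzer, a query for Rock still finds a document containing rock.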

If match is given an exact value, such as a number, a date, a boolean, or a not_analyzed string, it searches for exactly that value.

{
    "match": {
        "age": 29
    }
}

3. multi_match query
multi_match runs a match query against multiple fields at once.

{
    "multi_match": {
        "query": "rock",
        "fields": ["interests", "about"]
    }
}

Example:

curl -L -X POST -H "Content-Type:application/json" -d \
'{
  "query": {
    "multi_match": {
      "query": "rock",
      "fields": ["interests", "about"]
    }
  }
}' \
'http://localhost:9200/ssltd/employee/_search'

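4. bool query
The bool query is similar to the bool filter and combines multiple query clauses. The difference is that a bool filter gives a direct yes-or-no answer for each document, while a bool query computes a _score (relevance score) from each of its query clauses.

must: clauses that a document must match to be included
must_not: clauses that a document must not match to be included
should: clauses that are optional; a document that matches them gets a higher relevance score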

Note: if a bool query has no must clause, at least one should clause must match. If it does have a must clause, the should clauses are optional.
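
The matching rules for must, must_not, and should, including the note above, can be sketched as an in-memory check. Here each clause is modeled as a plain Python predicate, and relevance scoring is left out entirely. `bool_matches` is a hypothetical helper, not an ES API.

```python
def bool_matches(doc, must=(), must_not=(), should=()):
    """Decide whether doc satisfies a bool clause built from predicates."""
    if not all(pred(doc) for pred in must):
        return False
    if any(pred(doc) for pred in must_not):
        return False
    # Without a must clause, at least one should clause has to match.
    if should and not must:
        return any(pred(doc) for pred in should)
    return True


john = {"last_name": "Smith", "about": "I love to go rock climbing", "degree": "master"}
lucy = {"last_name": "Smith", "about": "I like to collect rock albums", "degree": "bachelor"}


def is_smith(doc):
    return doc["last_name"] == "Smith"


def mentions_build(doc):
    return "build" in doc["about"]


def has_master(doc):
    return doc["degree"] == "master"


# With a must clause, should is optional: both documents match.
print(bool_matches(john, must=[is_smith], must_not=[mentions_build], should=[has_master]))
print(bool_matches(lucy, must=[is_smith], must_not=[mentions_build], should=[has_master]))
# With only a should clause, it becomes required: only John matches.
print(bool_matches(john, should=[has_master]))
print(bool_matches(lucy, should=[has_master]))
```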

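Combining Queries and Filters

Query clauses and filter clauses are placed in their own contexts. Parameters named query or filter each take a single query clause or a single filter clause respectively; in other words, they establish a query context or a filter context for the clauses inside them.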
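
One way to combine the two contexts in a single request is a bool query with both a must section and a filter section. The sketch below assumes ES 2.0 or later, where bool accepts a filter clause; the field names come from the employee documents used throughout this article.

```python
import json

body = {
    "query": {
        "bool": {
            # Query context: contributes to the relevance _score.
            "must": [{"match": {"about": "rock"}}],
            # Filter context: a yes-or-no check that does not affect _score.
            "filter": [{"range": {"age": {"gte": 20, "lt": 30}}}],
        }
    }
}

payload = json.dumps(body)
print(payload)
```

The resulting payload can be sent as the request body of a _search call, in the same way as the curl examples above.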
