ElasticSearch

cooscao, May 29, 2019

Getting Started with Elasticsearch

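Elasticsearch is a real-time, distributed search and analytics engine that makes it possible to work with large volumes of data quickly. It is used for full-text search, structured search, analytics, and any combination of the three. Elasticsearch is an open-source search engine built on Apache Lucene(TM). Lucene is arguably the most advanced, best-performing, and most fully featured search engine library available, open source or proprietary.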

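Installing and Running Elasticsearch

Elasticsearch requires Java, so install that first; version 1.8 from the official site is sufficient.

Then download the latest version of Elasticsearch from the official Elasticsearch site: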

curl -L -O http://download.elasticsearch.org/PATH/TO/VERSION.zip
unzip elasticsearch-$VERSION.zip
cd elasticsearch-$VERSION

Recent versions of Elasticsearch should not be run as root (they refuse to start), so if you are running it inside Docker, create a new user first:
```
sudo adduser elasticsearch
su elasticsearch
```
After switching to the new user, start Elasticsearch:
```
./bin/elasticsearch
```

Check that it started successfully:

curl 'http://localhost:9200/?pretty'

A response like the following means Elasticsearch is up and running:

{
"status": 200,
"name": "Shrunken Bones",
"version": {
"number": "1.4.0",
"lucene_version": "4.10"
},
"tagline": "You Know, for Search"
}
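The same check can be scripted. The following is a minimal sketch using only the Python standard library. The helper names `fetch_info` and `version_number` are made up for illustration, and the URL assumes a node listening on the default local port.

```python
import json
from urllib.request import urlopen


def fetch_info(url="http://localhost:9200/"):
    """Return the parsed JSON body of the node's root endpoint."""
    with urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def version_number(info):
    """Extract the version string from a root-endpoint response."""
    return info["version"]["number"]
```

With a node running, `print(version_number(fetch_info()))` would print the version reported in the response.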

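Creating an Index

Elasticsearch is document oriented, which means it stores entire objects, or documents. It does more than store them, however: it also indexes the content of each document so that it can be searched. In Elasticsearch you index, search, sort, and filter documents rather than rows and columns of data. This is a fundamentally different way of thinking about data, and it is one of the reasons Elasticsearch can perform complex full-text search. The concepts in Elasticsearch map onto those of a traditional relational database as follows: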

Relational DB -> Databases -> Tables -> Rows -> Columns
Elasticsearch -> Indices -> Types -> Documents -> Fields

An Elasticsearch cluster can contain multiple indices (databases); each index can contain multiple types (tables); each type holds multiple documents (rows); and each document has multiple fields (columns).
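One practical consequence of this hierarchy is that every document is addressed by its index, type, and id in the REST API. The small sketch below only builds that path; `doc_path` is a hypothetical helper, not part of any client library. Note that mapping types belong to the Elasticsearch versions this post targets; later major releases removed them.

```python
def doc_path(index, doc_type, doc_id):
    """Build the REST path that addresses one document."""
    return "/{}/{}/{}".format(index, doc_type, doc_id)


print(doc_path("customer", "external", 1))  # /customer/external/1
```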

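From bash, use curl to create an index and set the Chinese analyzer ik_smart on its fields (ik_smart is provided by the IK analysis plugin, which must be installed):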

curl -H'Content-Type: application/json' -XPUT 'localhost:9200/es_test' -d '
{
  "mappings": {
    "posts": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_smart",
          "search_analyzer": "ik_smart"
        },
        "desc": {
          "type": "text",
          "analyzer": "ik_smart",
          "search_analyzer": "ik_smart"
        },
        "answers":{
          "type": "text",
          "analyzer": "ik_smart",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }
}'

This creates an index named es_test with a type named posts.
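The mapping above repeats the same analyzer settings for every field. As a sketch, the same body can be generated in Python; `build_mapping` is a hypothetical helper introduced here for illustration.

```python
import json


def build_mapping(doc_type, fields, analyzer="ik_smart"):
    """Return an index-creation body giving every field the same analyzer."""
    field_spec = {
        "type": "text",
        "analyzer": analyzer,
        "search_analyzer": analyzer,
    }
    # dict(field_spec) makes a copy so the fields do not share one object.
    properties = {name: dict(field_spec) for name in fields}
    return {"mappings": {doc_type: {"properties": properties}}}


body = build_mapping("posts", ["title", "desc", "answers"])
print(json.dumps(body, indent=2))
```

The resulting dict has the same structure as the JSON sent with curl, so it could be passed as the request body when creating the index from Python.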

Common Elasticsearch Operations in Python

1. Bulk-importing data with bulk

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
es = Elasticsearch()
actions = []
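# all_json is the list of records to import
for lidx, sample in enumerate(all_json, start=1):
    action = {
        "_index": "es_test",
        "_type": "posts",
        "_source": {
            "title": sample['title'],
            "desc": sample['desc'],
            "answers": sample['content']
        }
    }
    actions.append(action)
    if lidx % 500 == 0:   # send one batch for every 500 documents
        res, _ = bulk(es, actions, index="es_test", raise_on_error=True)
        # print(res)
        actions = []
# send whatever is left over after the last full batch
if actions:
    bulk(es, actions, index="es_test", raise_on_error=True)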
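Manual batching can also be left to the helper itself: `bulk` accepts any iterable of actions together with a `chunk_size` argument. The sketch below assumes the same record layout as above (`title`, `desc`, `content`); `gen_actions` and `import_all` are illustrative names, not library functions.

```python
def gen_actions(docs, index="es_test", doc_type="posts"):
    """Yield one bulk action per input record."""
    for sample in docs:
        yield {
            "_index": index,
            "_type": doc_type,
            "_source": {
                "title": sample["title"],
                "desc": sample["desc"],
                "answers": sample["content"],
            },
        }


def import_all(es, docs, chunk_size=500):
    """Send all records through helpers.bulk, chunk_size actions per request."""
    # Imported here so gen_actions can be used without the client installed.
    from elasticsearch.helpers import bulk

    success, _ = bulk(es, gen_actions(docs), chunk_size=chunk_size)
    return success
```

Because the actions come from a generator, the whole data set never has to be held in a list, and the final partial batch is sent without any extra code.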

2. Checking that everything was imported

res = es.search(index="es_test", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total'])

####
Got 7001 Hits:
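The line `res['hits']['total']` returns a plain integer on the Elasticsearch version used in this post. Newer versions return an object of the form `{"value": ..., "relation": ...}` instead, so a small helper keeps the code working on both. `total_hits` is a hypothetical name used only for this sketch.

```python
def total_hits(res):
    """Return the hit count from a search response as an int."""
    total = res["hits"]["total"]
    # Newer Elasticsearch versions wrap the count in a dict.
    if isinstance(total, dict):
        return total["value"]
    return total
```

It can be dropped into the print statement above: `print("Got %d Hits:" % total_hits(res))`.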

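3. Writing and reading data

Write a single document to Elasticsearch. If no id is given, Elasticsearch generates one automatically; the example below sets id=1 explicitly. If the index does not exist yet, it is created automatically and its mapping is inferred from the fields of doc.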

from datetime import datetime
from elasticsearch import Elasticsearch
es = Elasticsearch()

doc = {
    'author': 'kimchy',
    'text': 'Elasticsearch: cool. bonsai cool.',
    'timestamp': datetime.now(),
}
# If the index does not exist, it is created automatically from the fields of doc
res = es.index(index="test-index", doc_type='tweet', id=1, body=doc)
print(res['result'])

### 
created

Read back the document just written. The client returns the JSON response as a Python dict:

res = es.get(index="test-index", doc_type='tweet', id=1)
print(res['_source'])

###
{'author': 'kimchy', 'text': 'Elasticsearch: cool. bonsai cool.', 'timestamp': '2019-04-04T06:33:37.442271'}

Searching this index now returns:

# Search
res = es.search(index="test-index", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total'])

### output
Got 1 Hits:

4. Querying

Insert three documents with the method shown above so there is something to query:

body1={
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
# The rest of the code defines the other two documents and writes all three
body2={
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}

body3={
    "first_name" :  "Douglas",
    "last_name" :   "Fir",
    "age" :         35,
    "about":        "I like to build cabinets",
    "interests":  [ "forestry" ]
}
res1 = es.index(index="test-index1", doc_type="employee", id=1, body=body1)
res2 = es.index(index="test-index1", doc_type="employee", id=2, body=body2)
res3 = es.index(index="test-index1", doc_type="employee", id=3, body=body3)

Search for the employees whose last name is Smith:

bb1 = {
    "query": {
        "match": {"last_name": "Smith"}
    }
}
rt1 = es.search(index="test-index1", body=bb1)
print(rt1)

The output contains the two matching employees. Both match the last name Smith equally well, so their scores are identical.

{'took': 48,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 2,
  'max_score': 0.2876821,
  'hits': [{'_index': 'test-index1',
    '_type': 'employee',
    '_id': '2',
    '_score': 0.2876821,
    '_source': {'first_name': 'Jane',
     'last_name': 'Smith',
     'age': 32,
     'about': 'I like to collect rock albums',
     'interests': ['music']}},
   {'_index': 'test-index1',
    '_type': 'employee',
    '_id': '1',
    '_score': 0.2876821,
    '_source': {'first_name': 'John',
     'last_name': 'Smith',
     'age': 25,
     'about': 'I love to go rock climbing',
     'interests': ['sports', 'music']}}]}}
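The raw response is verbose. A short sketch for pulling out just the id, score, and one source field from each hit follows; `summarize_hits` is an illustrative helper, not a client method.

```python
def summarize_hits(res, field):
    """Return (id, score, field value) for each hit, in ranked order."""
    return [
        (hit["_id"], hit["_score"], hit["_source"].get(field))
        for hit in res["hits"]["hits"]
    ]
```

Calling `summarize_hits(rt1, "first_name")` on the response above would give one tuple per employee, in the order Elasticsearch ranked them.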

Add an age restriction on top of the last-name query:

bb2 = {
    "query": {
        "bool":{
            "must": {"match" :{"last_name": "Smith"}},
            "filter":{"range":{"age": {"gt": 30}}}
        }
    }
}
rt2 = es.search(index="test-index1", body=bb2)
print(rt2)

Now the output contains only the employee who is older than 30:

{'took': 1019,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 1,
  'max_score': 0.2876821,
  'hits': [{'_index': 'test-index1',
    '_type': 'employee',
    '_id': '2',
    '_score': 0.2876821,
    '_source': {'first_name': 'Jane',
     'last_name': 'Smith',
     'age': 32,
     'about': 'I like to collect rock albums',
     'interests': ['music']}}]}}
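Jane's score here is the same 0.2876821 as in the unfiltered query: the `filter` clause only includes or excludes documents and does not contribute to the score. As a sketch, the query body can be produced by a small builder; `match_with_range` is a hypothetical helper name.

```python
def match_with_range(field, text, range_field, gt):
    """Build a bool query: a scored match plus an unscored range filter."""
    return {
        "query": {
            "bool": {
                "must": {"match": {field: text}},
                "filter": {"range": {range_field: {"gt": gt}}},
            }
        }
    }


bb2 = match_with_range("last_name", "Smith", "age", 30)
```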

Full-text search: something traditional databases struggle to do

## Full-text search
all_search = {
    "query":{
        "match":{
            "about":"rock climbing"
        }
    }
}
rt3 = es.search(index="test-index1", body=all_search)
print(rt3)

This returns two matching documents, each with its own relevance score for the search:

{'took': 15,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 2,
  'max_score': 0.5753642,
  'hits': [{'_index': 'test-index1',
    '_type': 'employee',
    '_id': '1',
    '_score': 0.5753642,
    '_source': {'first_name': 'John',
     'last_name': 'Smith',
     'age': 25,
     'about': 'I love to go rock climbing',
     'interests': ['sports', 'music']}},
   {'_index': 'test-index1',
    '_type': 'employee',
    '_id': '2',
    '_score': 0.2876821,
    '_source': {'first_name': 'Jane',
     'last_name': 'Smith',
     'age': 32,
     'about': 'I like to collect rock albums',
     'interests': ['music']}}]}}
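The first document scores higher because it contains both query words, while the second contains only one of them. The toy sketch below counts matched words to make that visible. It is a deliberate simplification for illustration and is not how Elasticsearch computes relevance scores.

```python
def matched_terms(query, text):
    """Count how many distinct query words occur in the text (case-insensitive)."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words)


docs = {
    "1": "I love to go rock climbing",
    "2": "I like to collect rock albums",
}
for doc_id, about in docs.items():
    print(doc_id, matched_terms("rock climbing", about))
```

Document 1 matches both "rock" and "climbing"; document 2 matches only "rock", which is consistent with its lower score in the response.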

Multi-field queries

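To score a document against several fields at once, use a multi_match query. Appending ^2 to a field name sets that field's boost, so matches in it carry more weight. A request body accepts only one top-level query clause, so two match clauses cannot simply be placed side by side; in a Python dict the second "match" key would silently overwrite the first.

query = "头痛了怎么办"  # "What should I do about a headache?"
all_search = {
    "query": {
        "multi_match": {
            "query": query,
            "fields": ["title^2", "desc"]  # title boost set to 2; the default is 1
        }
    }
}
rt = es.search(index="qa_test", body=all_search)

Another approach is the bool query, which follows a "more matches is better" strategy: the scores of all matching match clauses are added together to give each document's final score. A document that matches both clauses scores higher than one that matches only one.

query = "头痛了怎么办"  # "What should I do about a headache?"
all_search = {
    "query": {
        "bool": {
            "should": [
                # boost works here too: {"match": {"title": {"query": query, "boost": 2}}}
                {"match": {"title": query}},
                {"match": {"desc": query}}
            ]
        }
    }
}
rt = es.search(index="qa_test", body=all_search)

Both approaches query several fields, but they return different results: by default multi_match scores a document by its best-matching field, whereas bool sums the scores of the matching clauses. Other options include the dis_max query and the other multi_match types; see the official documentation for details.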
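To see why summing per-field scores (as bool with should does) can rank documents differently from keeping only the best field's score (as the dis_max query does), here is a toy calculation. The per-field scores are invented for illustration and are not real Elasticsearch output.

```python
def bool_should_score(field_scores):
    """bool/should style: add up the scores of all matching clauses."""
    return sum(field_scores.values())


def best_field_score(field_scores):
    """dis_max style: keep only the highest-scoring field."""
    return max(field_scores.values())


# Hypothetical per-field scores for two documents.
doc_a = {"title": 1.2, "desc": 0.0}  # strong match in one field only
doc_b = {"title": 0.7, "desc": 0.7}  # moderate match in both fields

for name, scores in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(name, bool_should_score(scores), best_field_score(scores))
```

Summing puts doc_b first because it matches in both fields, while taking the best field puts doc_a first because of its single strong match.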

Common curl Operations for Elasticsearch

1. Create an index

curl -XPUT 'localhost:9200/<index_name>?pretty'

2. List all indices

curl 'localhost:9200/_cat/indices?v'

3. Delete a specific index

curl -XDELETE 'http://localhost:9200/<index_name>'

4. Insert a document into an index

curl -H'Content-Type: application/json' -XPUT 'localhost:9200/customer/external/1?pretty' -d '
{
"name":"John Doe"
}'

5. Get a document by id

curl -XGET 'localhost:9200/customer/external/1?pretty'
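The curl commands above are plain HTTP calls, so they can be reproduced from Python without any client library. The sketch below uses only the standard library; `build_request` and `send` are hypothetical helpers, and `BASE_URL` assumes a node on the default local port.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:9200"  # assumed default local node


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the Elasticsearch REST API."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


def send(req):
    """Send the request and return the decoded JSON response."""
    with urlopen(req, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


put_doc = build_request("PUT", "/customer/external/1", {"name": "John Doe"})
get_doc = build_request("GET", "/customer/external/1")
```

With a node running, `send(put_doc)` followed by `send(get_doc)` corresponds to steps 4 and 5 above.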

BUG

How to fix "FORBIDDEN/12/index read-only"

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

[1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

sudo sysctl -w vm.max_map_count=262144
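Before starting Elasticsearch it can be handy to check the current value from a script. The following is a minimal sketch for Linux that reads the setting from /proc; the function names are illustrative, and the threshold is the one quoted in the error message above.

```python
REQUIRED_MAX_MAP_COUNT = 262144  # minimum named in the error message


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the current vm.max_map_count value on Linux."""
    with open(path) as f:
        return int(f.read().strip())


def is_high_enough(value, required=REQUIRED_MAX_MAP_COUNT):
    """Return True if the value meets the required minimum."""
    return value >= required
```

On a Linux host, `is_high_enough(read_max_map_count())` returns False while the limit is still at 65530 and True after the sysctl command has been applied.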