ElasticSearch 7.x: Fixing the TooManyBucketsException Problem

ElasticSearch 7.x reports the following error: Caused by: org.elasticsearch.search.aggregations.MultiBucketConsumerService$TooManyBucketsException: Trying to create too many buckets. Must be less than or equal to: [10000] but was [10314]. This limit can be set by changing the [search.max_buckets] cluster level setting.

Analysis: this is a safeguard introduced in 6.x and later versions that limits very large aggregations in order to avoid performance risks.

Solution 1: increase ElasticSearch's search.max_buckets limit

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent": { "search.max_buckets": 50000 }}'

Solution 2: put limits on the time interval / document count to reduce the number of buckets

To reduce the bucket count, either raise the min time interval at the datasource or panel level, or set the min doc count on the date histogram to 1.

Root cause

The Elasticsearch documentation explains buckets as follows:

the buckets effectively define document sets. In addition to the buckets themselves, the bucket aggregations also compute and return the number of documents that “fell into” each bucket.

Simply put, a bucket is a set of documents. My understanding is that a query result set contains as many buckets as there are distinct groups of data in it. The example below illustrates what I understand a bucket to be (my understanding may not be entirely correct; corrections are welcome).

Suppose I have the following index, where type can only take one of these 5 values: query[A], query[AAAA], forwarded, reply, cached

@timestamp type
Jun 23, 2020 @ 19:32:45.000 query[AAAA]
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 cached
Jun 23, 2020 @ 19:32:45.000 cached
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 query[A]
Jun 23, 2020 @ 19:32:45.000 cached
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 cached
Jun 23, 2020 @ 19:32:45.000 config
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 forwarded
Jun 23, 2020 @ 19:32:45.000 cached
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 cached

Assume the query covers the last 15 minutes.

If "Min time interval" is set to 1s, there are 15*(60/1)=900 time slots, and each slot contains 5 different buckets, so 900*5=4500 buckets are produced.

If "Min time interval" is set to 30s, there are 15*(60/30)=30 time slots, and each slot contains 5 different buckets, so only 30*5=150 buckets are produced.

This explains nicely why increasing the Min time interval makes the problem go away.
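
The same effect can be reproduced directly with an aggregation request. Below is a minimal sketch (the index name dns-log-* and the fields @timestamp / type.keyword are assumptions for illustration): a 30s fixed_interval over a 15-minute window yields roughly 30 time buckets, and the terms sub-aggregation multiplies that by the number of distinct type values.

curl -X GET "localhost:9200/dns-log-*/_search?pretty" -H 'Content-Type: application/json' -d'{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-15m" } } },
  "aggs": {
    "per_interval": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "30s",
        "min_doc_count": 1
      },
      "aggs": {
        "per_type": { "terms": { "field": "type.keyword" } }
      }
    }
  }
}'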

References:
Increasing max_buckets for specific Visualizations
ElasticSearch search_phase_execution_exception
ElasticSearch 7.x too_many_buckets_exception #17327

ElasticSearch: Fixing UNASSIGNED SHARDS

How to resolve UNASSIGNED SHARDS in ElasticSearch.

First, check how many unassigned shards there are in the cluster, and whether the shards are evenly distributed.

$ curl -XGET "172.18.192.100:9200/_cat/allocation?v"
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  2120        2.8tb     6.2tb     11.6tb     17.9tb           35 172.18.192.101 172.18.192.101 it-elk-node3
  3520        5.8tb     5.9tb       12tb     17.9tb           33 172.18.192.102 172.18.192.102 it-elk-node4
   764          1tb       2tb      9.3tb     11.3tb           17 172.18.192.100 172.18.192.100 it-elk-node2
  1707                                                                                         UNASSIGNED

Normally ES automatically allocates unassigned shards across the nodes. Use the following command to confirm that automatic shard allocation is enabled:

$ curl -XGET http://172.18.192.100:9200/_cluster/settings?pretty
{
  "persistent" : {
    "cluster" : {
      "max_shards_per_node" : "20000"    # 一个node可以拥有最大20000个shards
    },
    "xpack" : {
      "monitoring" : {
        "collection" : {
          "enabled" : "true"
        }
      }
    }
  },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "all"    # 只要cluster.routing.allocation.enable是all的状态, ES就会自动分配shards
        }
      }
    }
  }
}

If automatic shard allocation is not enabled, use a command like the one sketched below to turn it back on.
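
A minimal sketch, reusing the host from the example above (whether to put the setting under persistent or transient is up to you):

curl -X PUT "172.18.192.100:9200/_cluster/settings" -H 'Content-Type: application/json' -d'{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'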

ElasticSearch reports too many open files

ElasticSearch reports too many open files; how do you analyze and pinpoint the cause?

$ curl -XGET "172.18.192.100:9200/_nodes/stats/process?pretty"
{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "it-elk",
  "nodes" : {
    "rBm53XWOTk-2v3MHPa2FDA" : {
      "timestamp" : 1589854287039,
      "name" : "it-elk-node3",
      "transport_address" : "172.18.192.101:9300",
      "host" : "172.18.192.101",
      "ip" : "172.18.192.101:9300",
      "roles" : [
        "ingest",
        "master",
        "data"
      ],
      "attributes" : {
        "ml.machine_memory" : "134778376192",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "process" : {
        "timestamp" : 1589854286789,
        "open_file_descriptors" : 59595,    # 当前打开的文件
        "max_file_descriptors" : 65535,     # 系统允许打开的最大文件
        "cpu" : {
          "percent" : 3,
          "total_in_millis" : 86105320
        },
        "mem" : {
          "total_virtual_in_bytes" : 1669361537024
        }
      }
    }

Of course, you can also check the current limits at the OS level:

$ ps -ef | grep elasticsearch    # find the PID of the process
elastic+ 128967      1 99 5月18 ?       1-13:22:07 /usr/share/elasticsearch/jdk/bin/java -Xms32g -Xmx32g -XX:+UseConcMarkSweepGC

$ cat /proc/128967/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             4096                 4096                 processes
Max open files            65535                65535                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       514069               514069               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
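
If the limit needs to be raised, the steps depend on how Elasticsearch was installed. The sketch below assumes a systemd-managed package install; the override path is standard systemd practice and 131072 is just an illustrative value:

sudo mkdir -p /etc/systemd/system/elasticsearch.service.d
sudo tee /etc/systemd/system/elasticsearch.service.d/override.conf <<'EOF'
[Service]
LimitNOFILE=131072
EOF
sudo systemctl daemon-reload
sudo systemctl restart elasticsearch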

References:
https://www.elastic.co/guide/en/elasticsearch/guide/master/_file_descriptors_and_mmap.html
ElasticSearch: Unassigned Shards, how to fix?

The Lucene query syntax used by Kibana

Kibana uses the Lucene query syntax; it works not only in Kibana but also in Grafana.

Below is a brief introduction to how to use it.

Full-text search

Typing login in the search bar returns all documents in which any field value contains login.
Wrap a phrase in double quotes to search for it as a whole:

"like Gecko"

Fields

You can also search by the fields shown on the left side of the page:

field:value      # full-text search limited to one field
field:"value"    # exact match: wrap the keyword in double quotes
http_code:404    # documents whose HTTP status code is 404

Whether a field exists

_exists_:http_host    # results must contain the http_host field
_missing_:http_host   # results must not contain the http_host field (on newer versions use NOT _exists_:http_host)
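
A few more commonly used operators, for reference (the field names are only examples; substitute your own):

http_code:404 AND client_ip:"10.0.0.1"    # boolean operators must be uppercase: AND / OR / NOT
http_code:(403 OR 404)                    # several values for one field
NOT http_code:200                         # exclude documents
http_code:[400 TO 499]                    # inclusive range
bytes:>1024                               # one-sided range
http_host:example*                        # wildcard match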


Common Elasticsearch commands

1, Index

curl -X GET 'localhost:9200/_cat/indices?v'                          # list all indexes

curl -X GET "localhost:9200/_cat/indices/INDEX_PATTERN-*?v&s=index"  # list indexes matching a name pattern

curl -X GET "localhost:9200/_cat/indices?v&s=docs.count:desc"        # list indexes sorted by document count (descending)

curl -XDELETE "localhost:9200/INDEX_NAME"                            # delete an index

curl -X GET "localhost:9200/INDEX_NAME/_count"                       # count the documents in an index

curl -XGET "127.0.0.1:9200/_all/_settings?pretty=true"               # show the settings of all indexes (lists every index, so the output can be very long)

curl -XGET "127.0.0.1:9200/office_dns_log-*/_settings?pretty=true"   # show the settings of certain indexes

curl -XGET "127.0.0.1:9200/office_dns_log-*/_settings/index.number_*?pretty=true" # show the shard and replica counts of certain indexes

# change the number of replicas of an index
curl -XPUT "localhost:9200/INDEX_NAME/_settings?pretty" -H 'Content-Type: application/json' -d' { "number_of_replicas": 2 }'

# show the mapping of an index
curl -XGET "127.0.0.1:9200/INDEX_NAME/_mapping?include_type_name=true&pretty=true"

Main reference:
Get index settings API

2, Node

(This section is largely based on this article.)

curl -X GET "localhost:9200/_cat/nodes?v"    # 查看各节点内存使用状态及负载情况
ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.18.192.101           37          80   4    0.41    0.49     0.52 dim       -      it-elk-node3
172.18.192.102           69          90   4    1.19    1.53     1.56 dim       *      it-elk-node4
172.18.192.100           36         100   1    2.91    2.60     2.86 dim       -      it-elk-node2

curl -X GET "localhost:9200/_nodes/stats"
curl -X GET "localhost:9200/_nodes/nodeId1,nodeId2/stats"

# return just indices
curl -X GET "localhost:9200/_nodes/stats/indices"
# return just os and process
curl -X GET "localhost:9200/_nodes/stats/os,process"
# return just process for node with IP address 10.0.0.1
curl -X GET "localhost:9200/_nodes/10.0.0.1/stats/process"

# return just process
curl -X GET "localhost:9200/_nodes/process"
# same as above
curl -X GET "localhost:9200/_nodes/_all/process"
# return just jvm and process of only nodeId1 and nodeId2
curl -X GET "localhost:9200/_nodes/nodeId1,nodeId2/jvm,process"
# same as above
curl -X GET "localhost:9200/_nodes/nodeId1,nodeId2/info/jvm,process"
# return all the information of only nodeId1 and nodeId2
curl -X GET "localhost:9200/_nodes/nodeId1,nodeId2/_all"


# Fielddata summarised by node
curl -X GET "localhost:9200/_nodes/stats/indices/fielddata?fields=field1,field2"
# Fielddata summarised by node and index
curl -X GET "localhost:9200/_nodes/stats/indices/fielddata?level=indices&fields=field1,field2"
# Fielddata summarised by node, index, and shard
curl -X GET "localhost:9200/_nodes/stats/indices/fielddata?level=shards&fields=field1,field2"
# You can use wildcards for field names
curl -X GET "localhost:9200/_nodes/stats/indices/fielddata?fields=field*"

3, segment

# list the segments of all indexes (note: with many indexes this list can be very long)
curl -u elastic:HMEaQXtLiJaD4zn1ZxzM -X GET "127.0.0.1:9200/_cat/segments?v"

# list the segments of a specific index
curl -u elastic:HMEaQXtLiJaD4zn1ZxzM -X GET "127.0.0.1:9200/_cat/segments/INDEX_PATTERN-*?v"

4, Templates

A template defines per-index settings, the type of each field, and so on (it only applies to indexes created in the future, not to indexes that already exist).

# list all templates
curl -X GET "127.0.0.1:9200/_cat/templates?v&s=name"

# show a single template
curl -X GET "127.0.0.1:9200/_template/template_1?pretty=true"

# define a template for an index pattern (only applies to indexes created later)
curl -XPUT 127.0.0.1:9200/_template/template_1 -H 'Content-Type: application/json' -d'{
  "index_patterns": ["office_dns*"],
  "settings" : {
    "index.refresh_interval": "30s",
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.translog.durability": "request",
    "index.translog.sync_interval": "30s"
  },
  "order" : 1
}'

# Notes:
index.refresh_interval: the index refresh interval, default 1s, i.e. how long it takes before written data becomes searchable in ES.
number_of_shards: the number of primary shards (important); a common suggestion is to set it to the number of nodes. If a single index grows beyond about 30G, queries become very slow, so spread the index out by increasing the shard count.
number_of_replicas: the number of replica copies.
index.translog.durability: how translog data (index/update/delete operations) is persisted to disk; request is the default.
index.translog.sync_interval: how often the translog is committed, default 5s.
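
Since a template only affects indexes created afterwards, dynamic settings such as refresh_interval can be changed on an existing index through the _settings API instead; a sketch (INDEX_NAME is a placeholder):

curl -XPUT "127.0.0.1:9200/INDEX_NAME/_settings" -H 'Content-Type: application/json' -d'{
  "index": { "refresh_interval": "30s" }
}'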

Reference: https://blog.csdn.net/u014646662/article/details/99293604

5, License

# view the current license
curl -XGET 'http://127.0.0.1:9200/_license'

# delete the license
curl -X DELETE "localhost:9200/_license"

# import a license (here the local license file is aaa.json); if authentication is enabled, add the credentials, e.g. -u elastic:password
curl -XPUT 'http://127.0.0.1:9200/_xpack/license' -H "Content-Type: application/json" -d @aaa.json

6, Shard management

curl -XGET "127.0.0.1:9200/_cluster/settings?pretty"    # 查看集群最大分片数量
{
  "persistent" : {
    "cluster" : {
      "max_shards_per_node" : "30000"    # 单个节点能容纳30000个shards,默认值是1000
    }
    "xpack" : 
      "monitoring" : 
        "collection" : {
          "enabled" : "true"
        }
      }
    }
  }
}


curl -XGET "127.0.0.1:9200/_cluster/health?pretty"    # 查看当前shards使用数量
{
  "cluster_name" : "it-elk",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 7233,
  "active_shards" : 7248,
  "relocating_shards" : 0,
  "initializing_shards" : 12,
  "unassigned_shards" : 5323,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 2085,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 3167844,
  "active_shards_percent_as_number" : 57.601525868234916        #数据的正常率,100表示一切ok
}

# count the unassigned shards
curl -XGET "127.0.0.1:9200/_cat/shards?h=index,shard,prirep,state,unassigned.*&pretty" | grep UNASSIGNED | wc -l

# list unassigned shards together with the reason they are unassigned
curl -XGET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
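
To see why a particular shard stays unassigned, the allocation explain API is also useful; a sketch (INDEX_NAME is a placeholder, and sending an empty body makes ES explain the first unassigned shard it finds):

curl -XGET "127.0.0.1:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'{
  "index": "INDEX_NAME", "shard": 0, "primary": true
}'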

7, Template management

# list all templates
curl -X GET "127.0.0.1:9200/_cat/templates?v&s=name"

# show a single template
curl -X GET "127.0.0.1:9200/_template/TEMPLATE_NAME?pretty=true"

# create a template named mail so that future mail-w3svc1-* indexes get the following settings
curl -XPUT 127.0.0.1:9200/_template/mail -H 'Content-Type: application/json' -d'{
  "index_patterns": ["mail-w3svc1-*"],
  "settings" : {
    "index.refresh_interval": "30s",
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.translog.durability": "request",
    "index.translog.sync_interval": "30s"
  },
  "mappings": {
    "properties": {
      "rt": { "type": "integer" },
      "status": { "type": "integer" },
      "width": { "type": "float" }
      "uri": { "type": "text", "fielddata": true },
      "username": { "type": "text", "fielddata": true },
      "server_ip": { "type": "text", "fielddata": true },
      "client_ip": { "type": "text", "fielddata": true }
    }
  },
  "order" : 1
}'

# Notes:
1, index.refresh_interval: the index refresh interval, default 1s, i.e. how long it takes before written data becomes searchable in ES.
Refreshing is expensive, so for an initial bulk import it can be set to a longer value (such as 30s) and changed back to 5s or 10s afterwards.
2, number_of_shards: the number of primary shards (important), default 1.
If a single index grows beyond about 30G, queries become very slow; spread the index out by increasing the shard count.
3, number_of_replicas: the number of replica copies, default 1.
Note that an index with 5 shards and 1 replica ends up with 5*(1+1)=10 shard copies in total, which can drag down cluster performance.
Suggestion: number of nodes <= number of shards * (number of replicas + 1).
4, index.translog.durability: how translog data (index/update/delete operations) is persisted to disk.
5, index.translog.sync_interval: how often the translog is committed, default 5s.
6, mappings.properties: declares fields such as rt/status as integer types, and enables fielddata on username/server_ip etc. so that external tools (such as Grafana) can aggregate on them.
Common field types are listed at https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

# view the mail template defined above
curl -X GET "127.0.0.1:9200/_template/mail?pretty=true"

8, Document

# index a document (the index is created automatically if it does not exist)
curl -XPOST "127.0.0.1:9200/INDEX_NAME/_doc/1" -H "Content-Type: application/json" -d'{"name": "zhu kun"}'
 
# get a document by id
curl -XGET "127.0.0.1:9200/INDEX_NAME/_doc/1?pretty=true"
 
# search for documents
curl -XGET "127.0.0.1:9200/INDEX_NAME/_search?q=name:zhu&pretty=true"

9, Special notes

In general, a text field can only be searched and sorted in Kibana. If you need to aggregate on it (sort, count, summarize, etc.) from external tools such as Grafana, you must set fielddata to true, otherwise you may get an error like the following (see this page in the official documentation):

Fielddata is disabled on text fields by default. Set `fielddata=true` on [`your_field_name`] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.
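
If a text field in an existing index needs to be aggregated on, fielddata can be enabled with a mapping update; a sketch (INDEX_NAME and the field name username are placeholders):

curl -XPUT "127.0.0.1:9200/INDEX_NAME/_mapping" -H 'Content-Type: application/json' -d'{
  "properties": {
    "username": { "type": "text", "fielddata": true }
  }
}'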

References:
https://www.datadoghq.com/blog/elasticsearch-unassigned-shards/#reason-2-too-many-shards-not-enough-nodes