Press "Enter" to skip to content

月与灯依旧 Posts

ElasticSearch: resolving 429 es_rejected_execution_exception

The following message appears in the log:

[INFO ] 2020-06-04 11:04:45.093 [[main]>worker16] elasticsearch - retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of processing of [74000580][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[mail-w3svc1-2020.05.31][0]] containing [10] requests, target allocation id: 6OPCe3kOTWqG5k68lRozUA, primary term: 1 on EsThreadPoolExecutor[name = it_elk_node171/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@31aa695f[Running, pool size = 40, active threads = 40, queued tasks = 200, completed tasks = 31488344]]"})

Cause analysis:

To protect the cluster from overload, Elasticsearch limits the size of its request queues by design, which improves stability and reliability. Without such limits, a misbehaving or malicious client could easily bring the whole cluster down. So under heavy concurrent load, when traffic exceeds what a single Elasticsearch instance in the cluster can handle, the server triggers this protective mechanism, refuses to execute new requests, and throws an EsRejectedExecutionException.

This protection mechanism and the exception it raises are governed by the thread pools (and their accompanying queues) in the Elasticsearch implementation. In the log above, the thread pool Elasticsearch allocates for index operations has pool size = 40 and queue capacity = 200; when the 40 threads cannot keep up and more than 200 tasks are already buffered in the queue, new tasks are simply dropped and an EsRejectedExecutionException is thrown.

Elasticsearch logs this message at the INFO level, so it is not a serious problem; according to the official ElasticSearch documentation it sometimes does not even need attention. Still, there are a few effective things you can do to improve the situation.

Diagnosis and resolution

According to the official ElasticSearch documentation, there are only two ways to address the 429 error:

1. Pause the bulk write process for 3-5 seconds, then retry (see the retry sketch after this list).

2. Increase the settings of the write thread pool.
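
For approach 1, the retry can be as simple as backing off whenever the bulk endpoint answers 429. A minimal bash sketch (the host, the bulk.ndjson payload file and the retry count are assumptions for illustration, not from the original setup):

# Retry a bulk request with a short pause whenever ES rejects it with 429
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    -H 'Content-Type: application/x-ndjson' \
    -XPOST "127.0.0.1:9200/_bulk" --data-binary @bulk.ndjson)
  if [ "$status" != "429" ]; then
    echo "bulk request finished with HTTP $status"
    break
  fi
  echo "got 429 on attempt $attempt, backing off"
  sleep 5    # pause a few seconds as suggested above, then retry
done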

The current write thread pool settings can be retrieved with the following command:

curl -XGET "127.0.0.1:9200/_cat/thread_pool/write?v&h=node_name,ip,name,active,queue,rejected,completed,core,type,pool_size,queue_size"
node_name      ip           name  active queue rejected completed core type  pool_size queue_size
it_elk_node169 172.29.4.169 write      0     0     1174  22801844      fixed        40        200
it_elk_node168 172.29.4.168 write      0     0     1811  32356307      fixed        40        200
it_elk_node167 172.29.4.167 write     30     0    11797  39593375      fixed        40        200
it_elk_node171 172.29.4.171 write      0     0    12304  28194291      fixed        40        200
it_elk_node170 172.29.4.170 write      1     0      743  25107047      fixed        40        200

Here, active and queue are the threads and queue slots currently in use, while pool_size and queue_size are the configured thread count and queue length.
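
To see whether rejections keep growing under load, the same _cat endpoint can simply be polled, for example:

# Poll the write pool every 10 seconds and watch the rejected counter
watch -n 10 'curl -s "127.0.0.1:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected"'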

Since ElasticSearch 5, thread pool settings can no longer be adjusted through the cluster API (official note):

Thread pool settings are now node-level settings. As such, it is not possible to update thread pool settings via the cluster settings API.

Solution

size is the number of threads; it cannot exceed the number of CPU cores.

vim elasticsearch.yml    # add the following content
thread_pool:
    write:
        size: 30
        queue_size: 1000

Then restart ES. After the restart, check again and you will see that pool_size and queue_size have changed:

curl -X GET "127.0.0.1:9200/_cat/thread_pool/write?v&h=node_name,ip,name,active,queue,rejected,completed,core,type,pool_size,queue_size"
node_name      ip           name  active queue rejected completed core type  pool_size queue_size
it_elk_node167 172.29.4.167 write      0     0        0   1305481      fixed        40       3000
it_elk_node169 172.29.4.169 write      0     0        0   4266653      fixed        40        200
it_elk_node168 172.29.4.168 write      0     0     7280   7097169      fixed        40        200
it_elk_node170 172.29.4.170 write      0     0        0   2271912      fixed        40       3000
it_elk_node171 172.29.4.171 write      0     0        0   4377790      fixed        40       3000

Likewise, similar commands show the pool_size/queue_size of the get, ccr (cross-cluster replication) and search pools.

# View the full thread_pool details of every node
curl -X GET "127.0.0.1:9200/_nodes/thread_pool?pretty=true"


# View the search pool_size/queue_size
curl -X GET "127.0.0.1:9200/_cat/thread_pool/search?v&h=node_name,ip,name,active,queue,rejected,completed,core,type,pool_size,queue_size"
node_name      ip           name   active queue rejected completed core type                  pool_size queue_size
it_elk_node167 172.29.4.167 search      0     0        0      8418      fixed_auto_queue_size        61       1000
it_elk_node169 172.29.4.169 search      0     0        0     64923      fixed_auto_queue_size        61       1000
it_elk_node168 172.29.4.168 search      0     0        0    111409      fixed_auto_queue_size        61       1000
it_elk_node170 172.29.4.170 search      0     0        0     50890      fixed_auto_queue_size        61       1000
it_elk_node171 172.29.4.171 search      0     0        0     38820      fixed_auto_queue_size        61       1000

# View the get pool_size/queue_size
curl -X GET "127.0.0.1:9200/_cat/thread_pool/get?v&h=node_name,ip,name,active,queue,rejected,completed,core,type,pool_size,queue_size"
node_name      ip           name active queue rejected completed core type  pool_size queue_size
it_elk_node167 172.29.4.167 get       0     0        0       819      fixed        40       1000
it_elk_node169 172.29.4.169 get       0     0        0       331      fixed        40       1000
it_elk_node168 172.29.4.168 get       0     0        0      4756      fixed        40       1000
it_elk_node170 172.29.4.170 get       0     0        0         9      fixed         9       1000
it_elk_node171 172.29.4.171 get       0     0        0      6193      fixed        40       1000


# View the ccr (cross-cluster replication) pool_size/queue_size
curl -X GET "127.0.0.1:9200/_cat/thread_pool/ccr?v&h=node_name,ip,name,active,queue,rejected,completed,core,type,pool_size,queue_size"


# View all thread pools
curl -XGET "127.0.0.1:9200/_cat/thread_pool?v&h=id,ip,name,queue,rejected,completed"
id                     ip           name                queue rejected completed
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 analyze                 0        0         0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 ccr                     0        0    140182
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 fetch_shard_started     0        0         0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 fetch_shard_store       0        0        60
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 flush                   0        0      8700
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 force_merge             0        0         0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 generic                 0        0   1267594
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 get                     0        0    158496
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 listener                0        0     16736
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 management              0        0   7373132
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 refresh                 0        0  15758014
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 rollup_indexing         0        0         0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 search                  0        0   1474802
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 search_throttled        0        0         0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 security-token-key      0        0         0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 snapshot                0        0         2
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 transform_indexing      0        0         0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 warmer                  0        0    749996
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 watcher                 0        0     23390
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 write                   0        0  34229769
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 analyze                 0        0         0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 ccr                     0        0     70051
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 fetch_shard_started     0        0         0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 fetch_shard_store       0        0        34
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 flush                   0        0      8912
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 force_merge             0        0         0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 generic                 0        0   1688212
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 get                     0        0     95389
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 listener                0        0     11281
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 management              0        0  11715898
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 refresh                 0        0  15957993
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 rollup_indexing         0        0         0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 search                  0        0    884079
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 search_throttled        0        0         0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 security-token-key      0        0         0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 snapshot                0        0         0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 transform_indexing      0        0         0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 warmer                  0        0    482303
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 watcher                 0        0     46692
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 write                   0        0  50279093
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 analyze                 0        0         0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 ccr                     0        0     70124
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 fetch_shard_started     0        0         0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 fetch_shard_store       0        0        50
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 flush                   0        0     14722
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 force_merge             0        0         0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 generic                 0        0   1473305
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 get                     0        0    113200
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 listener                0        0     12675
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 management              0        0  27809756
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 refresh                 0        0  12937481
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 rollup_indexing         0        0         0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 search                  0        0    987998
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 search_throttled        0        0         0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 security-token-key      0        0         0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 snapshot                0        0         0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 transform_indexing      0        0         0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 warmer                  0        0    670607
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 watcher                 0        0         0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 write                   0        0  69187927

 


ElasticSearch 7.x: resolving the TooManyBucketsException problem

ElasticSearch 7.x reports the following error: Caused by: org.elasticsearch.search.aggregations.MultiBucketConsumerService$TooManyBucketsException: Trying to create too many buckets. Must be less than or equal to: [10000] but was [10314]. This limit can be set by changing the [search.max_buckets] cluster level setting.

Analysis: this is a feature of versions 6.x and later, intended to limit very large aggregations and avoid the performance risk they carry.

Solution 1: raise ElasticSearch's search.max_buckets limit

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent": { "search.max_buckets": 50000 }}'

Solution 2: constrain the time interval / document count so that fewer buckets are produced

To mitigate this, either raise the minimum time interval at the datasource or panel level, or set the min doc count on the date histogram to 1.
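
Applied directly to a query, the same idea looks like this: a coarser fixed_interval plus min_doc_count keeps the bucket count down. A sketch (the index name and fields below are assumptions for illustration):

curl -X GET "localhost:9200/my-index-*/_search?pretty" -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "aggs": {
    "per_interval": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "30s",
        "min_doc_count": 1
      },
      "aggs": {
        "by_type": { "terms": { "field": "type.keyword" } }
      }
    }
  }
}'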

Cause analysis

The Elasticsearch documentation explains buckets as follows:

the buckets effectively define document sets. In addition to the buckets themselves, the bucket aggregations also compute and return the number of documents that “fell into” each bucket.

Simply put, a bucket is a set of documents. My understanding is that a query's result set contains one bucket for every distinct group of data in it. Below I use an example to illustrate what I understand a bucket to be (my understanding may not be entirely correct; corrections are welcome).

Suppose I have the following index, where the type field can only take one of these 5 values: query[A], query[AAAA], forwarded, reply, cached

@timestamp type
Jun 23, 2020 @ 19:32:45.000 query[AAAA]
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 cached
Jun 23, 2020 @ 19:32:45.000 cached
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 query[A]
Jun 23, 2020 @ 19:32:45.000 cached
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 cached
Jun 23, 2020 @ 19:32:45.000 config
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 forwarded
Jun 23, 2020 @ 19:32:45.000 cached
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 reply
Jun 23, 2020 @ 19:32:45.000 cached

Assume the query covers the past 15 minutes.

If "Min time interval" is set to 1s, there are 15*(60/1)=900 time slots, and each slot contains 5 different buckets, producing 900*5=4500 buckets in total.

If "Min time interval" is set to 30s, there are 15*(60/30)=30 time slots, and each slot contains 5 different buckets, producing 30*5=150 buckets in total.

This explains nicely why increasing the Min time interval makes the problem go away.

References:
Increasing max_buckets for specific Visualizations
ElasticSearch search_phase_execution_exception
ElasticSearch 7.x too_many_buckets_exception #17327


Installing and configuring Kafka on CentOS 7

Generally speaking, Logstash's processing capacity is limited, so Kafka is commonly placed in front of it to buffer log messages and keep peak-hour log volume from overwhelming the pipeline.

Kafka depends on ZooKeeper, so ZooKeeper has to be installed and configured first, and Kafka after it.

Environment:

All machines run CentOS 7.8 64-bit.

IP address      ZooKeeper install dir   ZooKeeper data dir      Kafka install dir   Kafka internal domain
172.29.4.168    /data/zookeeper         /data/zookeeper/data    /data/kafka         kafka1.zhukun.net
172.29.4.169    /data/zookeeper         /data/zookeeper/data    /data/kafka         kafka2.zhukun.net
172.29.4.170    /data/zookeeper         /data/zookeeper/data    /data/kafka         kafka3.zhukun.net

1. Deploy and configure ZooKeeper

The following steps have to be performed on all three servers.

$ wget http://mirror.bit.edu.cn/apache/zookeeper/stable/apache-zookeeper-3.5.8-bin.tar.gz
$ tar zxvf apache-zookeeper-3.5.8-bin.tar.gz
$ mv apache-zookeeper-3.5.8-bin /data/
$ ln -s /data/apache-zookeeper-3.5.8-bin /data/zookeeper && mkdir /data/zookeeper/data

Prepare the ZooKeeper configuration file

$ vim /data/zookeeper/conf/zoo.cfg    # write the following configuration
dataDir=/data/zookeeper/data
clientPort=2181
maxClientCnxns=0
admin.enableServer=false
# admin.serverPort=8080
initLimit=10
syncLimit=5
server.1=172.29.4.168:2888:3888
server.2=172.29.4.169:2888:3888
server.3=172.29.4.170:2888:3888
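
Each node also needs a myid file in dataDir whose content matches its server.N entry in zoo.cfg (run only the line for the local node):

$ echo 1 > /data/zookeeper/data/myid    # on 172.29.4.168 (server.1)
$ echo 2 > /data/zookeeper/data/myid    # on 172.29.4.169 (server.2)
$ echo 3 > /data/zookeeper/data/myid    # on 172.29.4.170 (server.3)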

Prepare the system (systemd) service
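
A minimal zookeeper.service sketch, assuming the install path above (the unit type, user and restart policy here are my assumptions, not necessarily what the original setup used):

$ vim /etc/systemd/system/zookeeper.service    # unit file content sketched below
[Unit]
Description=Apache ZooKeeper
After=network.target

[Service]
Type=forking
ExecStart=/data/zookeeper/bin/zkServer.sh start
ExecStop=/data/zookeeper/bin/zkServer.sh stop
Restart=on-failure

[Install]
WantedBy=multi-user.target

$ systemctl daemon-reload && systemctl enable --now zookeeper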


ElasticSearch: resolving UNASSIGNED SHARDS

How to fix UNASSIGNED SHARDS in ElasticSearch.

First, check how many unassigned shards there are in the cluster and whether the shards are evenly distributed.

$ curl -XGET "172.18.192.100:9200/_cat/allocation?v"
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  2120        2.8tb     6.2tb     11.6tb     17.9tb           35 172.18.192.101 172.18.192.101 it-elk-node3
  3520        5.8tb     5.9tb       12tb     17.9tb           33 172.18.192.102 172.18.192.102 it-elk-node4
   764          1tb       2tb      9.3tb     11.3tb           17 172.18.192.100 172.18.192.100 it-elk-node2
  1707                                                                                         UNASSIGNED
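
The _cat/shards API can also list exactly which shards are unassigned and why, for example:

$ curl -XGET "172.18.192.100:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED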

Normally ES assigns unassigned shards to the nodes automatically. Use the following command to confirm that automatic shard allocation is switched on:

$ curl -XGET http://172.18.192.100:9200/_cluster/settings?pretty
{
  "persistent" : {
    "cluster" : {
      "max_shards_per_node" : "20000"    # 一个node可以拥有最大20000个shards
    },
    "xpack" : {
      "monitoring" : {
        "collection" : {
          "enabled" : "true"
        }
      }
    }
  },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "all"    # 只要cluster.routing.allocation.enable是all的状态, ES就会自动分配shards
        }
      }
    }
  }
}

If automatic shard allocation is not enabled, it can be turned on with a command like the one below.
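
A sketch of such a command, switching the transient setting shown in the output above back to all:

$ curl -XPUT "172.18.192.100:9200/_cluster/settings" -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'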


ElasticSearch reports too many open files

ElasticSearch reports "too many open files"; how do you analyze and pinpoint the problem?

$ curl -XGET "172.18.192.100:9200/_nodes/stats/process?pretty"
{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "it-elk",
  "nodes" : {
    "rBm53XWOTk-2v3MHPa2FDA" : {
      "timestamp" : 1589854287039,
      "name" : "it-elk-node3",
      "transport_address" : "172.18.192.101:9300",
      "host" : "172.18.192.101",
      "ip" : "172.18.192.101:9300",
      "roles" : [
        "ingest",
        "master",
        "data"
      ],
      "attributes" : {
        "ml.machine_memory" : "134778376192",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "process" : {
        "timestamp" : 1589854286789,
        "open_file_descriptors" : 59595,    # 当前打开的文件
        "max_file_descriptors" : 65535,     # 系统允许打开的最大文件
        "cpu" : {
          "percent" : 3,
          "total_in_millis" : 86105320
        },
        "mem" : {
          "total_virtual_in_bytes" : 1669361537024
        }
      }
    }

Of course, you can also look at the current limits from the OS side.

$ ps -ef | grep elasticsearch    # find the process PID
elastic+ 128967      1 99 May18 ?       1-13:22:07 /usr/share/elasticsearch/jdk/bin/java -Xms32g -Xmx32g -XX:+UseConcMarkSweepGC

$ cat /proc/128967/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             4096                 4096                 processes
Max open files            65535                65535                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       514069               514069               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
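
If the 65535 ceiling needs to be raised, that is done at the OS/service level rather than inside ES; a sketch for a systemd-managed Elasticsearch (the drop-in path and the new value are assumptions):

$ mkdir -p /etc/systemd/system/elasticsearch.service.d
$ cat > /etc/systemd/system/elasticsearch.service.d/override.conf <<'EOF'
[Service]
LimitNOFILE=131072
EOF
$ systemctl daemon-reload && systemctl restart elasticsearch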

References:
https://www.elastic.co/guide/en/elasticsearch/guide/master/_file_descriptors_and_mmap.html
ElasticSearch: Unassigned Shards, how to fix?
