The following appears in the Elasticsearch log:
[INFO ] 2020-06-04 11:04:45.093 [[main]>worker16] elasticsearch - retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of processing of [74000580][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[mail-w3svc1-2020.05.31][0]] containing [10] requests, target allocation id: 6OPCe3kOTWqG5k68lRozUA, primary term: 1 on EsThreadPoolExecutor[name = it_elk_node171/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@31aa695f[Running, pool size = 40, active threads = 40, queued tasks = 200, completed tasks = 31488344]]"})
Cause analysis:
To protect the cluster from overload, Elasticsearch bounds the size of its request queues by design, which improves stability and reliability. Without this limit, a misbehaving or malicious client could easily bring the whole cluster down. So under heavy concurrent load, when traffic exceeds what a single Elasticsearch instance can handle, the server triggers this protective mechanism, refuses to execute new requests, and throws an EsRejectedExecutionException.
This protection is implemented by the thread pools in the Elasticsearch API layer and their associated bounded queues. In the log above, the thread pool Elasticsearch allocated for the index operation has pool size = 40 and queue capacity = 200; when all 40 threads are busy and more than 200 tasks are already buffered in the queue, new tasks are simply dropped and an EsRejectedExecutionException is thrown.
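As an illustration (this is not Elasticsearch code), the fixed-pool-plus-bounded-queue behavior can be sketched in a few lines of Python; `BoundedPool` and its sizes are made up for this post:

```python
import queue

class BoundedPool:
    """Toy model of an Elasticsearch 'fixed' thread pool:
    a fixed number of worker threads plus a bounded task queue."""
    def __init__(self, pool_size, queue_size):
        self.active = 0                               # busy worker threads
        self.pool_size = pool_size                    # e.g. 40 in the log above
        self.tasks = queue.Queue(maxsize=queue_size)  # e.g. 200 in the log above

    def submit(self, task):
        if self.active < self.pool_size:
            self.active += 1                  # a free worker picks it up
            return "accepted"
        try:
            self.tasks.put_nowait(task)       # buffer it in the queue
            return "queued"
        except queue.Full:
            # this is the point where Elasticsearch throws
            # EsRejectedExecutionException and answers HTTP 429
            return "rejected"

pool = BoundedPool(pool_size=2, queue_size=3)
results = [pool.submit(i) for i in range(6)]
print(results)  # ['accepted', 'accepted', 'queued', 'queued', 'queued', 'rejected']
```

In the toy model the workers never finish, but it shows the accounting: only once both the workers and the queue are saturated does a submission get rejected.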
Elasticsearch logs this message at INFO level, which tells you it is not a severe problem; according to the official Elasticsearch documentation it can sometimes even be ignored, but there are still effective things you can do to improve the situation.
Diagnosis and resolution
According to the official Elasticsearch documentation, there are only two ways to deal with a 429 error:
1. Pause the bulk write process for 3-5 seconds, then retry.
2. Increase the relevant parameters of the write thread pool.
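Option 1 can be sketched as a simple retry loop. Here `send_bulk` is a hypothetical stand-in for whatever bulk call your client makes (both the name and the signature are assumptions, not a real client API); it must return the HTTP status code of the response:

```python
import time

def bulk_with_retry(send_bulk, actions, max_retries=5, pause_seconds=3):
    """Retry a bulk write while Elasticsearch answers 429.

    send_bulk is a hypothetical stand-in for your client's bulk call;
    actions is the list of bulk operations to send.
    """
    for _attempt in range(max_retries + 1):
        status = send_bulk(actions)
        if status != 429:              # anything but a rejection: we are done
            return status
        time.sleep(pause_seconds)      # back off 3-5 s before retrying
    raise RuntimeError("bulk request still rejected after %d retries" % max_retries)
```

Note that Logstash's elasticsearch output (which produced the log line above, "retrying failed action with response code: 429") already retries rejected actions on its own; a sketch like this is only needed for custom clients.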
The current parameters of the write thread pool can be retrieved with the following command:
curl -XGET "127.0.0.1:9200/_cat/thread_pool/write?v&h=node_name,ip,name,active,queue,rejected,completed,core,type,pool_size,queue_size"
node_name ip name active queue rejected completed core type pool_size queue_size
it_elk_node169 172.29.4.169 write 0 0 1174 22801844 fixed 40 200
it_elk_node168 172.29.4.168 write 0 0 1811 32356307 fixed 40 200
it_elk_node167 172.29.4.167 write 30 0 11797 39593375 fixed 40 200
it_elk_node171 172.29.4.171 write 0 0 12304 28194291 fixed 40 200
it_elk_node170 172.29.4.170 write 1 0 743 25107047 fixed 40 200
Here, active and queue are the threads and queue slots currently in use, while pool_size and queue_size are the configured limits.
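Because the `_cat` output is plain whitespace-separated text, it is easy to post-process, for example to flag nodes that have rejected write tasks. A small sketch (`nodes_with_rejections` is a throwaway helper written for this post; the sample rows are taken from the table above):

```python
def nodes_with_rejections(cat_text):
    """Return (node_name, rejected) pairs for rows with rejections.

    Only the first six whitespace-separated fields are used, so the
    empty `core` column of fixed pools does not matter.
    """
    hits = []
    for line in cat_text.strip().splitlines():
        node, _ip, _name, _active, _queue, rejected, *_ = line.split()
        if int(rejected) > 0:
            hits.append((node, int(rejected)))
    return hits

# first three data rows from the table above
sample = """\
it_elk_node169 172.29.4.169 write 0 0 1174 22801844 fixed 40 200
it_elk_node168 172.29.4.168 write 0 0 1811 32356307 fixed 40 200
it_elk_node167 172.29.4.167 write 30 0 11797 39593375 fixed 40 200
"""
print(nodes_with_rejections(sample))
# [('it_elk_node169', 1174), ('it_elk_node168', 1811), ('it_elk_node167', 11797)]
```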
Since Elasticsearch 5, thread pool settings can no longer be adjusted through the cluster settings API (from the official documentation):
Thread pool settings are now node-level settings. As such, it is not possible to update thread pool settings via the cluster settings API.
Solution
size is the number of threads and must not exceed the number of CPU cores.
vim elasticsearch.yml    # add the following
thread_pool:
    write:
        size: 30
        queue_size: 1000
Then restart Elasticsearch on each node. After the restart, query again and you will see that the pool_size and queue_size values have changed (nodes that have not been restarted yet still report the old values):
curl -X GET "127.0.0.1:9200/_cat/thread_pool/write?v&h=node_name,ip,name,active,queue,rejected,completed,core,type,pool_size,queue_size"
node_name ip name active queue rejected completed core type pool_size queue_size
it_elk_node167 172.29.4.167 write 0 0 0 1305481 fixed 40 3000
it_elk_node169 172.29.4.169 write 0 0 0 4266653 fixed 40 200
it_elk_node168 172.29.4.168 write 0 0 7280 7097169 fixed 40 200
it_elk_node170 172.29.4.170 write 0 0 0 2271912 fixed 40 3000
it_elk_node171 172.29.4.171 write 0 0 0 4377790 fixed 40 3000
Similarly, you can use the same kind of command to check the pool_size/queue_size of the get, ccr (cross-cluster replication), and search pools.
# View the detailed thread_pool information of all nodes
curl -X GET "127.0.0.1:9200/_nodes/thread_pool?pretty=true"
# View the pool_size/queue_size of the search pool
curl -X GET "127.0.0.1:9200/_cat/thread_pool/search?v&h=node_name,ip,name,active,queue,rejected,completed,core,type,pool_size,queue_size"
node_name ip name active queue rejected completed core type pool_size queue_size
it_elk_node167 172.29.4.167 search 0 0 0 8418 fixed_auto_queue_size 61 1000
it_elk_node169 172.29.4.169 search 0 0 0 64923 fixed_auto_queue_size 61 1000
it_elk_node168 172.29.4.168 search 0 0 0 111409 fixed_auto_queue_size 61 1000
it_elk_node170 172.29.4.170 search 0 0 0 50890 fixed_auto_queue_size 61 1000
it_elk_node171 172.29.4.171 search 0 0 0 38820 fixed_auto_queue_size 61 1000
# View the pool_size/queue_size of the get pool
curl -X GET "127.0.0.1:9200/_cat/thread_pool/get?v&h=node_name,ip,name,active,queue,rejected,completed,core,type,pool_size,queue_size"
node_name ip name active queue rejected completed core type pool_size queue_size
it_elk_node167 172.29.4.167 get 0 0 0 819 fixed 40 1000
it_elk_node169 172.29.4.169 get 0 0 0 331 fixed 40 1000
it_elk_node168 172.29.4.168 get 0 0 0 4756 fixed 40 1000
it_elk_node170 172.29.4.170 get 0 0 0 9 fixed 9 1000
it_elk_node171 172.29.4.171 get 0 0 0 6193 fixed 40 1000
# View the pool_size/queue_size of the ccr (cross-cluster replication) pool
curl -X GET "127.0.0.1:9200/_cat/thread_pool/ccr?v&h=node_name,ip,name,active,queue,rejected,completed,core,type,pool_size,queue_size"
# View all thread pools
curl -XGET "127.0.0.1:9200/_cat/thread_pool?v&h=id,ip,name,queue,rejected,completed"
id ip name queue rejected completed
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 analyze 0 0 0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 ccr 0 0 140182
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 fetch_shard_started 0 0 0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 fetch_shard_store 0 0 60
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 flush 0 0 8700
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 force_merge 0 0 0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 generic 0 0 1267594
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 get 0 0 158496
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 listener 0 0 16736
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 management 0 0 7373132
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 refresh 0 0 15758014
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 rollup_indexing 0 0 0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 search 0 0 1474802
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 search_throttled 0 0 0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 security-token-key 0 0 0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 snapshot 0 0 2
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 transform_indexing 0 0 0
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 warmer 0 0 749996
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 watcher 0 0 23390
VbBadlehSNqlXKtWGlyUvA 172.29.4.156 write 0 0 34229769
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 analyze 0 0 0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 ccr 0 0 70051
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 fetch_shard_started 0 0 0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 fetch_shard_store 0 0 34
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 flush 0 0 8912
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 force_merge 0 0 0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 generic 0 0 1688212
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 get 0 0 95389
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 listener 0 0 11281
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 management 0 0 11715898
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 refresh 0 0 15957993
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 rollup_indexing 0 0 0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 search 0 0 884079
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 search_throttled 0 0 0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 security-token-key 0 0 0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 snapshot 0 0 0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 transform_indexing 0 0 0
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 warmer 0 0 482303
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 watcher 0 0 46692
Py5AN5uKS7OWoDXwdr_ayw 172.29.4.157 write 0 0 50279093
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 analyze 0 0 0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 ccr 0 0 70124
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 fetch_shard_started 0 0 0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 fetch_shard_store 0 0 50
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 flush 0 0 14722
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 force_merge 0 0 0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 generic 0 0 1473305
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 get 0 0 113200
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 listener 0 0 12675
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 management 0 0 27809756
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 refresh 0 0 12937481
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 rollup_indexing 0 0 0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 search 0 0 987998
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 search_throttled 0 0 0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 security-token-key 0 0 0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 snapshot 0 0 0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 transform_indexing 0 0 0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 warmer 0 0 670607
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 watcher 0 0 0
wrKFBPZGQCm9kHxDAe8krw 172.29.4.158 write 0 0 69187927