Saturday, 5 May 2018

Debugging Elasticsearch Cluster Going RED Or Yellow

How To Get Elasticsearch Cluster Back To Green From RED Or Yellow


Once our Elasticsearch cluster went RED. Everything stopped working and it was a panic situation: we didn't know what went wrong, and it was late at night. We started debugging and followed a series of steps to figure out what had happened. In this post, we explain the steps we took to recover the cluster, to make you familiar with Elasticsearch debugging. This process won't cover every kind of ES issue, but it should give you a good idea of where to start.


Following are the steps we took to debug the issue and get the cluster back to normal:

First, analyse the cluster health with the following command:

curl localhost:9200/_cat/health?v
This tells you whether the cluster is red, yellow, or green, how many shards are unassigned, initialising, or relocating, and whether there are any pending tasks.
In our case the cluster health was RED, with around 2000 unassigned shards and some initialising shards whose counts were not reducing over time, which essentially means the cluster was not recovering at all.
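To keep an eye on whether the unassigned/initialising counts are actually coming down, you can simply re-run the health check in a loop; a small sketch using the standard watch utility (nothing Elasticsearch-specific, adjust the host to your setup):

# Re-run the health check every 10 seconds and watch the unassign/init columns
watch -n 10 "curl -s 'localhost:9200/_cat/health?v'"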
The reasons for unassigned shards can vary: not enough disk space, insufficient resources, heavy bulk requests, and so on. The aim now was to find out why the shards were unassigned and to bring their number down.
Low disk space can result in unassigned shards, so check the overall cluster disk usage first. In our case it was hovering around 60%, which meant disk space was not the issue.


Also check node-level disk usage, as one or more nodes might be at the 85% level, which is the default per-node watermark at which ES stops allocating new shards to that node. You can check node-level disk utilisation with the following command:

curl localhost:9200/_cat/allocation?v


Disk utilisation on each node was ok.
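If a node had been over the watermark, one option (which we did not need here) is to temporarily raise the disk watermarks while freeing up space; a sketch assuming the standard settings, with illustrative percentages:

# Temporarily raise the disk watermarks (transient setting, use with care)
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}'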
Now check the master node logs:
Tail the current master node's logs. If you don't know which node is the master, run the following command; the node marked with an asterisk (*) is the currently elected master:
curl localhost:9200/_cat/nodes?v
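Alternatively, the _cat/master endpoint prints only the currently elected master, which is quicker than scanning the nodes list:

# Show only the elected master (id, host, ip, node name)
curl localhost:9200/_cat/master?v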
In the master node logs, the following errors/exceptions were noted:
[2018-04-02 07:08:10,088][DEBUG][action.admin.indices.create] [ip-192-168-6-66] [myindex-2030-11-29] failed to create
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index [myindex-2030-11-29], cause [auto(bulk api)]) within 1m
        at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:278)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
This clearly tells us that index creation is failing to complete within the fixed timeout, and that the indices are being auto-created by bulk API requests. Also, the indices being created are future-dated, which is not desired.
The next thing was to find out which indices had gone bad and turned the whole cluster RED. Run the following command to list them:
curl -XGET 192.168.6.106:9200/_cat/shards | grep UNASSIGNED
Or, to also see the reason why each shard is unassigned, run the following command:
curl -XGET 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
One common reason for unassigned shards is NODE_LEFT: the master considers a node to have left the cluster if it doesn't acknowledge its presence within a specific time. A data node may fail to respond for many reasons, for example the process died, or frequent stop-the-world GC pauses are taking too long. You can check for these in that particular data node's logs.
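On newer Elasticsearch versions (5.x and above; not available on the older cluster in this post), the allocation explain API gives a much more detailed reason for an unassigned shard:

# Explain why the first unassigned shard it finds cannot be allocated (ES 5.x+)
curl -XGET localhost:9200/_cluster/allocation/explain?pretty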
In our case the following indices were unassigned and red (only a few are shown as an example). They are all future-dated:


myindex-2021-06-30 4 r UNASSIGNED
myindex-2021-06-30 4 r UNASSIGNED
myindex-2021-06-30 4 p UNASSIGNED
myindex-2021-06-30 0 r UNASSIGNED
myindex-2021-06-30 0 r UNASSIGNED
myindex-2021-06-30 0 p UNASSIGNED
myindex-2021-06-30 3 r UNASSIGNED
myindex-2021-06-30 3 r UNASSIGNED
myindex-2021-06-30 3 p UNASSIGNED
myindex-2021-06-30 1 r UNASSIGNED
myindex-2021-06-30 1 r UNASSIGNED
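To see at a glance which indices contribute the most unassigned shards, a quick pipeline over the _cat/shards output helps (plain grep/awk/sort):

# Count unassigned shards per index, most affected index first
curl -s localhost:9200/_cat/shards | grep UNASSIGNED | awk '{print $1}' | sort | uniq -c | sort -rn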
We were pretty sure that some client had uploaded a CSV containing future-dated events, which resulted in a burst of index creations.


Now, to recover the cluster, we stopped writes to Elasticsearch (since we consume from Kafka, we could afford to stop writes and replay them later). If you can't stop the writes, at least disable allocation.
After stopping writes, we disabled shard allocation on the cluster with the following command:
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'
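You can read the settings back to confirm the change was applied:

# Confirm cluster.routing.allocation.enable shows up as "none" under the transient settings
curl localhost:9200/_cluster/settings?pretty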
We also cleared the cluster caches to free up memory and help things move along faster, with the following command:
curl -XPOST localhost:9200/_cache/clear
Now the unassigned shards started moving into the initialising state, and the pending tasks were slowly reducing.
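The backlog itself can be watched through the pending tasks queue:

# List cluster-level tasks still waiting to be processed by the master
curl localhost:9200/_cat/pending_tasks?v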
Alternatively, an unassigned shard can be routed to a healthy node manually, for example when the cluster has gone yellow because of an unhealthy node. You can use the following command to reroute a shard:



curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '
{
   "commands" : [
      {
          "allocate" : {
              "index" : "myindex-2021-06-30", "shard" : 1,
              "node": "ip-192-168-6-165", "allow_primary": "true"
          }
      }
   ]
}'
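Note that "allow_primary": "true" tells this older allocate command to create an empty primary if no copy of the shard data exists, which means losing whatever was in that shard. On Elasticsearch 5.x and later the command was split into allocate_replica, allocate_stale_primary and allocate_empty_primary; the safe replica variant would look roughly like this (a sketch, reusing the same index/shard/node names as above):

# ES 5.x+ equivalent: allocate an unassigned replica copy to a specific node
curl -XPOST 'http://localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
{
   "commands" : [
      {
          "allocate_replica" : {
              "index" : "myindex-2021-06-30", "shard" : 1,
              "node" : "ip-192-168-6-165"
          }
      }
   ]
}'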


You can also ask the cluster to retry allocating shards whose allocation previously failed, without specifying a particular node:

curl -XPOST localhost:9200/_cluster/reroute?retry_failed

This rerouting was not needed in our case, since we were free to delete the RED unassigned shards: we didn't want the future-dated events to be recorded at all.
To delete the indices containing RED unassigned shards, run the following script:


#!/bin/bash
# Delete every index that has at least one UNASSIGNED shard.
# NOTE: this deletes the whole index, not just the unassigned shard.
for INDEX in $(curl -s 'localhost:9200/_cat/shards' | grep UNASSIGNED | awk '{print $1}' | sort -u); do
  echo "Deleting index: $INDEX"
  curl -XDELETE "localhost:9200/$INDEX"
done
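After the script finishes, you can check that no RED indices remain (the health query parameter filters the listing):

# List only indices whose health is red; ideally this returns nothing
curl 'localhost:9200/_cat/indices?v&health=red'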
When there were no more unassigned shards and the pending tasks count reached 0, the cluster turned green again. We then re-enabled allocation with the following command:
curl -XPUT 192.168.6.106:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
After this, we started the application writing to the Elasticsearch cluster again.
To make sure this doesn't happen again, a validation was added on the application side to stop future-dated events from being passed to Elasticsearch.
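As an additional safeguard on the Elasticsearch side (something we did not do during this incident, mentioned here only as an idea), auto-creation of indices by the bulk API can be restricted with the action.auto_create_index setting in elasticsearch.yml, so a stray request cannot create arbitrary new indices; the pattern below is purely illustrative and would need to match your own index names:

# elasticsearch.yml: allow auto-creation only for indices matching this pattern, deny the rest
action.auto_create_index: "+myindex-2018-*,-*"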
