Fault Recovery

Recovery from yellow state

A yellow state means that all primary shards are assigned but one or more replica shards are unassigned.

Query index status

curl -s -XGET 'http://<host>:9200/_cat/indices?v'
curl -s -XGET 'http://<host>:9200/_cluster/health?level=indices'

Query unassigned shards

curl -s -XGET 'http://<host>:9200/_cat/shards?v' | grep UNASSIGNED
curl -s -XGET 'http://<host>:9200/_cluster/health?level=shards'
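On a cluster with many shards, the raw grep output can be hard to scan. The following is a minimal sketch that counts unassigned shards per index and shard type. It assumes the default `_cat/shards` column order (index, shard, prirep, state); the function name `summarize_unassigned` and the sample data are illustrative only.

```shell
#!/bin/sh
# Summarize unassigned shards per index from `_cat/shards` text on stdin.
# Assumes the default column order: index shard prirep state ...
summarize_unassigned() {
    awk '$4 == "UNASSIGNED" { key = $1 " " $3; count[key]++ }
         END { for (k in count) print k, count[k] }' | sort
}

# Sample input standing in for a live cluster response (made-up data).
sample='index   shard prirep state      docs store ip       node
logs-1  0     p      STARTED    120  1mb   10.0.0.1 node-a
logs-1  0     r      UNASSIGNED
logs-1  1     p      STARTED    98   1mb   10.0.0.2 node-b
logs-1  1     r      UNASSIGNED
users   0     p      STARTED    10   5kb   10.0.0.1 node-a
users   0     r      STARTED    10   5kb   10.0.0.2 node-b'

# Each output line reads: <index> <p|r> <number of unassigned shards>
printf '%s\n' "$sample" | summarize_unassigned

# Against a live cluster, the same function could be fed from curl:
# curl -s -XGET 'http://<host>:9200/_cat/shards?v' | summarize_unassigned
```

A `p` in the second output column would indicate unassigned primary shards, which is a more serious condition than yellow.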

* Unreasonable index replica setting

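If an index's number of replicas is greater than or equal to the number of data nodes, some replicas can never be assigned, because a replica is not placed on the same node as another copy of the same shard. The cluster then stays in a yellow state. Reduce the number of replicas to restore the cluster status.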

curl -XPUT \
'http://<host>:9200/unassigned_index/_settings' \
-H 'Content-Type: application/json' \
-d '{
    "index": {
        "number_of_replicas": replicasCount
    }
}'

# unassigned_index is the name of the index that has unassigned shards
# replicasCount is the new number of replicas for the index (an integer)
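As a rough guide for choosing the new replica count: Elasticsearch does not place two copies of the same shard on one node, so the largest replica count that can be fully assigned is the number of data nodes minus one. The sketch below encodes that rule; the helper name `max_assignable_replicas` is hypothetical, and the data node count is supplied by hand rather than read from the cluster.

```shell
#!/bin/sh
# Print the largest number_of_replicas that can be fully assigned
# for a given number of data nodes (one node must hold the primary).
max_assignable_replicas() {
    data_nodes=$1
    if [ "$data_nodes" -lt 1 ]; then
        echo 0
        return
    fi
    echo $((data_nodes - 1))
}

max_assignable_replicas 3   # prints 2
max_assignable_replicas 1   # prints 0
```

Other allocation rules (disk watermarks, allocation filtering, awareness settings) can reduce this number further, so treat it as an upper bound.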

Under normal circumstances, unassigned replica shards will be automatically assigned and the cluster status will recover to green. Under special circumstances, it might be necessary to manually assign unassigned replica shards.

curl -XPOST \
'http://<host>:9200/_cluster/reroute' \
-H 'Content-Type: application/json' \
-d '{
    "commands": [{
        "allocate_replica": {
            "index": "unassigned_index",
            "shard": num,
            "node": "nodeName"
        }
    }]
}'

# unassigned_index is the name of the index that contains the unassigned shard
# num is the number of the unassigned shard (an integer, as shown in the shard column of _cat/shards)
# nodeName is the name or ID of the target node, such as kVWViI1PQt2Bk2rP7PlrbQ
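Because the request body above contains placeholders, it can help to generate it from concrete values. The following sketch builds the `allocate_replica` body with `printf`; the function name `allocate_replica_body` and the example values `logs-1`, `0`, and `node-a` are assumptions for illustration, not names from a real cluster.

```shell
#!/bin/sh
# Build the JSON body for an allocate_replica reroute command.
# Arguments: index name, shard number (integer), node name or ID.
allocate_replica_body() {
    printf '{"commands":[{"allocate_replica":{"index":"%s","shard":%d,"node":"%s"}}]}\n' \
        "$1" "$2" "$3"
}

# Print the body for shard 0 of a hypothetical index on a hypothetical node.
allocate_replica_body logs-1 0 node-a

# The body could then be sent to the Reroute API, for example:
# curl -XPOST 'http://<host>:9200/_cluster/reroute' \
#   -H 'Content-Type: application/json' \
#   -d "$(allocate_replica_body logs-1 0 node-a)"
```

This sketch does no JSON escaping, so it is only suitable for index and node names without quotes or backslashes.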

The cluster attempts to allocate a shard at most index.allocation.max_retries times in a row (default is 5) before giving up and leaving the shard unallocated. This situation might be caused by structural problems, such as an analyzer referring to a stop word file that does not exist on all nodes. Once the problem has been resolved, allocation can be retried manually by calling the Reroute API with the retry_failed parameter, which attempts a single retry round for these shards.

POST /_cluster/reroute?retry_failed=true
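The retry request above is written in console syntax. A curl form in the style of the other examples might look like the sketch below; the helper name `reroute_retry_url`, the host `es-node-1`, and the default port 9200 are assumptions for illustration.

```shell
#!/bin/sh
# Build the Reroute API URL that retries failed shard allocations.
# Arguments: host, optional port (defaults to 9200).
reroute_retry_url() {
    host=$1
    port=${2:-9200}
    printf 'http://%s:%s/_cluster/reroute?retry_failed=true\n' "$host" "$port"
}

# Print the URL for a hypothetical host.
reroute_retry_url es-node-1

# Sending the request to a live cluster:
# curl -s -XPOST "$(reroute_retry_url es-node-1)"
```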

More severe failures can leave primary shards unassigned, which puts the cluster in a red state. For manual allocation of primary shards, refer to the Reroute API documentation.

* Node disk usage exceeds threshold

For the impact of node disk usage on shard allocation, refer to Disk-based Shard Allocation.

If high disk usage is causing the cluster to have unassigned shards, consider adjusting the disk usage policy (the disk watermark thresholds) for temporary relief, or increase the number of nodes.

Additionally, if it is certain that some historical index data is no longer needed, the cluster status can be restored by deleting those indices to free disk space.
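For the temporary relief option mentioned above, the disk usage policy is controlled by the `cluster.routing.allocation.disk.watermark.low` and `cluster.routing.allocation.disk.watermark.high` cluster settings. The sketch below builds a settings body that raises both; the function name `disk_watermark_body` and the values `88%` and `92%` are illustrative assumptions, not recommendations, and suitable values depend on the disk sizes in the cluster.

```shell
#!/bin/sh
# Build a cluster settings body that overrides the low and high disk watermarks.
# Arguments: low watermark, high watermark (for example 88% and 92%).
disk_watermark_body() {
    printf '{"persistent":{"cluster.routing.allocation.disk.watermark.low":"%s","cluster.routing.allocation.disk.watermark.high":"%s"}}\n' \
        "$1" "$2"
}

# Print the body for example thresholds.
disk_watermark_body 88% 92%

# Applying it to a live cluster:
# curl -XPUT 'http://<host>:9200/_cluster/settings' \
#   -H 'Content-Type: application/json' \
#   -d "$(disk_watermark_body 88% 92%)"
```

Once disk space has been freed or nodes have been added, setting both values to `null` in the same request restores the defaults.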