Instance Operation and Maintenance

Keeping clusters healthy and handling faults in a production environment requires sound, efficient operational practices. This chapter provides some practical examples of online operations.

Cluster Health

In an Elasticsearch cluster, many pieces of information can be monitored and analyzed, the most important of which is cluster health. Its status has three possible values: `green`, `yellow`, and `red`.

GET /_cluster/health

The system returns:

{
  "cluster_name" : "ues-qwerty",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 1,
  "active_shards" : 2,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

The field to pay the most attention to in the response is `status`; it tells us whether the current cluster is in an available state. The three colors mean the following:

| Status | Description | Note |
| --- | --- | --- |
| `green` | All primary and replica shards are available | All primary and replica shards have been allocated. The cluster is 100% available. |
| `yellow` | All primary shards are available, but one or more replica shards are unassigned | All primary shards have been allocated, but at least one replica is missing. No data is lost, so search results are still complete. However, high availability is weakened to some degree: if more shards disappear, data may be lost. Think of `yellow` as a warning that should be investigated promptly. |
| `red` | One or more primary shards are unassigned | At least one primary shard (and all of its replicas) is missing. This means you are missing data: searches will return only partial results, and write requests routed to the missing shard will return an exception. |

To keep the cluster in `green` status, i.e., with all primary and replica shards available, it is strongly recommended to set the number of primary shards of an index to less than three times the number of data nodes, and the number of replicas to less than the number of data nodes.
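
For example, on a cluster with three data nodes, an index could be created with six primary shards (fewer than three times the data-node count) and one replica (fewer than the data-node count). The index name and numbers below are only illustrative:

PUT /my_index
{
    "settings": {
        "number_of_shards": 6,
        "number_of_replicas": 1
    }
}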

Single Node Monitoring

Cluster health is a high-level summary of everything in the cluster, while the node stats API provides a dizzying array of statistics for each node in the cluster.

The Node Statistics API can be executed with the following command:

GET /_nodes/stats

At the beginning of the output content, we can see the cluster name and our first node:

{
    "_nodes": {
        "total": 3,
        "successful": 3,
        "failed": 0
    },
    "cluster_name": "ues-qwerty",
    "nodes": {
        "cZabQfdFTVCOdh9Zxrjocg": {
            "timestamp": 1515997798238,
            "name": "ues-qwerty-01",
            "transport_address": "192.168.1.1:9300",
            "host": "192.168.1.1",
            "ip": "192.168.1.1:9300",
            "roles": [
                "master",
                "data",
                "ingest"
            ],
            "indices": {
...

Nodes are arranged in a hash, keyed by the node's UUID. The output also shows some of the node's network attributes, which is useful for debugging discovery problems such as a node failing to join the cluster. Typically you will find that the port is wrong, or that the node is bound to the wrong IP address/network interface.
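
If the full response is too large, the node stats API can also be scoped to particular nodes or sections; a minimal sketch (the node name below is taken from the sample output and is only illustrative):

GET /_nodes/ues-qwerty-01/stats
GET /_nodes/stats/indices,jvm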

Indices Section

The indices section lists aggregate statistics for all the indices on this node:

    "indices": {
        "docs": {
           "count": 0,
           "deleted": 0
        },
        "store": {
           "size_in_bytes": 0,
           "throttle_time_in_millis": 0
        },

* docs shows how many documents reside on this node, as well as the number of deleted documents that have not yet been purged from segments.

* The store section shows how much physical storage the node consumes. This metric includes both primary and replica shards. If the throttle time is large, it may indicate that your disk throttling is set too low.

        "indexing": {
                    "index_total": 2,
                    "index_time_in_millis": 146,
                    "index_current": 0,
                    "index_failed": 0,
                    "delete_total": 1,
                    "delete_time_in_millis": 6,
                    "delete_current": 0,
                    "noop_update_total": 0,
                    "is_throttled": false,
                    "throttle_time_in_millis": 0
                },
                "get": {
                    "total": 2,
                    "time_in_millis": 7,
                    "exists_total": 2,
                    "exists_time_in_millis": 7,
                    "missing_total": 0,
                    "missing_time_in_millis": 0,
                    "current": 0
                },
                "search": {
                    "open_contexts": 0,
                    "query_total": 50,
                    "query_time_in_millis": 74,
                    "query_current": 0,
                    "fetch_total": 40,
                    "fetch_time_in_millis": 25,
                    "fetch_current": 0,
                    "scroll_total": 0,
                    "scroll_time_in_millis": 0,
                    "scroll_current": 0,
                    "suggest_total": 0,
                    "suggest_time_in_millis": 0,
                    "suggest_current": 0
                },
                "merges": {
                    "current": 0,
                    "current_docs": 0,
                    "current_size_in_bytes": 0,
                    "total": 0,
                    "total_time_in_millis": 0,
                    "total_docs": 0,
                    "total_size_in_bytes": 0,
                    "total_stopped_time_in_millis": 0,
                    "total_throttled_time_in_millis": 0,
                    "total_auto_throttle_in_bytes": 104857600
                },

* indexing shows how many documents have been indexed. This value is a cumulative counter: it does not decrease when documents are deleted, and it also increases on internal indexing operations such as document updates. It additionally lists the time spent on indexing, the number of documents currently being indexed, and similar statistics for deletions.

* get shows statistics about get-by-ID operations, covering GET and HEAD requests for individual documents.

* search shows the number of active search contexts (open_contexts), the total number of queries, and the total time spent on queries since the node started. The ratio query_time_in_millis / query_total gives a rough measure of how efficient your queries are: the higher it is, the more time each query takes, and the more you should consider tuning.

The fetch statistics cover the second half of query processing (the fetch of query-then-fetch). If fetch is consistently taking longer than query, it points to slow disks, too many documents being fetched, or search requests with overly large pagination (for example, size: 10000).
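
With the sample output above, that works out to 74 ms / 50 queries ≈ 1.5 ms per query and 25 ms / 40 fetches ≈ 0.6 ms per fetch. The ratios can also be computed per node with a small script; a sketch, assuming jq is installed, Elasticsearch is reachable on localhost:9200, and both totals are non-zero:

curl -s 'localhost:9200/_nodes/stats/indices/search' | jq '
  .nodes[] | {
    node: .name,
    avg_query_ms: (.indices.search.query_time_in_millis / .indices.search.query_total),
    avg_fetch_ms: (.indices.search.fetch_time_in_millis / .indices.search.fetch_total)
  }'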

* merges contains information about Lucene segment merging. It tells you how many merges are currently active, the number of documents involved, the total size of the segments being merged, and the total time spent on merge operations.

When your cluster is under heavy write load, the merge statistics become very important. Merging consumes a large amount of disk I/O and CPU. If your index is write-heavy and you see a lot of merge activity, be sure to read _**[Index Performance Tips](https://www.elastic.co/guide/cn/elasticsearch/guide/current/indexing-performance.html)**_.

        "filter_cache": {
           "memory_size_in_bytes": 48,
           "evictions": 0
        },
        "fielddata": {
           "memory_size_in_bytes": 0,
           "evictions": 0
        },
        "segments": {
           "count": 319,
           "memory_in_bytes": 65812120
        },
        ...

* filter_cache shows the amount of memory used by the cached filter bitsets and the number of times filters have been evicted from memory. A very large number of evictions may suggest that you need to increase the filter cache size, or that your filters are not well suited for caching (for example, they churn heavily because of high cardinality, such as filters on a `now` date expression).

However, the eviction count is a difficult metric to judge. Filters are cached on a per segment basis, and evicting filters from a small segment is much cheaper than from a large one. It is possible that you have a large number of evictions, but they all occur on small segments, which means they have only a small impact on query performance. Take the eviction count as a coarse reference. If you see a large number, check your filters to make sure they are normally cached. Continually evicted filters, even if they all occur on very small segments, perform much worse than correctly cached filters.

* fielddata shows the memory used by fielddata, which is used for aggregations, sorting, and so on. There is also an eviction count here. Unlike filter_cache, this eviction count is genuinely useful: it should be zero or very close to it. Because fielddata is not a simple cache, any eviction is costly and should be avoided. If you see evictions here, you need to reassess your memory situation, your fielddata limits, your queries, or all three.

* segments will display the number of Lucene segments currently being served by this node. This is an important figure. Most indices will have about 50–150 segments, even if they hold billions of documents at TB level. An excessive number of segments indicates that there is a problem with merging (for example, the speed of merging cannot keep up with the creation of segments). Note that this statistic is the total of all indices on the node.

The memory statistic shows how much memory the Lucene segments themselves use. This includes low-level data structures such as posting lists, dictionaries, and bloom filters. A very large number of segments increases the overhead of these data structures, and this memory usage is a handy metric for gauging that overhead.
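
To see where that segment memory is going, the cat API can list the segments of an index per shard (the index name is illustrative):

GET /_cat/segments/my_index?v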

Operating System and Process Section

The `OS` and `Process` sections are mostly self-explanatory and will not be covered in detail. They list basic resource statistics such as CPU and load. The `OS` section describes the entire operating system, while the `Process` section shows only the resource usage of the Elasticsearch JVM process.

These are all very useful metrics, but they are usually already covered by your monitoring stack. The statistics include the following (see the example after this list):

* CPU

* Load

* Memory usage rate

* Swap usage rate

* Open file descriptors
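
If you only want these sections, the node stats API accepts a comma-separated list of metrics, for example:

GET /_nodes/stats/os,process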

JVM Section

The `jvm` section includes some very critical information about the JVM process running Elasticsearch. Most importantly, it includes details about garbage collection, which has a significant impact on the stability of your Elasticsearch cluster.

            "jvm": {
                "timestamp": 1515997798245,
                "uptime_in_millis": 1704518268,
                "mem": {
                    "heap_used_in_bytes": 143523960,
                    "heap_used_percent": 3,
                    "heap_committed_in_bytes": 4277534720,
                    "heap_max_in_bytes": 4277534720,
                    "non_heap_used_in_bytes": 95476824,
                    "non_heap_committed_in_bytes": 102416384,

Thread Pool Section

Elasticsearch maintains internal thread pools. These thread pools work together to complete tasks, and if necessary, they will pass tasks to each other. In general, you don’t need to configure or tune thread pools, but it can be useful to examine their stats to see how your cluster is performing.

There is a whole series of thread pools, but they all share the same output format:

  "index": {
     "threads": 1,
     "queue": 0,
     "active": 0,
     "rejected": 0,
     "largest": 1,
     "completed": 1
  }

Each thread pool lists the configured number of threads (threads), how many of those are currently processing tasks (active), and how many task units are waiting in the queue (queue).

If the queue fills to its limit, new task units start to be rejected, which shows up in the rejected counter. This is usually a sign that your cluster is bottlenecked on some resource, since a full queue means the node or cluster is running at full speed but still cannot keep up with the incoming work.
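
A quick way to spot rejections across all nodes and pools is the cat thread pool API (the column names can vary slightly between versions):

GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected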

The thread pool sections to focus on are:

* indexing

The thread pool for ordinary indexing requests

* bulk

Bulk requests, different thread pool than single index requests

* get

Get-by-ID operations

* search

All search and query requests

* merging

A thread pool dedicated to managing Lucene merges

File System and Network Section

Continuing through the /_nodes/stats output, you will see a string of statistics about your file system: free space, data directory paths, disk I/O statistics, and so on. If you are not already monitoring free disk space elsewhere, you can get those numbers here. The disk I/O statistics are also handy, although dedicated command-line tools (such as iostat) are usually more useful.

Obviously, Elasticsearch has a hard time running well when disk space is full, so make sure it doesn't fill up.
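
A convenient way to check free disk space per node without a full node stats call is the cat allocation API, which shows disk usage and the number of shards on each node:

GET /_cat/allocation?v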

There are also two sections related to network statistics:

            "transport": {
                "server_open": 26,
                "rx_count": 8834703,
                "rx_size_in_bytes": 19706396482,
                "tx_count": 8834702,
                "tx_size_in_bytes": 18212792172
            },
            "http": {
                "current_open": 3,
                "total_opened": 153
            },

* transport shows some basic statistics related to the transport address, including node-to-node communications (usually on port 9300) and connections from any transport client or node client. Don’t worry if you see a lot of connections here; Elasticsearch maintains many connections between nodes.

* http shows statistics for the HTTP port (usually 9200). If you see total_opened constantly climbing, that is a clear sign that your HTTP clients are not using keep-alive connections. Persistent keep-alive connections are important for performance, since building up and tearing down sockets is expensive (and wastes file descriptors). Make sure your clients are configured correctly.

Circuit Breaker

Statistics related to the fielddata circuit breaker:

                "fielddata": {
                    "limit_size_in_bytes": 2566520832,
                    "limit_size": "2.3gb",
                    "estimated_size_in_bytes": 0,
                    "estimated_size": "0b",
                    "overhead": 1.03,
                    "tripped": 0
                },

Here you can see the maximum size of the circuit breaker (for example, the circuit breaker trips when a request tries to use more memory than this limit). This section also tells you how many times the breaker has been tripped and the currently configured overhead. The overhead is used to pad estimates, because some requests are harder to estimate than others.

The main focus should be on the tripped indicator. If this number is large or continues to rise, it is a signal that your requests need optimization, or you need to add more memory (on a single machine or by adding new nodes).
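
The circuit-breaker statistics can also be pulled on their own, since breaker is one of the selectable node stats metrics:

GET /_nodes/stats/breaker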

Cluster Statistics

The cluster stats API provides output similar to the node stats. But there is an important difference: node stats show the statistics for each node, while cluster stats show the totals of each metric across all nodes.

It offers some statistics that are well worth a look. For example, you can see at a glance that the whole cluster is using 50% of the available heap, or that filter cache evictions are not excessive. Its main use is to provide a quick overview that is more detailed than cluster health but less detailed than node stats. It is also useful for very large clusters, where node stats output becomes difficult to read.

This API can be called like this:

GET /_cluster/stats

Index Statistics

Sometimes we want to look at statistics from an index-centric perspective: how many search requests did the index receive? How much time does the index take to get a document?

To do this, select the index (or multiple indexes) you are interested in and then execute an index-level statistics API:

# Statistics of my_index index
GET /my_index/_stats 

# Using comma-separated index names can request multiple index statistics
GET /my_index,another_index/_stats 

# Using the specific _all can request the statistics of all indexes
GET /_all/_stats 

The returned statistics are similar to the output of node statistics: search, fetch, get, index, bulk, segment counts, etc.

Index-centric statistics are useful at times, for instance, to identify or verify the hot indexes in the cluster, or to try to find out why some indexes are faster or slower than others.

In practice, node-centric statistics tend to be more useful: bottlenecks are usually about the whole node rather than an individual index. Because an index is generally spread across multiple nodes, index-centric statistics aggregate data from physical machines in different environments, which limits how much they tell you.

Index-centric statistics are a useful tool to keep in your toolbox, but they are usually not the first one to reach for.

Pending Tasks

There are some tasks that only the master node can handle, such as creating a new index or moving shards around the cluster. Since a cluster can have only one master node, only that node can process cluster-level metadata changes. 99.9999% of the time this is not a problem, and the queue of metadata changes stays essentially at zero.

In some rare clusters, metadata changes arrive faster than the master node can process them, which causes the pending operations to accumulate into a queue.

The pending tasks API shows any cluster-level metadata change operations that are queued and waiting (if there are any):

GET /_cluster/pending_tasks

Typically, the response is as follows:

{
   "tasks": []
}

cat API

The cat API can be very useful for accessing some cluster information.

Sending a GET request to the `_cat` endpoint lists all the available cat APIs:

GET /_cat
=^.^=
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/tasks
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/thread_pool/{thread_pools}
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}
/_cat/nodeattrs
/_cat/repositories
/_cat/snapshots/{repository}
/_cat/templates
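
Appending ?v to any cat endpoint adds column headers, and ?help lists the columns that endpoint supports. For example:

GET /_cat/health?v
GET /_cat/thread_pool?help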

Dynamic Changes to Settings

Many settings in Elasticsearch are dynamic and can be modified through the API. Configuration changes that force a node (or cluster) restart should be avoided whenever possible. And although these changes can also be made through the static configuration files, we recommend using the API instead.

The cluster update API has two modes of operation:

* Temporary (Transient)

These changes remain in effect until a full cluster restart; once the entire cluster restarts, they are cleared.

* Permanent (Persistent)

These changes will persist until explicitly modified. They will survive a full cluster restart and override the options in the static configuration file.

Transient and persistent settings are specified in separate blocks of the JSON body:

PUT /_cluster/settings
{
    "persistent" : {
        "discovery.zen.minimum_master_nodes" : 2 
    },
    "transient" : {
        "indices.store.throttle.max_bytes_per_sec" : "50mb" 
    }
}

# This persistent setting will survive a full cluster restart
# This temporary setting will be removed once the entire cluster is restarted for the first time

Logging

Elasticsearch outputs a lot of logs, and the default logging level is INFO. It provides a moderate amount of information, but it’s designed so that your logs don’t get too large.

When debugging problems, especially node discovery related issues (because they often rely on various complex network configurations), increasing the logging level to DEBUG is helpful.

Increase the log level for node discovery:

PUT /_cluster/settings
{
    "transient" : {
        "logger.discovery" : "DEBUG"
    }
}
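
Once the investigation is finished, it is usually worth putting the logger back to its default; setting a transient setting to null should reset it (a sketch mirroring the example above):

PUT /_cluster/settings
{
    "transient" : {
        "logger.discovery" : null
    }
}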

Slow Logs

There is also another log called the Slow Log. The purpose of this log is to capture queries and index requests that exceed a specified time threshold. This log is useful for tracking very slow requests generated by users.

By default, the slow log is turned off. To enable it, you need to define the specific action (query, fetch, or index), the event log level you expect (WARN, DEBUG, etc.), and the time threshold.

This is an index-level setting, which means it can be applied independently to individual indices:

PUT /my_index/_settings
{
    "index.search.slowlog.threshold.query.warn" : "10s", 
    "index.search.slowlog.threshold.fetch.debug": "500ms", 
    "index.indexing.slowlog.threshold.index.info": "5s" 
}

# Queries slower than 10 seconds output a WARN log
# Fetches slower than 500 milliseconds output a DEBUG log
# Indexing slower than 5 seconds output an INFO log

Once the threshold is set, you can switch the log level like any other logger:

PUT /_cluster/settings
{
    "transient" : {
        "logger.index.search.slowlog" : "DEBUG", 
        "logger.index.indexing.slowlog" : "WARN" 
    }
}

# Set the search slow log to DEBUG level
# Set the indexing slow log to WARN level

Historical Data Cleaning

As data keeps being written, the disk usage of the nodes keeps growing; once it exceeds a certain threshold, shards can no longer be allocated. One solution is to delete part of the index data so that the disk usage of the cluster nodes stays at a reasonable level.

Historical data can be deleted from a UHost instance that can reach the cluster over the network, or through UES's Kibana. The API can delete a class of indices using wildcards, or specific indices by name, and a script can be used for regular cleanup.

Get cluster index information:

# UHost
curl -s -XGET 'http://<host>:9200/_cat/indices?v'

# Kibana
GET /_cat/indices?v

Delete Index:

# UHost
curl -s -XDELETE 'http://<host>:9200/index'

# Kibana
DELETE /index

Example of a scheduled cleanup script (this script assumes Logstash-style indices with a date suffix; adapt it accordingly for other naming schemes):

#!/bin/bash
#created at 2017/08/30 19:00

if test ! -f "/var/log/elkDailyDel.log"; then
    touch /var/log/elkDailyDel.log
fi

# Get index information
# Please replace localhost in the next line with the IP corresponding to your own Elasticsearch service
indices=$(curl -s -XGET "localhost:9200/_cat/indices?v" |grep 'logstash' |awk '{print $3}')

# Set the timeline for preserving index
thirtyDaysAgo=$(date -d "$(date +%Y%m%d) -30 days" "+%s")
function DelOrNot(){
    if [ $(($1-$2)) -ge 0 ]; then
        echo 1
    else
        echo 0
    fi
}

for index in ${indices}
do
    indexDate=`echo ${index##*-} |sed 's/\./-/g'`
    indexTime=`date -d "${indexDate}" "+%s"`
    if [ `DelOrNot ${indexTime} ${thirtyDaysAgo}` -eq 0 ]; then
        # If the index time is before the timeline, execute the delete operation
        # Please replace localhost in the next line with your own Elasticsearch service IP
        result=`curl -s -XDELETE "localhost:9200/${index}"`
        echo "delResult is ${result}" >> /var/log/elkDailyDel.log
        if [ `echo ${result} |grep 'acknowledged' |wc -l` -eq 1 ]; then
            echo "${index} had already been deleted!" >> /var/log/elkDailyDel.log
        else
            echo "there is something wrong happend when deleted ${index}" >> /var/log/elkDailyDel.log
        fi
    fi
done

echo $?
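
To run the cleanup automatically, the script can be scheduled with cron; a sketch, assuming the script is saved as /root/elkDailyDel.sh (the path and schedule are illustrative):

# Added via crontab -e: run the cleanup every day at 01:00
0 1 * * * /bin/bash /root/elkDailyDel.sh >/dev/null 2>&1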

Data Hot and Cold Separation

As data keeps being written, the amount of data on the nodes of an SSD cluster grows steadily. Older data may be queried rarely or no longer used at all, yet it still occupies expensive SSD performance and storage capacity. With hot and cold data separation configured, the newest data is stored on SSD nodes while older data is automatically migrated to cheaper SATA nodes, so the limited SSD resources can be used for high-concurrency reads and writes while large volumes of data are still retained.

UES provides a strategy for hot and cold data separation; the process is as follows:

* Preparation

Cluster instance:

SSD (original cluster): ues-qwerty, the first three node IPs are ues-qwerty-ip1, ues-qwerty-ip2, ues-qwerty-ip3 respectively

SATA (new cluster): ues-asdfgh

  1. Modify the configuration of the original cluster (ues-qwerty)

Change the **node.attr.tag** item in Parameter Management in the console to **hot**.

  2. Perform a rolling restart of the original cluster (ues-qwerty)

  3. Pin the existing indices to the original (hot) cluster nodes (ues-qwerty):

# UHost
curl -XPUT 'localhost:9200/*/_settings' -d '{"settings": {"index.routing.allocation.require.tag": "hot"}}'

# Kibana
PUT /*/_settings 
{ 
  "settings": { 
    "index.routing.allocation.require.tag": "hot"
  } 
}

  4. Modify the configuration of the new cluster (ues-asdfgh), in the same way as step 1:

cluster.name: ues-qwerty

discovery.zen.ping.unicast.hosts: [ues-qwerty-ip1, ues-qwerty-ip2, ues-qwerty-ip3]

node.attr.tag: cold

  5. Restart the new cluster (ues-asdfgh)

  6. Migrate the old data to the cold nodes:

# UHost
curl -XPUT 'localhost:9200/cold_index/_settings' -d '{"settings": {"index.routing.allocation.require.tag": "cold"}}'

# Kibana
PUT /cold_index/_settings 
{ 
  "settings": { 
    "index.routing.allocation.require.tag": "cold"
  } 
}

Scheduled migration script example:

#!/bin/bash
#created at 2017/08/30 19:00

time=`date -d last-day "+%Y.%m.%d"`

# Please replace localhost in the next line with your own elasticsearch service's IP
curl -XPUT "http://localhost:9200/*-${time}/_settings" -d'
{
  "index.routing.allocation.require.tag": "cold"
}'
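
After the settings call, you can check that the shards have actually moved to the cold nodes; the node column of the output shows where each shard is allocated (the index name is illustrative):

# UHost
curl -s -XGET 'http://<host>:9200/_cat/shards/cold_index?v'

# Kibana
GET /_cat/shards/cold_index?v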

This chapter is mainly referenced from:

*Elasticsearch: The Definitive Guide*