Troubleshooting

Task Execution Failure

1. View console output logs

Check the console's log output during task execution for any ERROR entries.

2. View task execution logs

If the task runs in the background or on a schedule, first find the ID of the failed task, then retrieve its log and analyze the error message in detail.
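For a YARN-managed task, the aggregated log can usually be fetched by application ID and then searched for errors. A minimal sketch, assuming log aggregation is enabled and the standard YARN CLI is available; the application ID is a placeholder:

```shell
# Assumption: log aggregation is enabled and the YARN CLI is on the PATH of
# the master node; the application ID below is a placeholder for your task.
#   yarn logs -applicationId application_1700000000000_0001 > app.log

# With the aggregated log saved locally, search it for ERROR entries.
# (A tiny sample log is created here purely for illustration.)
printf 'INFO starting job\nERROR java.io.FileNotFoundException: /data/in\nINFO shutting down\n' > app.log
grep -n 'ERROR' app.log   # prints: 2:ERROR java.io.FileNotFoundException: /data/in
```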

Tip: common task-failure errors are listed in Common Task ERRORs.

Troubleshooting Tools

1. View monitoring data

- View cluster or node monitoring data on the cluster's “Monitoring View” page to check for abnormalities.

2. View service logs

- Each node keeps per-service logs under /var/log.
- Task logs can be viewed through the web-yarn page or Hue.
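A quick way to locate a misbehaving service is to scan the per-service directories under /var/log. A sketch assuming a typical Linux layout; directory names such as hadoop-hdfs, hadoop-yarn, or hbase vary by release, so adjust paths as needed:

```shell
# List the service log directories, most recently written first:
ls -lt /var/log | head -n 10

# Scan all *.log files for ERROR lines; the trailing `tail` keeps the
# pipeline's exit status 0 even when nothing matches.
grep -rn 'ERROR' --include='*.log' /var/log 2>/dev/null | tail -n 20
```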

Fault Description

1. Problem Description

When reporting a fault to technical support, include the following information to help locate it quickly:

- Cluster identifier
- Region and availability zone where the cluster runs
- The operations that reproduce the exception
- A detailed description of the exception

2. Check for cluster configuration changes

- Check whether the configuration and environment variables have changed since the last successful run.
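One way to spot configuration drift is to diff the current config against a saved known-good copy. The files below are illustrative stand-ins (on a real node you would compare, for example, a backup of your Hadoop config directory); the values are assumptions:

```shell
# Illustrative only: two sample files stand in for the saved known-good
# config and the current config.
printf '<value>8192</value>\n' > yarn-site.good
printf '<value>1024</value>\n' > yarn-site.now

# diff exits 1 when the files differ, so `|| true` keeps scripts running:
diff -u yarn-site.good yarn-site.now || true
```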

3. Check the logs

Submitted tasks can normally be seen in the Hadoop-yarn interface. If a task does not appear there, it is usually for one of the following reasons:

- The Spark task was submitted in local mode
- The Hive task ran locally (Hive-server2 runs some smaller tasks locally by default)
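A sketch of why the master setting matters for Spark: a job submitted with `--master local[*]` runs only in the local JVM and never reaches YARN, so it cannot appear in the web-yarn page. The jar and class names below are placeholders:

```shell
# Visible in the web-yarn page (submitted to the cluster):
#   spark-submit --master yarn --deploy-mode cluster --class com.example.App app.jar
# NOT visible in web-yarn (runs only in the local JVM):
#   spark-submit --master local[*] --class com.example.App app.jar

MASTER='local[*]'   # placeholder: the --master value the job was submitted with
case "$MASTER" in
  yarn*) echo "visible in the web-yarn page" ;;
  *)     echo "not visible in web-yarn: the job never reached the cluster" ;;
esac
```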

Cluster Running Slowly

1. Check cluster configuration modifications

2. Check the logs

- Check the task logs. If one or more tasks failed, examine the logs of the corresponding task attempts for more detailed error information.
- Check the service logs: under the /var/log directory of each node, each service has its own archive directory. When the cluster runs slowly, the logs usually reveal obvious abnormalities or long-running operations.

3. Check the running status of cluster nodes

- master: Manages the services deployed on the cluster. A performance problem on the master node affects the entire cluster.
- core: Runs map-reduce tasks and hosts the Hadoop Distributed File System (HDFS) and the HBase regionserver.
- task: Runs map-reduce tasks only. Task nodes are pure compute resources and store no data; you can add task nodes to increase performance, or remove unneeded ones.

Note: tasks running on the task nodes fetch their data from the core nodes over the network, so in some cases adding task nodes does not shorten a task's running time.

4. Check input data

- Check whether your input data is evenly distributed across keys. If the data is heavily skewed toward one or a few key values, most of the processing load may be mapped to a small number of nodes while the others sit idle, and this uneven distribution of work slows overall processing.

An example of an unbalanced dataset: a cluster sorts words alphabetically, but the dataset contains only words starting with the letter “a”. When the work is mapped out, the nodes handling words beginning with “a” are overloaded while the nodes for every other letter sit idle.
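A quick skew check before running a job, sketched under the assumption that keys sit in the first tab-separated field of the input (adjust `cut` to your data's format):

```shell
# Build a small skewed sample: most records share the key 'a'.
printf 'a\t1\na\t2\na\t3\na\t4\nb\t5\nc\t6\n' > sample.tsv

# Count records per key, most frequent first; one dominant key signals skew.
cut -f1 sample.tsv | sort | uniq -c | sort -rn
```

If the top key accounts for most of the records, consider repartitioning or salting the keys before running the full job.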