Hadoop

Why does it prompt “Operation category READ is not supported in state standby” when accessing HDFS data?

Because the Master nodes in UHadoop are deployed in HA mode, there are two NameNodes: at any given moment one is Active and the other is Standby. Memory spikes or network fluctuations can trigger a failover between them, so clients should not access HDFS data through a fixed Master node IP.

Correct usage: If the machine you are using has the UHadoop client installed (refer to hadoopdev#Install the Hadoop client on UHost), you can access HDFS directly through hadoop fs -ls / or hadoop fs -ls hdfs://Ucluster/.

If you are using client code, copy the cluster’s /home/hadoop/conf/hdfs-site.xml and /home/hadoop/conf/core-site.xml to your local program and load the two files through conf.addResource. You can then access HDFS data through hdfs://Ucluster/.
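A minimal Java sketch of this approach, assuming the two files have been copied next to the program (the local paths and class name are placeholders; hdfs://Ucluster/ is the cluster’s logical HA URI mentioned above):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListUcluster {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Load the HA NameNode settings copied from the cluster
        conf.addResource(new Path("/path/to/core-site.xml"));
        conf.addResource(new Path("/path/to/hdfs-site.xml"));
        // With the HA configuration loaded, the logical URI resolves to the currently active NameNode
        FileSystem fs = FileSystem.get(new URI("hdfs://Ucluster/"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```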

Why is the actual available space in HDFS smaller than the configured amount?

The core nodes need to run the node-manager service and therefore store some local data, which takes up extra space. The core1 node is additionally occupied by zookeeper and journal-node, so it has even less space available. As a rule, 90% of the disk space is set aside for HDFS.

Why is only localhost configured in /home/hadoop/etc/hadoop/slaves instead of specifying other node IPs?

The slaves file acts as a whitelist mechanism. If this file is not configured, newly added core nodes can join the cluster by default; a new node joins correctly by reading the NameNode information in hdfs-site.xml. Nodes belonging to other users, or nodes that are unreachable over the network, cannot join the cluster.

The regionservers file for HBase works the same way.

How to clean up if the Hadoop Recycle Bin takes up a lot of space and data files are not cleaned up promptly?

By default, the .Trash folder is checked every 5 days: checkpoints under .Trash that are more than 5 days old are deleted. Files deleted less than 5 days ago are collected into a checkpoint folder named with a timestamp (i.e., /.Trash/yyMMddHHmm) and are removed at the next check after 5 days. Therefore, .Trash keeps files for 5 to 10 days.

You can change the following two parameters to adjust how long files are kept and how often the check runs (a core-site.xml example follows the list):

- fs.trash.interval: how long deleted files are retained
- fs.trash.checkpoint.interval: how often the trash checkpoint runs; defaults to the value of fs.trash.interval
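For example, to keep deleted files for 3 days and run the checkpoint once a day, core-site.xml could contain something like the following (both values are in minutes; the numbers are only illustrative):

```xml
<property>
  <name>fs.trash.interval</name>
  <!-- keep deleted files for 3 days (3 x 24 x 60 minutes) -->
  <value>4320</value>
</property>
<property>
  <name>fs.trash.checkpoint.interval</name>
  <!-- create a new trash checkpoint once a day -->
  <value>1440</value>
</property>
```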

What if Hadoop LZO cannot find the native library?

- Error 1: Could not load native gpl library

Ensure that the current client’s environment variable “LD_LIBRARY_PATH” is the same as the cluster’s.
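For example (the path below is only a placeholder; use the actual location of the native libraries found on the cluster nodes):

echo $LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/path/to/native/lib:$LD_LIBRARY_PATH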

- Error 2: java.lang.RuntimeException: native-lzo library not available

This error occurs because lzo-devel is not installed on the machine running the task, so the program cannot find liblzo2.so.2. Install it on the machine with the following command:

yum install lzo lzo-devel

How to adjust the configuration of task nodes?

To simplify management, the configuration of all task nodes must be kept uniform.

Therefore, when you need to adjust the task node configuration, the only option is to delete the existing task nodes and add new ones of the desired type.

Note:

  1. Deleting a task node will affect currently running tasks;
  2. Users need to back up the data on the node to be deleted.

Why is the memory allocated to a task more than the 1000MB set by the user?

To simplify resource management and scheduling, YARN has a built-in resource normalization algorithm: it defines the minimum and maximum amounts of resources that can be requested, as well as a resource normalization factor. If an application requests less than the minimum allowed value, YARN raises the request to that minimum; the resources an application receives are never less than what it requested, but may be more. If an application requests more than the maximum allowed value, an exception is raised and the request fails. The normalization factor rounds requests: if the requested amount is not an exact multiple of the factor, the request is rounded up to the smallest multiple of the factor that covers it, i.e. ceil(a/b)*b, where a is the amount requested by the application and b is the normalization factor.
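For example, with the default minimum allocation of 1024 MB (which also acts as the normalization factor), a request for 1000 MB is rounded up to ceil(1000/1024) * 1024 = 1024 MB, so a task configured with 1000 MB actually receives 1024 MB.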

These parameters are set in yarn-site.xml; the related parameters include:

- yarn.scheduler.minimum-allocation-mb: minimum memory request, default is 1024

- yarn.scheduler.minimum-allocation-vcores: minimum CPU request, default is 1

- yarn.scheduler.maximum-allocation-mb: maximum memory request, default is 8192

- yarn.scheduler.maximum-allocation-vcores: maximum CPU request, default is 4
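As an illustration, if your jobs need larger containers, the maximum can be raised in yarn-site.xml (the value below is just an example):

```xml
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>16384</value>
</property>
```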

How to locate corrupt files in task logs?

If the uploaded file is compressed and damaged, the task execution will fail. You can locate the damaged file by checking the task log.

  - Find the failed task in the YARN web UI;
  - Click to view the detailed information of the task;
  - Open the task’s history link and find the failed Mapper;
  - Check the specific failed mapper to see which file it was processing.

There are two ways to avoid this problem:

  - If the file does not significantly affect the result, you can skip the bad records by specifying mapreduce.map.skip.maxrecords when submitting the job, allowing the task to continue (see the example after this list);
  - Use another compression format. Because gzip requires the entire file to be intact before it can be decompressed, the lzo format is recommended: even if part of a file is damaged, the task can keep running.
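As a sketch of the first option, the threshold can be passed when submitting the job, assuming the job’s driver uses ToolRunner/GenericOptionsParser so that -D options are honored (the jar, class, and paths below are placeholders):

hadoop jar my-job.jar com.example.MyJob -D mapreduce.map.skip.maxrecords=1 /input /output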

Does data need to be balanced after adding core nodes to the cluster?

After nodes are added successfully, the system will automatically balance the data. If the data in the cluster is still unbalanced after a long period of time, you can submit a data balancing request on the “Cluster Management” page.

You can also submit the balancing command on the master node:

/home/hadoop/sbin/start-balancer.sh -threshold 10

The threshold parameter defines the balancing target; the default value is 10. This means that the data balancing process exits once the disk usage percentage of every core node differs from the cluster average by less than 10%.

What if HDFS data reading response is slow?

If you find that HDFS reads respond slowly and the DataNode log shows warnings such as

WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost

You can check the following aspects (some example commands follow the list):

  - Check disk I/O;
  - Check the GC behavior of the node;
  - Check network bandwidth;
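A few commands that can help with these checks on the affected DataNode (shown as examples; iperf3 may need to be installed separately, and <datanode_pid> / <another_node_ip> are placeholders):

iostat -x 1 5
jstat -gcutil <datanode_pid> 1000 10
iperf3 -c <another_node_ip>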

Why does concurrent writing to HDFS fail?

HDFS supports concurrent reads, and a file can be read while it is being written, but concurrent writes are not supported: only one client can write to a given file at a time, and multiple clients cannot write to the same file simultaneously. This is because when a client obtains permission from the NameNode to write to a file, the NameNode grants it a lease on that file, and other clients cannot write to the file until the lease is released.
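A minimal Java sketch of this behavior, assuming the HA configuration from the first entry above is on the classpath and append support is enabled (the default on Hadoop 2+; the file path is a placeholder): while the first client keeps the file open, a second client’s attempt to write to it is rejected by the NameNode.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleWriterDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        URI uri = new URI("hdfs://Ucluster/");
        Path file = new Path("/tmp/single-writer-demo.txt");

        // First client opens the file for writing and holds the lease
        FileSystem writerA = FileSystem.newInstance(uri, conf);
        FSDataOutputStream out = writerA.create(file);
        out.writeBytes("client A is writing\n");

        // A second, independent client tries to write to the same file
        FileSystem writerB = FileSystem.newInstance(uri, conf);
        try {
            writerB.append(file);   // rejected while client A still holds the lease
        } catch (Exception e) {
            System.out.println("Concurrent write rejected: " + e.getMessage());
        } finally {
            out.close();            // closing the stream releases the lease
            writerA.close();
            writerB.close();
        }
    }
}
```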