
Hadoop Developer’s Guide

Annotation: The scripts in this example are intended to run on CentOS. For other operating systems, modify the scripts before executing them.

1. Create Hadoop Client Node

UHadoop provides two access modes: client node and SSH. The client node mode is recommended; for details, see Cluster Access.

2. HDFS

HDFS is a highly fault-tolerant, high-throughput distributed file system. It is designed to be scalable and easy to use, and is well suited to storing massive files.

2.1 Basic HDFS Operations

  • List Files
    Usage: hadoop fs [generic options] -ls [-d] [-h] [-R] [<path>]
  • Upload Files
    Usage: hadoop fs [generic options] -put [-f] [-p] [-l] <localsrc> ... <dst>
  • Download Files
    Usage: hadoop fs [generic options] -get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>

For more details, refer to: hadoop fs -help
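
For instance, a typical round trip (upload a local file, list it, then download it back) might look like the following sketch; the file names example.txt and example_copy.txt are placeholders, not taken from the cluster:

    # Upload a local file to HDFS (-f overwrites an existing destination)
    hadoop fs -put -f example.txt /tmp/example.txt
    # List the destination in human-readable form
    hadoop fs -ls -h /tmp/example.txt
    # Download the file back to the local file system
    hadoop fs -get /tmp/example.txt example_copy.txt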

2.2 WebHDFS

WebHDFS provides a RESTful interface to HDFS that can be used to operate on HDFS files. When using WebHDFS, the client first contacts the NameNode to obtain the address of the DataNode where the file resides, then exchanges data directly with that DataNode.

2.2.1 Upload File

A UHadoop cluster is configured with 2 Master nodes by default; at any given moment only one NameNode is in the Active state, while the other is in Standby. The examples below assume the NameNode on uhadoop-******-master1 is Active.
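
To confirm which NameNode is currently Active, you can query the HA state with hdfs haadmin; this is a minimal sketch, and the service IDs nn1 and nn2 are assumptions that depend on the dfs.ha.namenodes setting of your cluster:

    # Each command prints "active" or "standby" for the given NameNode service ID
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2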

  • Data Preparation

    touch uhadoop.txt
    echo "uhadoop" > uhadoop.txt
  • Create File Request

    curl -i -X PUT "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=CREATE"

    Annotation:

    1. Add host entries for all cluster nodes to the machine executing this command (see the example hosts entries below)
    2. If the error "Operation category READ is not supported in state standby" is returned, retry with uhadoop-******-master2
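
    For example, the hosts entries might look like the following; the IP addresses are placeholders, so substitute the actual internal IPs of your cluster nodes:

    # /etc/hosts on the machine executing the curl commands (placeholder IPs)
    10.10.0.1    uhadoop-******-master1
    10.10.0.2    uhadoop-******-master2
    10.10.0.3    uhadoop-******-core1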

    The CREATE request above returns a Location header, which is the DataNode address for the file:

    HTTP/1.1 307 TEMPORARY_REDIRECT
    Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
    Content-Length: 0
  • Upload File Using the Above Location Address

    curl -i -X PUT -T uhadoop.txt "http://uhadoop-******-core*:50075/webhdfs/v1/tmp/uhadoop.txt?op=CREATE&namenoderpcaddress=Ucluster&overwrite=false"
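
After the upload, you can verify the file through the NameNode with the standard WebHDFS GETFILESTATUS operation; this optional check is not part of the original steps:

    curl -i "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=GETFILESTATUS"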
2.2.2 Append File
  • Data Preparation

    touch append_uhadoop.txt
    echo "test_content" > append_uhadoop.txt
  • Get the Address of the File to be Appended

    curl -i -X POST "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=APPEND"

    The above command returns a Location header, which is the DataNode address for the file:

    HTTP/1.1 307 TEMPORARY_REDIRECT
    Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
    Content-Length: 0
  • Append File

    curl -i -X POST -T append_uhadoop.txt "http://uhadoop-******-core*:50075/webhdfs/v1/tmp/uhadoop.txt?op=APPEND&namenoderpcaddress=Ucluster"
2.2.3 Open and Read Files
curl -i -L "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=OPEN"
2.2.4 Delete Files
curl -i -X DELETE "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=DELETE"
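
To delete a directory rather than a single file, WebHDFS also accepts a recursive parameter; in this sketch the path /tmp/uhadoop_dir is a placeholder:

    curl -i -X DELETE "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop_dir?op=DELETE&recursive=true"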

2.3 HttpFS

HttpFS is an HTTP interface for HDFS originally developed by Cloudera; it supports reading and writing HDFS through the WebHDFS REST API. The difference from WebHDFS is that HttpFS does not require clients to reach every node of the cluster: clients only need access to the single machine running the HttpFS service (UHadoop starts HttpFS on master1:14000 by default). Since HttpFS is a web application running in an embedded Tomcat, its performance is somewhat constrained.
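
Because HttpFS speaks the same WebHDFS REST API, read-only operations work the same way against port 14000. For example, listing a directory with the standard LISTSTATUS operation (the user.name parameter follows the examples below):

    curl -i "http://uhadoop-******-master1:14000/webhdfs/v1/tmp?op=LISTSTATUS&user.name=root"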

2.3.1 Upload File
  • Data Preparation

    touch httpfs_uhadoop.txt
    echo "httpfs_uhadoop" > httpfs_uhadoop.txt
  • Upload Data

    curl -i -X PUT -T httpfs_uhadoop.txt --header "Content-Type: application/octet-stream" "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=CREATE&user.name=root&data=true"

    Annotation:

    1. Add a host entry for master1 of the cluster to the machine executing this command
    2. The user.name parameter must be included in the URL; otherwise an "HTTP Status 401 - Authentication required" error is returned
2.3.2 Append File
  • Data Preparation

    touch append_httpfs.txt
    echo "append_httpfs" > append_httpfs.txt
  • Append File

    curl -i -X POST -T append_httpfs.txt --header "Content-Type: application/octet-stream" "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=APPEND&user.name=root&data=true"
2.3.3 Open and Read File
curl -i -L "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=OPEN&user.name=root"
2.3.4 Delete File
curl -i -X DELETE "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=DELETE&user.name=root"

2.4 MapReduce Job

Taking terasort as an example, this section demonstrates how to submit a MapReduce job.

  • Generate the official terasort input dataset

    hadoop jar /home/hadoop/hadoop-examples.jar teragen 100 /tmp/terasort_input
  • Submit Task

    hadoop jar /home/hadoop/hadoop-examples.jar terasort /tmp/terasort_input /tmp/terasort_output
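
To check that the sort output is globally ordered, the examples jar also provides a teravalidate step; this sketch reuses the paths above, and /tmp/terasort_validate is a placeholder report directory:

    hadoop jar /home/hadoop/hadoop-examples.jar teravalidate /tmp/terasort_output /tmp/terasort_validate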

2.5 HDFS Daily Operations

2.5.1 Restart Service

Restart Namenode: service hadoop-hdfs-namenode restart

Restart Datanode: service hadoop-hdfs-datanode restart

Restart ResourceManager: service hadoop-yarn-resourcemanager restart

Restart NodeManager: service hadoop-yarn-nodemanager restart

Restart the entire Hadoop service: Please operate it through the cluster service management page of the console.
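
To check whether a daemon is running before or after a restart, the same init scripts generally accept a status action; for example:

    service hadoop-hdfs-namenode status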

2.5.2 Check HDFS status and node information
hdfs dfsadmin -report
2.5.3 Modify the Number of Replicas of HDFS Files
hdfs dfs -setrep -R [replication-factor] [targetDir]

Example: set the replication factor of all files under the HDFS root directory to 2: hdfs dfs -setrep -R 2 /
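
To verify the change, hadoop fs -stat can print the replication factor of a file; the path /tmp/uhadoop.txt here follows the earlier examples:

    # %r prints the replication factor of the given file
    hadoop fs -stat "%r" /tmp/uhadoop.txt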

2.5.4 View HDFS File System Status
hadoop fsck /

The returned result looks like the following:

Total size:    455660769497 B (Total open files size: 44723814 B)
Total dirs:    47975
Total files:   70456
Total symlinks:    0 (Files currently being written: 11)
Total blocks (validated):    69916 (avg. block size 6517260 B) (Total open file blocks (not validated): 10)
Minimally replicated blocks:    69916 (100.0 %)
Over-replicated blocks:    0 (0.0 %)
Under-replicated blocks:    87 (0.12443504 %)
Mis-replicated blocks:    0 (0.0 %)
Default replication factor:    3
Average block replication:    3.0011585
Corrupt blocks:    0
Missing replicas:    522 (0.24815665 %)
Number of data-nodes:    4
Number of racks:    1
FSCK ended at Thu Nov 24 16:08:12 CST 2016 in 2044 milliseconds

The filesystem under path '/' is HEALTHY

The HEALTHY status above indicates that the HDFS file system is normal, with no corrupt blocks or data loss.
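
If fsck instead reports corrupt or missing blocks, the affected files can be listed for further handling; for example:

    # List files that currently have corrupt blocks
    hdfs fsck / -list-corruptfileblocks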