Hadoop Developer’s Guide
Note: The scripts in this example are intended to be run on CentOS. For other operating systems, adjust the scripts before executing them.
1. Create Hadoop Client Node
UHadoop provides two access modes: client node and SSH. The client node mode is recommended; for details, see Cluster Access.
2. HDFS
HDFS is a highly fault-tolerant, high-throughput distributed file system. It is designed to be scalable and easy to use, and is well suited to storing massive files.
2.1 Basic HDFS Operations
- Query Files
Usage: hadoop fs [generic options] -ls [-d] [-h] [-R] [<path>]
- Upload Files
Usage: hadoop fs [generic options] -put [-f] [-p] [-l] <localsrc> ... <dst>
- Download Files
Usage: hadoop fs [generic options] -get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>
For more details, refer to: hadoop fs -help
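As a quick end-to-end illustration of the three commands above, the following round trip uploads a file, lists it, and downloads it again (local.txt and local_copy.txt are hypothetical file names):
echo "hello" > local.txt
hadoop fs -put local.txt /tmp/local.txt
hadoop fs -ls /tmp
hadoop fs -get /tmp/local.txt ./local_copy.txt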
2.2 WebHDFS
WebHDFS provides a RESTful interface to HDFS that can be used to operate on HDFS files. With WebHDFS, the client first contacts the NameNode to obtain the address of the DataNode where the file resides, and then exchanges data directly with that DataNode.
2.2.1 Upload File
A UHadoop cluster is configured with 2 master nodes by default; at any given moment only one NameNode is in the Active state while the other is in Standby. The examples below assume the NameNode on uhadoop-******-master1 is Active.
- Data Preparation
touch uhadoop.txt
echo "uhadoop" > uhadoop.txt
- Create File Request
curl -i -X PUT "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=CREATE"
Note:
- The hosts of all cluster nodes must be added to the /etc/hosts file of the machine executing this command.
- If the response reports "Operation category READ is not supported in state standby", retry with uhadoop-******-master2.
The above command returns a Location header, which is the DataNode address for the file:
HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
Content-Length: 0
- Upload the File Using the Location Address Above
curl -i -X PUT -T uhadoop.txt "http://uhadoop-******-core*:50075/webhdfs/v1/tmp/uhadoop.txt?op=CREATE&namenoderpcaddress=Ucluster&overwrite=false"
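The two-step flow (request a redirect from the NameNode, then send the data to the returned DataNode) can also be scripted. A minimal sketch, assuming the Active NameNode is uhadoop-******-master1; the same pattern works for the APPEND operation below:
# Step 1: ask the NameNode for the DataNode Location (no data is sent yet)
LOCATION=$(curl -s -i -X PUT "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=CREATE" | grep -i '^Location:' | awk '{print $2}' | tr -d '\r')
# Step 2: upload the file body to the returned DataNode address
curl -i -X PUT -T uhadoop.txt "$LOCATION"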
2.2.2 Append File
- Data Preparation
touch append_uhadoop.txt
echo "test_content" > append_uhadoop.txt
- Get the Address of the File to Be Appended
curl -i -X POST "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=APPEND"
The above command returns a Location header, which is the DataNode address for the file:
HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
Content-Length: 0
- Append File
curl -i -X POST -T append_uhadoop.txt "http://uhadoop-******-core*:50075/webhdfs/v1/tmp/uhadoop.txt?op=APPEND&namenoderpcaddress=Ucluster"
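To verify that the append took effect, the standard WebHDFS GETFILESTATUS operation returns the file's metadata; the length field in the JSON response should have grown by the size of append_uhadoop.txt:
curl -i "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=GETFILESTATUS"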
2.2.3 Open and Read Files
curl -i -L "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=OPEN"
2.2.4 Delete Files
curl -i -X DELETE "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=DELETE"
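DELETE also accepts the standard recursive parameter for removing directories; for example (/tmp/some_dir is a hypothetical path):
curl -i -X DELETE "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/some_dir?op=DELETE&recursive=true"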
2.3 HttpFS
HttpFS is an HTTP interface for HDFS originally provided by Cloudera; it allows reading and writing HDFS through the WebHDFS RESTful API. The difference from WebHDFS is that HttpFS does not require the client to reach every node of the cluster: it only needs access to the single machine running the HttpFS service (UHadoop starts HttpFS on master1:14000 by default). Because HttpFS is a web application running in an embedded Tomcat, its performance is somewhat constrained.
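Because HttpFS exposes the same WebHDFS REST API, other operations work unchanged against master1:14000. For example, listing /tmp with the standard LISTSTATUS operation (note that user.name is required, as explained in 2.3.1):
curl -i "http://uhadoop-******-master1:14000/webhdfs/v1/tmp?op=LISTSTATUS&user.name=root"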
2.3.1 Upload File
- Data Preparation
touch httpfs_uhadoop.txt
echo "httpfs_uhadoop" > httpfs_uhadoop.txt
- Upload Data
curl -i -X PUT -T httpfs_uhadoop.txt --header "Content-Type: application/octet-stream" "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=CREATE&user.name=root&data=true"
Note:
- The host of master1 must be added to the /etc/hosts file of the machine executing this command.
- user.name must be included in the URL; otherwise the request fails with an "HTTP Status 401 - Authentication required" error.
2.3.2 Append File
- Data Preparation
touch append_httpfs.txt
echo "append_httpfs" > append_httpfs.txt
- Append File
curl -i -X POST -T append_httpfs.txt --header "Content-Type: application/octet-stream" "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=APPEND&user.name=root&data=true"
2.3.3 Open and Read File
curl -i -L "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=OPEN&user.name=root"
2.3.4 Delete File
curl -i -X DELETE "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=DELETE&user.name=root"
2.4 MapReduce Job
Using terasort as an example, this section demonstrates how to submit a MapReduce job.
- Generate the Official terasort Input Dataset
hadoop jar /home/hadoop/hadoop-examples.jar teragen 100 /tmp/terasort_input
- Submit the Task
hadoop jar /home/hadoop/hadoop-examples.jar terasort /tmp/terasort_input /tmp/terasort_output
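The same examples jar also ships a teravalidate job that checks whether the output is correctly sorted; a sketch, assuming the jar path used above (/tmp/terasort_validate is a hypothetical report directory):
hadoop jar /home/hadoop/hadoop-examples.jar teravalidate /tmp/terasort_output /tmp/terasort_validate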
2.5 HDFS Daily Operations
2.5.1 Restart Service
Restart Namenode: service hadoop-hdfs-namenode restart
Restart Datanode: service hadoop-hdfs-datanode restart
Restart ResourceManager: service hadoop-yarn-resourcemanager restart
Restart NodeManager: service hadoop-yarn-nodemanager restart
Restart the entire Hadoop service: Please operate it through the cluster service management page of the console.
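After restarting a NameNode, it can be useful to confirm which one is Active. A sketch using the stock hdfs haadmin tool; the service IDs nn1 and nn2 are placeholders and depend on the dfs.ha.namenodes setting in your hdfs-site.xml:
hdfs haadmin -getServiceState nn1   # prints "active" or "standby"
hdfs haadmin -getServiceState nn2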
2.5.2 Check HDFS Status and Node Information
hdfs dfsadmin -report
2.5.3 Modify the Number of Replicas of HDFS Files
hdfs dfs -setrep -R [replication-factor] [targetDir]
Example: set the replication factor of all files under the HDFS root directory to 2:
hdfs dfs -setrep -R 2 /
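To verify the change, hdfs dfs -stat can print a file's replication factor (%r is the standard replication format specifier; /tmp/uhadoop.txt is the file uploaded earlier):
hdfs dfs -stat "%r" /tmp/uhadoop.txt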
2.5.4 View HDFS File System Status
hadoop fsck /
The returned result looks like the following:
Total size: 455660769497 B (Total open files size: 44723814 B)
Total dirs: 47975
Total files: 70456
Total symlinks: 0 (Files currently being written: 11)
Total blocks (validated): 69916 (avg. block size 6517260 B) (Total open file blocks (not validated): 10)
Minimally replicated blocks: 69916 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 87 (0.12443504 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0011585
Corrupt blocks: 0
Missing replicas: 522 (0.24815665 %)
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Thu Nov 24 16:08:12 CST 2016 in 2044 milliseconds
The filesystem under path '/' is HEALTHY
HEALTHY above indicates that the HDFS file system is in a normal state, with no corrupt blocks and no data loss.
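If fsck ever reports CORRUPT instead, the stock options below help locate the affected files (/tmp is just an example path):
hadoop fsck / -list-corruptfileblocks
hadoop fsck /tmp -files -blocks -locations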