
Hadoop Developer’s Guide

Annotation: The scripts in this example are intended to run on CentOS. For other operating systems, modify the scripts before executing them.

1. Create Hadoop Client Node

UHadoop provides two access modes: client node and SSH. The client node mode is recommended; for details, see Cluster Access.

2. HDFS

HDFS is a highly fault-tolerant, high-throughput distributed file system. It is designed to be scalable and easy to use, and is well suited to storing massive files.

2.1 Basic HDFS Operations

  • List Files
    Usage: hadoop fs [generic options] -ls [-d] [-h] [-R] [<path>]
  • Upload Files
    Usage: hadoop fs [generic options] -put [-f] [-p] [-l] <localsrc> ... <dst>
  • Download Files
    Usage: hadoop fs [generic options] -get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>

For more details, refer to: hadoop fs -help
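
For instance, a typical round trip (upload a local file, list it, then download it back) might look like the following sketch; the file names example.txt and example_copy.txt are placeholders, not taken from the cluster:

    # Upload a local file to HDFS (-f overwrites an existing destination)
    hadoop fs -put -f example.txt /tmp/example.txt
    # List the destination in human-readable form
    hadoop fs -ls -h /tmp/example.txt
    # Download the file back to the local file system
    hadoop fs -get /tmp/example.txt example_copy.txt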

2.2 WebHDFS

WebHDFS provides a RESTful interface to HDFS that can be used to operate on HDFS files. When using WebHDFS, the client first contacts the NameNode to obtain the address of the DataNode where the file resides, then exchanges data directly with that DataNode.

2.2.1 Upload File

A UHadoop cluster is configured with 2 Master nodes by default; at any given moment only one NameNode is in the Active state, while the other is in Standby. The examples below assume the NameNode on uhadoop-******-master1 is Active.
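
To confirm which NameNode is currently Active, you can query the HA state with hdfs haadmin; this is a minimal sketch, and the service IDs nn1 and nn2 are assumptions that depend on the dfs.ha.namenodes setting of your cluster:

    # Each command prints "active" or "standby" for the given NameNode service ID
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2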

  • Data Preparation

    touch uhadoop.txt
    echo "uhadoop" > uhadoop.txt
  • Create File Request

    curl -i -X PUT "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=CREATE"

    Annotation:

    1. Add host entries for all cluster nodes to the machine executing this command (see the example hosts entries below)
    2. If the error "Operation category READ is not supported in state standby" is returned, retry with uhadoop-******-master2
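
    For example, the hosts entries might look like the following; the IP addresses are placeholders, so substitute the actual internal IPs of your cluster nodes:

    # /etc/hosts on the machine executing the curl commands (placeholder IPs)
    10.10.0.1    uhadoop-******-master1
    10.10.0.2    uhadoop-******-master2
    10.10.0.3    uhadoop-******-core1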

    The CREATE request above returns a Location header, which is the DataNode address for the file:

    HTTP/1.1 307 TEMPORARY_REDIRECT
    Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
    Content-Length: 0
  • Upload File Using the Above Location Address

    curl -i -X PUT -T uhadoop.txt "http://uhadoop-******-core*:50075/webhdfs/v1/tmp/uhadoop.txt?op=CREATE&namenoderpcaddress=Ucluster&overwrite=false"
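
After the upload, you can verify the file through the NameNode with the standard WebHDFS GETFILESTATUS operation; this optional check is not part of the original steps:

    curl -i "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=GETFILESTATUS"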
2.2.2 Append File
  • Data Preparation

    touch append_uhadoop.txt
    echo "test_content" > append_uhadoop.txt
  • Get the Address of the File to be Appended

    curl -i -X POST "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=APPEND"

    The above command returns a Location header, which is the DataNode address for the file:

    HTTP/1.1 307 TEMPORARY_REDIRECT
    Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
    Content-Length: 0
  • Append File

    curl -i -X POST -T append_uhadoop.txt "http://uhadoop-******-core*:50075/webhdfs/v1/tmp/uhadoop.txt?op=APPEND&namenoderpcaddress=Ucluster"
2.2.3 Open and Read Files
curl -i -L "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=OPEN"
2.2.4 Delete Files
curl -i -X DELETE "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=DELETE"
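
To delete a directory rather than a single file, WebHDFS also accepts a recursive parameter; in this sketch the path /tmp/uhadoop_dir is a placeholder:

    curl -i -X DELETE "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop_dir?op=DELETE&recursive=true"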

2.3 HttpFS

HttpFS is an HTTP interface for HDFS originally developed by Cloudera; it supports reading and writing HDFS through the WebHDFS REST API. The difference from WebHDFS is that HttpFS does not require clients to reach every node of the cluster: clients only need access to the single machine running the HttpFS service (UHadoop starts HttpFS on master1:14000 by default). Since HttpFS is a web application running in an embedded Tomcat, its performance is somewhat constrained.
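
Because HttpFS speaks the same WebHDFS REST API, read-only operations work the same way against port 14000. For example, listing a directory with the standard LISTSTATUS operation (the user.name parameter follows the examples below):

    curl -i "http://uhadoop-******-master1:14000/webhdfs/v1/tmp?op=LISTSTATUS&user.name=root"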

2.3.1 Upload File
  • Data Preparation

    touch httpfs_uhadoop.txt
    echo "httpfs_uhadoop" > httpfs_uhadoop.txt
  • Upload Data

    curl -i -X PUT -T httpfs_uhadoop.txt --header "Content-Type: application/octet-stream" "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=CREATE&user.name=root&data=true"

    Annotation:

    1. Add a host entry for master1 of the cluster to the machine executing this command
    2. The user.name parameter must be included in the URL; otherwise an "HTTP Status 401 - Authentication required" error is returned
2.3.2 Append File
  • Data Preparation

    touch append_httpfs.txt
    echo "append_httpfs" > append_httpfs.txt
  • Append File

    curl -i -X POST -T append_httpfs.txt --header "Content-Type: application/octet-stream" "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=APPEND&user.name=root&data=true"
2.3.3 Open and Read File
curl -i -L "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=OPEN&user.name=root"
2.3.4 Delete File
curl -i -X DELETE "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=DELETE&user.name=root"

2.4 MapReduce Job

Taking terasort as an example, this section demonstrates how to submit a MapReduce job.

  • Generate the official terasort input dataset

    hadoop jar /home/hadoop/hadoop-examples.jar teragen 100 /tmp/terasort_input
  • Submit Task

    hadoop jar /home/hadoop/hadoop-examples.jar terasort /tmp/terasort_input /tmp/terasort_output
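
To check that the sort output is globally ordered, the examples jar also provides a teravalidate step; this sketch reuses the paths above, and /tmp/terasort_validate is a placeholder report directory:

    hadoop jar /home/hadoop/hadoop-examples.jar teravalidate /tmp/terasort_output /tmp/terasort_validate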

2.5 HDFS Daily Operations

2.5.1 Restart Service

Restart Namenode: service hadoop-hdfs-namenode restart

Restart Datanode: service hadoop-hdfs-datanode restart

Restart ResourceManager: service hadoop-yarn-resourcemanager restart

Restart NodeManager: service hadoop-yarn-nodemanager restart

Restart the entire Hadoop service: Please operate it through the cluster service management page of the console.
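
To check whether a daemon is running before or after a restart, the same init scripts generally accept a status action; for example:

    service hadoop-hdfs-namenode status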

2.5.2 Check HDFS status and node information
hdfs dfsadmin -report
2.5.3 Modify the Number of Replicas of HDFS Files
hdfs dfs -setrep -R [replication-factor] [targetDir]

Example: set the replication factor of all files under the HDFS root directory to 2: hdfs dfs -setrep -R 2 /
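
To verify the change, hadoop fs -stat can print the replication factor of a file; the path /tmp/uhadoop.txt here follows the earlier examples:

    # %r prints the replication factor of the given file
    hadoop fs -stat "%r" /tmp/uhadoop.txt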

2.5.4 View HDFS File System Status
hadoop fsck /

The returned result looks like the following:

Total size:    455660769497 B (Total open files size: 44723814 B)
Total dirs:    47975
Total files:   70456
Total symlinks:    0 (Files currently being written: 11)
Total blocks (validated):    69916 (avg. block size 6517260 B) (Total open file blocks (not validated): 10)
Minimally replicated blocks:    69916 (100.0 %)
Over-replicated blocks:    0 (0.0 %)
Under-replicated blocks:    87 (0.12443504 %)
Mis-replicated blocks:    0 (0.0 %)
Default replication factor:    3
Average block replication:    3.0011585
Corrupt blocks:    0
Missing replicas:    522 (0.24815665 %)
Number of data-nodes:    4
Number of racks:    1
FSCK ended at Thu Nov 24 16:08:12 CST 2016 in 2044 milliseconds

The filesystem under path '/' is HEALTHY

The HEALTHY status above indicates that the HDFS file system is normal, with no corrupt blocks or data loss.
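
If fsck instead reports corrupt or missing blocks, the affected files can be listed for further handling; for example:

    # List files that currently have corrupt blocks
    hdfs fsck / -list-corruptfileblocks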