
Hadoop Developer’s Guide

Note: The scripts in this example are intended to run on CentOS. For other operating systems, adjust the scripts before executing them.

1. Create Hadoop Client Node

UHadoop offers two access modes: client node and SSH. The client node mode is recommended. For details, see Cluster Access.

2. HDFS

HDFS is a highly fault-tolerant, high-throughput distributed file system. It is designed to be scalable and easy to use, and is well suited to storing massive files.

2.1 Basic HDFS Operations

  • Query Files
    Usage: hadoop fs [generic options] -ls [-d] [-h] [-R] [<path>]
    
  • Upload Files
    Usage: hadoop fs [generic options] -put [-f] [-p] [-l]
    <localsrc> ... <dst>
    
  • Download Files
    Usage: hadoop fs [generic options] -get [-p] [-ignoreCrc] [-crc]
    <src> ... <localdst>
    

For more details, refer to: hadoop fs -help
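
For example, a typical round trip with these commands (data.txt is a hypothetical local file used only for illustration):

  # Create a local file and upload it to /tmp on HDFS
  echo "hello" > data.txt
  hadoop fs -put data.txt /tmp/
  # List the uploaded file to confirm it arrived
  hadoop fs -ls /tmp/data.txt
  # Download it back under a different local name
  hadoop fs -get /tmp/data.txt data_copy.txt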

2.2 WebHDFS

WebHDFS provides a RESTful interface to HDFS that can be used to operate on HDFS files. When using WebHDFS, the client first contacts the NameNode to obtain the address of the DataNode where the file resides, and then exchanges data directly with that DataNode.

2.2.1 Upload File

A UHadoop cluster is configured with two Master nodes by default. At any given moment only one NameNode is in the Active state; the other is in Standby. The example below assumes the NameNode on uhadoop-******-master1 is Active.

  • Data Preparation

      touch uhadoop.txt
      echo "uhadoop" > uhadoop.txt
    
  • Create File Request

      curl -i -X PUT "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=CREATE"
    

    Note:

    1. Add the host entries of all cluster nodes to the machine executing this command (for example, in /etc/hosts).
    2. If the response says Operation category READ is not supported in state standby, retry with uhadoop-******-master2.

    The command above returns a Location header, which is the DataNode address for the file:

    HTTP/1.1 307 TEMPORARY_REDIRECT
    Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
    Content-Length: 0
    
  • Upload File Using the Above Location Address

      curl -i -X PUT -T uhadoop.txt "http://uhadoop-******-core*:50075/webhdfs/v1/tmp/uhadoop.txt?op=CREATE&namenoderpcaddress=Ucluster&overwrite=false"
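
    The two steps can also be combined in a small shell script; a minimal sketch, assuming the Active NameNode is uhadoop-******-master1 and the DataNode named in the redirect is resolvable from this machine:

      # Step 1: ask the NameNode where to write and capture the Location header of the 307 redirect
      LOCATION=$(curl -s -i -X PUT "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=CREATE" \
        | awk '/^Location:/ {print $2}' | tr -d '\r')
      # Step 2: upload the file body to the DataNode address returned above
      curl -i -X PUT -T uhadoop.txt "$LOCATION"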
    
2.2.2 Append File
  • Data Preparation

      touch append_uhadoop.txt
      echo "test_content" > append_uhadoop.txt
    
  • Get the Address of the File to be Appended

      curl -i -X POST "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=APPEND"
    

    The command above returns a Location header, which is the DataNode address for the file:

    HTTP/1.1 307 TEMPORARY_REDIRECT
    Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
    Content-Length: 0
    
  • Append File

      curl -i -X POST -T append_uhadoop.txt "http://uhadoop-******-core*:50075/webhdfs/v1/tmp/uhadoop.txt?op=APPEND&namenoderpcaddress=Ucluster"
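
    As with CREATE, the redirect target can be captured instead of copied by hand; a minimal sketch under the same assumptions as above:

      LOCATION=$(curl -s -i -X POST "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=APPEND" \
        | awk '/^Location:/ {print $2}' | tr -d '\r')
      curl -i -X POST -T append_uhadoop.txt "$LOCATION"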
    
2.2.3 Open and Read Files
  curl -i -L "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=OPEN"
2.2.4 Delete Files
  curl -i -X DELETE "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=DELETE"
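
WebHDFS DELETE also accepts a recursive parameter for removing a directory and its contents; for example (/tmp/mydir is a hypothetical path used only for illustration):

  curl -i -X DELETE "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/mydir?op=DELETE&recursive=true"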

2.3 HttpFS

HttpFS is an HTTP interface to HDFS provided by Cloudera; it allows reading and writing HDFS through the WebHDFS REST API. The difference from WebHDFS is that HttpFS does not require clients to reach every node of the cluster: access to the single machine running the HttpFS service is enough (UHadoop starts HttpFS on master1:14000 by default). Because HttpFS is a web application running in embedded Tomcat, its performance is somewhat constrained.

2.3.1 Upload File
  • Data Preparation

      touch httpfs_uhadoop.txt
      echo "httpfs_uhadoop" > httpfs_uhadoop.txt
    
  • Upload Data

      curl -i -X PUT -T httpfs_uhadoop.txt --header "Content-Type: application/octet-stream" "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=CREATE&user.name=root&data=true"
    

    Note:

    1. Add the host entry of master1 to the machine executing this command.
    2. The user.name parameter must be included in the URL; otherwise the request fails with "HTTP Status 401 - Authentication required".
2.3.2 Append File
  • Data Preparation

      touch append_httpfs.txt
      echo "append_httpfs" > append_httpfs.txt
    
  • Append File

      curl -i -X POST -T append_httpfs.txt --header "Content-Type: application/octet-stream" "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=APPEND&user.name=root&data=true"
    
2.3.3 Open and Read File
  curl -i -L "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=OPEN&user.name=root"
2.3.4 Delete File
  curl -i -X DELETE "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=DELETE&user.name=root"
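
Other WebHDFS operations work through HttpFS in the same way; for example, listing a directory with op=LISTSTATUS (the same user.name requirement applies):

  curl -i "http://uhadoop-******-master1:14000/webhdfs/v1/tmp?op=LISTSTATUS&user.name=root"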

2.4 MapReduce Job

Taking terasort as an example, this section demonstrates how to submit a MapReduce job.

  • Generate the official terasort input dataset (the first argument is the number of 100-byte rows to write)

      hadoop jar /home/hadoop/hadoop-examples.jar teragen 100 /tmp/terasort_input
    
  • Submit Task

      hadoop jar /home/hadoop/hadoop-examples.jar terasort /tmp/terasort_input /tmp/terasort_output
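
  • Validate Output

    The examples jar also includes a teravalidate job that checks whether the sorted output is in order; a minimal sketch, assuming the same jar path and directories as above (/tmp/terasort_validate is a hypothetical report directory):

      # List the sorted output produced by terasort
      hadoop fs -ls /tmp/terasort_output
      # Validate global sort order; the report is written to /tmp/terasort_validate
      hadoop jar /home/hadoop/hadoop-examples.jar teravalidate /tmp/terasort_output /tmp/terasort_validate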
    

2.5 HDFS Daily Operations

2.5.1 Restart Service

Restart Namenode: service hadoop-hdfs-namenode restart

Restart Datanode: service hadoop-hdfs-datanode restart

Restart ResourceManager: service hadoop-yarn-resourcemanager restart

Restart NodeManager: service hadoop-yarn-nodemanager restart

Restart the entire Hadoop service: use the cluster service management page in the console.
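
After a restart, the same init scripts can be used to confirm that a service came back up, assuming they support the standard status action; for example:

  service hadoop-hdfs-namenode status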

2.5.2 Check HDFS status and node information
  hdfs dfsadmin -report
2.5.3 Modify the Number of Replicas of HDFS Files
  hdfs dfs -setrep -R [replication-factor] [targetDir]

Example: set the replication factor of all files under the HDFS root directory to 2:

  hdfs dfs -setrep -R 2 /
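
The new replication factor can be verified per file with the stat command, where %r prints a file's replication factor (using the /tmp/uhadoop.txt file created earlier as an example):

  hadoop fs -stat "%r %n" /tmp/uhadoop.txt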

2.5.4 View HDFS File System Status
  hadoop fsck /

The output looks like the following:

 Total size:    455660769497 B (Total open files size: 44723814 B)
 Total dirs:    47975
 Total files:   70456
 Total symlinks:        0 (Files currently being written: 11)
 Total blocks (validated):  69916 (avg. block size 6517260 B) (Total open file blocks (not validated): 10)
 Minimally replicated blocks:   69916 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:   87 (0.12443504 %)
 Mis-replicated blocks:     0 (0.0 %)
 Default replication factor:    3
 Average block replication: 3.0011585
 Corrupt blocks:        0
 Missing replicas:      522 (0.24815665 %)
 Number of data-nodes:      4
 Number of racks:       1
FSCK ended at Thu Nov 24 16:08:12 CST 2016 in 2044 milliseconds

The filesystem under path '/' is HEALTHY

HEALTHY above indicates that the HDFS file system is in a normal state, with no corrupt blocks and no data loss.
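
If fsck instead reports CORRUPT, the affected files can be listed and then repaired or removed:

  # List files that have corrupt blocks
  hdfs fsck / -list-corruptfileblocks
  # Delete the corrupted files (use with care: data is removed permanently)
  hdfs fsck / -delete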