Hive Development Guide
Hive is a data warehouse component of the Hadoop ecosystem. It makes it easy to store, query, and analyze data held in HDFS or HBase by converting SQL statements into MapReduce jobs, which suits it to analysis of very large data sets. It also provides a set of tools for extracting, transforming, and loading (ETL) data.
1. Hive CLI
The Hive CLI is a client provided by the Hive service for managing Hive tables. The basic operations are as follows:
- Open the Hive CLI
On any UHadoop master node, or on a client node where Hive is installed and deployed, execute hive:
[root@uhadoop-******-master1 ~]#hive
- Create a Hive table
hive> create table test_hive (id int, name string);
- Insert data
hive> insert into test_hive values (1,'test_ul'),(2,'test_hive');
- Read data
hive> select * from test_hive;
- Count the number of rows
hive> select count(*) from test_hive;
- Directly execute SQL commands from the command line
hive -e "select * from test_hive"
2. Beeline
Hive provides a service called HiveServer2 (installed on the master2 node by default in UHadoop and listening on port 10000) that can be accessed via JDBC.
With the beeline client, you can connect remotely to the HiveServer2 service and operate on Hive data.
- Start beeline client
[root@uhadoop-******-master1 ~]# beeline
- Connect to HiveServer2
beeline> !connect jdbc:hive2://uhadoop-******-master2:10000/default;
Note:
- The username and password can be left blank by default
- uhadoop-******-master2 should be replaced with the host name or IP of your cluster's master2 node
- Data operations
These are the same as in the Hive CLI
- Execute SQL commands directly from the command line:
beeline -u "jdbc:hive2://uhadoop-******-master2:10000" -e "select * from test_hive"
3. Hive Application Development
3.1 Connecting to HiveServer2 Using Java (implementing table creation, data loading, and data display)
This example requires you to first log in to the UHadoop cluster's master1 node; the following operations are performed on the master1 node by default.
org.apache.hive.jdbc.HiveDriver is the JDBC driver class for HiveServer2, and the HiveServer2 connection URL has the form "jdbc:hive2://ip:10000/default".
- Write sample code
The sample code Hive2JdbcClient.java is as follows:
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class Hive2JdbcClient {
private static String driverName = "org.apache.hive.jdbc.HiveDriver";
/**
* @param args
* @throws SQLException
*/
public static void main(String[] args) throws SQLException {
try {
Class.forName(driverName);
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
System.exit(1);
}
//replace "hive" here with the name of the user the queries should run as
Connection con = DriverManager.getConnection("jdbc:hive2://uhadoop-******-master2:10000/default", "", "");
Statement stmt = con.createStatement();
String tableName = "testHive2DriverTable";
stmt.execute("drop table if exists " + tableName);
stmt.execute("create table " + tableName + " (key int, value string)");
// show tables
String sql = "show tables '" + tableName + "'";
System.out.println("Running: " + sql);
ResultSet res = stmt.executeQuery(sql);
if (res.next()) {
System.out.println(res.getString(1));
}
// describe table
sql = "describe " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1) + "\t" + res.getString(2));
}
// load data into table
// NOTE: "load data inpath" takes an HDFS path (use "load data local inpath" for a file local to the Hive server)
// NOTE: /user/hive/warehouse/b.txt is a ctrl-A separated file with two fields per line, uploaded by test.sh below
String filepath = "/user/hive/warehouse/b.txt";
sql = "load data inpath '" + filepath + "' into table " + tableName;
System.out.println("Running: " + sql);
stmt.execute(sql);
// select * query
sql = "select * from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
}
// regular hive query
sql = "select count(1) from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1));
}
// release JDBC resources
res.close();
stmt.close();
con.close();
}
}
Note:
- Connection con = DriverManager.getConnection("jdbc:hive2://uhadoop-******-master2:10000/default", "", "");
- uhadoop-******-master2 should be replaced with the host name or IP of your cluster's master2 node
- Compile
javac Hive2JdbcClient.java
- Run the program
The test.sh script below prepares the input file on HDFS, builds the classpath, and runs the example:
#!/bin/bash
hdfs dfs -rm /user/hive/warehouse/b.txt
echo -e '1\x01foo' > /tmp/b.txt
echo -e '2\x01bar' >> /tmp/b.txt
hdfs dfs -put /tmp/b.txt /user/hive/warehouse/
HADOOP_HOME=/home/hadoop
HIVE_HOME=/home/hadoop/hive   # adjust if Hive is installed elsewhere
CLASSPATH=.:$HIVE_HOME/conf
for i in ${HADOOP_HOME}/share/hadoop/mapreduce/lib/hadoop-*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
for i in ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
for i in ${HADOOP_HOME}/share/hadoop/common/lib/hadoop-*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
for i in ${HADOOP_HOME}/share/hadoop/common/hadoop-*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
for i in ${HIVE_HOME}/lib/*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
java -cp $CLASSPATH Hive2JdbcClient
3.2 Connecting to HiveServer2 Using Python (implementing table creation, data loading, and data display)
The process of using the Python client with HiveServer2 is as follows:
- Download pyhs2
git clone https://github.com/BradRuderman/pyhs2.git
- Install dependencies
yum install gcc-c++ cyrus-sasl-* python-devel
- Install setuptools
wget -q http://peak.telecommunity.com/dist/ez_setup.py
python ez_setup.py
If the above installation fails, manually download the setuptools-0.6c11.tar.gz package and install it.
- Compile and install pyhs2
Enter the pyhs2 directory and install
cd pyhs2
python setup.py build
python setup.py install
- Write sample code
The sample code below is example.py from the pyhs2 repository:
import pyhs2
with pyhs2.connect(host='uhadoop-******-master2',
                   port=10000,
                   authMechanism="PLAIN",
                   user='root',
                   password='test',
                   database='default') as conn:
    with conn.cursor() as cur:
        # Show databases
        print cur.getDatabases()
        # Execute query
        cur.execute("select * from test_hive")
        # Return column info from query
        print cur.getSchema()
        # Fetch table results
        for i in cur.fetch():
            print i
Note: uhadoop-******-master2 should be replaced with the host name or IP of your cluster's master2 node, and test_hive should be replaced with an existing Hive table name.
3.3 Reading HBase Data Through a Hive External Table
By creating an HBase external table in Hive, you can use simple SQL statements to analyze unstructured data stored in HBase. The same table can also be queried from an application over JDBC, as sketched at the end of this section.
- Open the HBase shell, create table t1, and insert some data
create 't1',{NAME => 'f1',VERSIONS => 2}
put 't1','rowkey001','f1:col1','value01'
put 't1','rowkey001','f1:col2','value02'
put 't1','rowkey001','f1:colf','value03'
scan 't1'
The contents of table t1 are now as follows:
hbase(main):013:0> scan 't1'
ROW COLUMN+CELL
rowkey001 column=f1:col1, timestamp=1481075364575, value=value01
rowkey001 column=f1:col2, timestamp=1481075364607, value=value02
rowkey001 column=f1:colf, timestamp=1481075364641, value=value03
- Open the Hive CLI and create an external table
hive> CREATE EXTERNAL TABLE t_hive_hbase(
> rowkey string,
> cf map<STRING,STRING>
> )
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:")
> TBLPROPERTIES ("hbase.table.name" = "t1");
- Use a SQL statement to read the HBase data; the result is as follows:
hive> select * from t_hive_hbase;
OK
rowkey001 {"col1":"value01","col2":"value02","colf":"value03"}
4. Common Enhanced Features
4.1 Support for Custom UDF Functions
Below are two methods for writing UDF functions: the first uses Eclipse, and the second compiles the code directly on the UHadoop cluster's master node with a shell script.
Writing a custom Hive UDF function in Eclipse
1 Create a new Java project in Eclipse: hiveudf
2 Add a folder named lib to the hiveudf project and copy the required jar packages into it.
These four jars can be found at the following locations on uhadoop-******-master1:
/home/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.6.0-cdh5.4.9.jar
/home/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.6.0-cdh5.4.9.jar
/home/hadoop/share/hadoop/common/hadoop-common-2.6.0-cdh5.4.9.jar
/home/hadoop/hive/lib/hive-exec-1.2.1.jar
3 Add all jars in the lib directory to the build path
4 Create a UDFLower class that extends Hive's UDF class; leave the package of the new UDFLower class blank
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class UDFLower extends UDF{
public Text evaluate(final Text s){
if (null == s){
return null;
}
return new Text(s.toString().toLowerCase());
}
}
5 Package UDFLower.jar
- Execute Export
- Choose JAR file
- Modify the export directory name and click Finish; UDFLower.jar is ready
- Upload it to the uhadoop master1 node over SSH (for example with scp)
If master1 is already bound to an external network IP, the jar can be uploaded directly via that IP. If not, transfer it through a machine that has an external IP.
Writing a custom Hive UDF function on Linux
- Create the code
Log in to the uhadoop-******-master1 node, go to the /data directory, and create the UDF source file:
cd /data
mkdir hive_udf
cd hive_udf
touch UDFLower.java
The UDFLower.java code is as follows:
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class UDFLower extends UDF{
public Text evaluate(final Text s){
if (null == s){
return null;
}
return new Text(s.toString().toLowerCase());
}
}
- Create a compile.sh file in the same directory:
#!/bin/bash
rm -f UDFLower.jar
rm -f UDFLower*.class
export HADOOP_MAPRED_HOME=/home/hadoop
HIVE_HOME=/home/hadoop/hive   # adjust if Hive is installed elsewhere
CLASSPATH=.:$HIVE_HOME/conf
for i in /home/hadoop/share/hadoop/common/*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
for i in /home/hadoop/hive/lib/*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
javac -cp $CLASSPATH UDFLower.java
jar cvf UDFLower.jar UDFLower.class
Executing sh compile.sh generates UDFLower.jar in the same directory.
Using the generated UDFLower.jar in Hive
- Enter the directory where UDFLower.jar is located and upload it to HDFS:
hadoop fs -put UDFLower.jar /tmp/
- Prepare the data
The test file test.txt has the following content:
HELLO WORLD HEHE
Upload it to HDFS:
hadoop fs -put test.txt /tmp/
- Create the relevant data table in the Hive CLI and load the data
hive> create table test_hive_udf (name string);
hive> load data inpath '/tmp/test.txt' into table test_hive_udf;
- Create a temporary function
hive> create temporary function my_lower as 'UDFLower';
- Use the custom UDF (a JDBC-based variant is sketched after these commands)
hive> select name from test_hive_udf;
hive> select my_lower(name) from test_hive_udf;
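When the UDF is called from an application through HiveServer2 rather than the Hive CLI, one option is to register it as a permanent function that references the jar already uploaded to HDFS (CREATE FUNCTION ... USING JAR, available in Hive 0.13 and later). The sketch below is a minimal JDBC example under that assumption; the function name my_lower2 and the class name UdfFromJdbc are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class UdfFromJdbc {
    public static void main(String[] args) throws Exception {
        // same HiveServer2 driver and URL as in section 3.1
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://uhadoop-******-master2:10000/default", "", "");
        Statement stmt = con.createStatement();
        // register a permanent function backed by the jar uploaded to HDFS above
        stmt.execute("drop function if exists my_lower2");
        stmt.execute("create function my_lower2 as 'UDFLower' "
                + "using jar 'hdfs:///tmp/UDFLower.jar'");
        // apply the UDF to the test table
        ResultSet res = stmt.executeQuery("select my_lower2(name) from test_hive_udf");
        while (res.next()) {
            System.out.println(res.getString(1));
        }
        res.close();
        stmt.close();
        con.close();
    }
}
Drop the function with drop function my_lower2 once it is no longer needed.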
4.2 Support for JSON-Format Data
- Data preparation
Upload the test.json data to HDFS.
The test.json data is as follows:
{"id":1234544353,"text":"test_hive_json"}
Upload data to HDFS
hdfs dfs -put test.json /tmp/test.json
- Load the dependency jar
Hive depends on hive-json-serde.jar to parse JSON-format data. If you use beeline, you do not need to load the jar with add jar; if you are using the Hive CLI, add the jar as follows:
hive> add jar $HIVE_HOME/lib/hive-json-serde.jar;
Note that Hive 1.2.1 does not ship with this jar; if it is missing on your cluster, a built-in alternative that needs no extra jar is sketched at the end of this section.
- Create a Hive table
hive> CREATE TABLE test_json(id BIGINT, text STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' STORED AS TEXTFILE;
- Load data
hive> LOAD DATA INPATH "/tmp/test.json" OVERWRITE INTO TABLE test_json;
- Run query
If the following is returned, the JSON file was parsed successfully:
hive> select * from test_json;
OK
1234544353 test_hive_json
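If hive-json-serde.jar is not available on your cluster (see the note above), the built-in get_json_object function can parse the same records without any extra jar. The sketch below is a minimal JDBC example under that assumption; it expects the raw JSON file to be present at /tmp/test.json (re-upload it if it was already consumed by the LOAD above), and the table name test_json_raw and class name HiveJsonBuiltin are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJsonBuiltin {
    public static void main(String[] args) throws Exception {
        // same HiveServer2 driver and URL as in section 3.1
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://uhadoop-******-master2:10000/default", "", "");
        Statement stmt = con.createStatement();
        // store each JSON record as a plain string line
        stmt.execute("drop table if exists test_json_raw");
        stmt.execute("create table test_json_raw (line string) stored as textfile");
        stmt.execute("load data inpath '/tmp/test.json' into table test_json_raw");
        // extract fields with the built-in get_json_object function
        ResultSet res = stmt.executeQuery(
                "select get_json_object(line, '$.id'), get_json_object(line, '$.text') from test_json_raw");
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }
        res.close();
        stmt.close();
        con.close();
    }
}
get_json_object returns the extracted values as strings, so cast them if numeric types are required.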
4.3 Using Regular-Expression Matching
- Data preparation
Upload the test data nginx_log to HDFS.
The nginx_log data is as follows:
180.173.250.74 168.34.34
Upload to HDFS
hdfs dfs -put nginx_log /tmp/nginx_log
- Load the dependency jar
Hive's regular-expression matching depends on hive-contrib.jar. If you use beeline, you do not need to load the jar with add jar; if you are using the Hive CLI, add the jar as follows:
hive> add jar $HIVE_HOME/lib/hive-contrib.jar;
- Create a test table
hive> CREATE TABLE logs(
host STRING,
identity STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*)",
"output.format.string" = "%1$s %2$s"
)STORED AS TEXTFILE;
Note: When creating the table, you need to specify ROW FORMAT SERDE together with the input.regex and output.format.string SERDEPROPERTIES.
- Load data
hive> LOAD DATA INPATH "/tmp/nginx_log" INTO TABLE logs;
- Query the data
If the following is returned, the regular expression was applied successfully:
hive> select * from logs;
OK
180.173.250.74 168.34.34