Hive Development Guide

Hive is a data warehouse component of the Hadoop ecosystem. It makes it easy to store, query, and analyze data stored in HDFS or HBase by converting SQL statements into MapReduce jobs, which suits complex analysis of massive data sets. It also provides a set of tools for extracting, transforming, and loading (ETL) data.

1. Hive CLI

The Hive CLI is a client provided by the Hive service for managing Hive tables. The basic operations are as follows:

  • Open the Hive CLI

On any UHadoop master node, or on a client node where Hive has been installed and deployed, execute the hive command:

[root@uhadoop-******-master1 ~]# hive
  • Create a Hive table
  hive> create table test_hive (id int, name string);
  • Insert data
  hive> insert into test_hive values (1,'test_ul'),(2,'test_hive');
  • Read data
  hive> select * from test_hive;
  • Count the number of rows
  hive> select count(*) from test_hive;
  • Execute a SQL statement directly from the command line
  hive -e "select * from test_hive"

2. Beeline

Hive provides a service called HiveServer2 (installed on the master2 node by default in UHadoop and listening on port 10000) that can be accessed via JDBC.

With the Beeline client you can connect remotely to the HiveServer2 service and operate on Hive data.

  • Start the Beeline client
[root@uhadoop-******-master1 ~]# beeline
  • Connect to HiveServer2
beeline> !connect jdbc:hive2://uhadoop-******-master2:10000/default;

Notes:

  1. The user name and password can be left blank by default
  2. uhadoop-******-master2 should be replaced with the host name or IP of your cluster's master2 node
  • Data operations

The same operations as in the Hive CLI are supported.

  • Submit a SQL statement directly from the command line:
beeline -u jdbc:hive2://uhadoop-******-master2:10000 -e "select * from test_hive"

3. Hive Application Development

3.1 Connecting to HiveServer2 Using Java (implementing table creation, data loading, and data display)

This example requires you to first log in to the master1 node of the UHadoop cluster; the following operations are performed on the master1 node by default.

org.apache.hive.jdbc.HiveDriver is the JDBC driver class for HiveServer2, and the HiveServer2 connection URL is "jdbc:hive2://ip:10000/default".

  • Write sample code

The sample code Hive2JdbcClient.java is as follows:

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
 
public class Hive2JdbcClient {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";
 
    /**
     * @param args
     * @throws SQLException
     */
    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
            System.exit(1);
        }
        //replace "hive" here with the name of the user the queries should run as
        Connection con = DriverManager.getConnection("jdbc:hive2://uhadoop-******-master2:10000/default", "", "");
        Statement stmt = con.createStatement();
        String tableName = "testHive2DriverTable";
        stmt.execute("drop table if exists " + tableName);
        stmt.execute("create table " + tableName + " (key int, value string)");
        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }
        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }
 
        // load data into table
        // NOTE: "load data inpath" expects an HDFS path;
        // test.sh (below) uploads /tmp/b.txt, a Ctrl-A separated file with two fields per line, to this location
        String filepath = "/user/hive/warehouse/b.txt";
        sql = "load data inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        stmt.execute(sql);
 
        // select * query
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
        }
 
        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1));
        }
    }
}

Notes:

  1. In Connection con = DriverManager.getConnection("jdbc:hive2://uhadoop-******-master2:10000/default", "", ""); the user name and password can be left blank by default
  2. uhadoop-******-master2 should be replaced with the host name or IP of your cluster's master2 node
  • Compile
  javac Hive2JdbcClient.java
  • Run the program

The test.sh script is as follows:

#!/bin/bash
# prepare test data: a Ctrl-A separated file with two fields per line
hdfs dfs -rm -f /user/hive/warehouse/b.txt
echo -e '1\x01foo' > /tmp/b.txt
echo -e '2\x01bar' >> /tmp/b.txt
hdfs dfs -put /tmp/b.txt /user/hive/warehouse/

# build the classpath from the Hadoop and Hive installation directories
HADOOP_HOME=/home/hadoop
HIVE_HOME=${HIVE_HOME:-/home/hadoop/hive}   # adjust if Hive is installed elsewhere
CLASSPATH=.:$HIVE_HOME/conf
for i in ${HADOOP_HOME}/share/hadoop/mapreduce/lib/hadoop-*.jar ; do
    CLASSPATH=$CLASSPATH:$i
done
for i in ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-*.jar ; do
    CLASSPATH=$CLASSPATH:$i
done
for i in ${HADOOP_HOME}/share/hadoop/common/lib/hadoop-*.jar ; do
    CLASSPATH=$CLASSPATH:$i
done
for i in ${HADOOP_HOME}/share/hadoop/common/hadoop-*.jar ; do
    CLASSPATH=$CLASSPATH:$i
done

for i in ${HIVE_HOME}/lib/*.jar ; do
    CLASSPATH=$CLASSPATH:$i
done
java -cp $CLASSPATH Hive2JdbcClient
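
Executing sh test.sh uploads the test data to HDFS and runs the compiled sample program.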

3.2 Connecting to HiveServer2 Using Python (implementing table creation, data loading, data display)

The process of using the pyhs2 Python client with HiveServer2 is as follows.

If the installation below fails, you may need to manually download the setuptools-0.6c11.tar.gz package and install it first.

  • Compile and install pyhs2

Enter the pyhs2 directory and install

cd pyhs2
python setup.py build
python setup.py install
  • Write sample code

The sample code below corresponds to example.py in the pyhs2 source directory:

import pyhs2
 
with pyhs2.connect(host='uhadoop-******-master2',
                   port=10000,
                   authMechanism="PLAIN",
                   user='root',
                   password='test',
                   database='default') as conn:
    with conn.cursor() as cur:
        #Show databases
        print cur.getDatabases()
 
        #Execute query
        cur.execute("select * from test_hive")
 
        #Return column info from query
        print cur.getSchema()
 
        #Fetch table results
        for i in cur.fetch():
            print i

uhadoop-******-master2 should be replaced with the host name or IP of your cluster's master2 node; test_hive should be replaced with an existing table name in Hive, and the user and password should be adjusted to match your environment.

3.3 Reading HBase Data Through a Hive External Table

By creating an HBase external table in Hive, you can use simple SQL statements to analyze unstructured data stored in HBase.

  • Open the HBase shell and create table t1
create 't1',{NAME => 'f1',VERSIONS => 2}
put 't1','rowkey001','f1:col1','value01' 
put 't1','rowkey001','f1:col2','value02' 
put 't1','rowkey001','f1:colf','value03' 
scan 't1' 

The resulting t1 table has the following structure:

hbase(main):013:0> scan 't1'
ROW                                            COLUMN+CELL
 rowkey001                                     column=f1:col1, timestamp=1481075364575, value=value01
 rowkey001                                     column=f1:col2, timestamp=1481075364607, value=value02
 rowkey001                                     column=f1:colf, timestamp=1481075364641, value=value03
  • Open the Hive CLI and create an external table
hive> CREATE EXTERNAL TABLE t_hive_hbase(
    > rowkey string,
    > cf map<STRING,STRING>
    > )
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:")
    > TBLPROPERTIES ("hbase.table.name" = "t1");
  • Use a SQL statement to read the HBase data; the result is as follows
hive> select * from  t_hive_hbase;
OK
rowkey001   {"col1":"value01","col2":"value02","colf":"value03"}
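
Because the HBase column family f1 is mapped to the cf map column, individual HBase columns can also be read by key, for example (using the table created above):

hive> select rowkey, cf['col1'] from t_hive_hbase;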

4. Common Enhanced Features

4.1 Support for custom UDF functions

The following describes two ways to write a UDF: the first uses Eclipse, and the second compiles the code directly on the UHadoop cluster's master node with a shell script.

Writing a custom Hive UDF in Eclipse

1. Create a new Java project in Eclipse: hiveudf

2. Add a lib folder to the hiveudf project and copy the required jar packages into it.

The following four jars can be found at these locations on uhadoop-******-master1:

/home/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.6.0-cdh5.4.9.jar
/home/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.6.0-cdh5.4.9.jar
/home/hadoop/share/hadoop/common/hadoop-common-2.6.0-cdh5.4.9.jar
/home/hadoop/hive/lib/hive-exec-1.2.1.jar

3. Add all jars in the lib directory to the build path.

4. Create a UDFLower class that extends Hive's UDF class (leave the package of the new UDFLower class blank):

import org.apache.hadoop.hive.ql.exec.UDF;  
import org.apache.hadoop.io.Text;  
 
public class UDFLower extends UDF{  
     public Text evaluate(final Text s){  
          if (null == s){  
              return null;  
          }  
          return new Text(s.toString().toLowerCase());  
      }  
} 

5. Package UDFLower.jar

  • In Eclipse, execute Export
  • Choose JAR file
  • Modify the export directory and file name and click Finish; UDFLower.jar is ready
  • Upload the jar to the uhadoop master1 node (for example with scp)

If master1 is already bound to an external network IP, the jar can be uploaded directly via that IP. If not, transfer it through a machine that has an external IP.

Writing a custom Hive UDF on Linux

  • Create the code

Log in to the uhadoop-******-master1 node, go to the /data directory, and create the UDF source file:

  cd /data
  mkdir hive_udf
  cd hive_udf
  touch UDFLower.java

The UDFLower.java code is as follows:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
 
public class UDFLower extends UDF{
     public Text evaluate(final Text s){
          if (null == s){
              return null;
          }
          return new Text(s.toString().toLowerCase());
      }
}
  • Create a compile.sh file in the same directory:
#!/bin/bash

rm -f UDFLower.jar
rm -f UDFLower*.class
export HADOOP_MAPRED_HOME=/home/hadoop
CLASSPATH=.:$HIVE_HOME/conf

for i in /home/hadoop/share/hadoop/common/*.jar ; do
    CLASSPATH=$CLASSPATH:$i
done

for i in /home/hadoop/hive/lib/*.jar ; do
    CLASSPATH=$CLASSPATH:$i
done
javac -cp $CLASSPATH UDFLower.java
jar cvf UDFLower.jar UDFLower.class

Executing sh compile.sh generates UDFLower.jar in the same directory.

Using the generated UDFLower.jar in Hive

  • Enter the directory where UDFLower.jar is located and upload it to HDFS:

    hadoop fs -put UDFLower.jar /tmp/

  • Data preparation

The test file test.txt has the following content:

HELLO WORLD HEHE

Upload to HDFS

hadoop fs -put test.txt /tmp/
  • Create the data table in the Hive CLI and load the data
hive> create table test_hive_udf (name string);
hive> load data inpath '/tmp/test.txt' into table test_hive_udf;
  • Create a temporary function (if Hive reports that the UDFLower class cannot be found, you may first need to register the jar in the current session, e.g. add jar hdfs:///tmp/UDFLower.jar;)
hive> create temporary function my_lower as 'UDFLower';
  • Use the custom UDF
hive> select name from test_hive_udf;
hive> select my_lower(name) from test_hive_udf;
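
Since UDFLower.jar has already been uploaded to HDFS, the function can also be registered as a permanent function (visible across sessions) instead of a temporary one. A minimal sketch, assuming Hive 0.13 or later (the function name my_lower_p is illustrative):

hive> create function my_lower_p as 'UDFLower' using jar 'hdfs:///tmp/UDFLower.jar';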

4.2 Support for JSON-format data

  • Data preparation

Prepare the test.json file and upload it to HDFS.

The test.json data is as follows:

{"id":1234544353,"text":"test_hive_json"}

Upload data to HDFS

hdfs dfs -put test.json /tmp/test.json
  • Load the dependency package

Hive depends on hive-json-serde.jar to parse JSON-format data. If you use Beeline, you do not need to load the jar with add jar; if you are using the Hive CLI, you need to add the jar as follows:

hive> add jar $HIVE_HOME/lib/hive-json-serde.jar;

Note: Hive 1.2.1 does not ship with this package.

  • Create a hive table
hive> CREATE TABLE test_json(id BIGINT, text STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' STORED AS TEXTFILE;
  • Load data
hive> LOAD DATA  INPATH "/tmp/test.json" OVERWRITE INTO TABLE test_json;
  • Run query

If the following is returned, the JSON file was parsed successfully:

hive> select * from test_json;
OK
1234544353  test_hive_json
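
If the hive-json-serde.jar package is not available (as noted above for Hive 1.2.1), one possible workaround is to load each JSON record as a plain string and parse it with Hive's built-in get_json_object function. A minimal sketch, assuming /tmp/test.json has been uploaded to HDFS as in the data-preparation step (table and column names are illustrative):

hive> CREATE TABLE test_json_raw(line STRING) STORED AS TEXTFILE;
hive> LOAD DATA INPATH "/tmp/test.json" INTO TABLE test_json_raw;
hive> SELECT get_json_object(line, '$.id'), get_json_object(line, '$.text') FROM test_json_raw;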

4.3 Using regular expression matching

  • Data preparation

Prepare the nginx_log test data and upload it to HDFS.

The nginx_log data is as follows:

180.173.250.74 168.34.34

Upload to HDFS

hdfs dfs -put nginx_log /tmp/nginx_log

  • Load the dependency package

Hive's regular expression matching depends on hive-contrib.jar. If you use Beeline, you do not need to load the jar with add jar; if you are using the Hive CLI, you need to add the jar as follows:

hive> add jar $HIVE_HOME/lib/hive-contrib.jar;
  • Create a test table
hive> CREATE TABLE logs(
       host STRING,
       identity STRING)
         ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
         WITH SERDEPROPERTIES (
         "input.regex" = "([^ ]*) ([^ ]*)",
         "output.format.string" = "%1$s %2$s"
)STORED AS TEXTFILE;

Note: When creating the table, you need to specify ROW FORMAT SERDE as well as the SERDEPROPERTIES input.regex and output.format.string.

  • Load data
hive> LOAD DATA INPATH "/tmp/nginx_log" INTO TABLE logs;
  • Query the data

If the following is returned, the regular expression was applied successfully:

hive> select * from logs;
OK
180.173.250.74  168.34.34