Hive Development Guide
Hive is a data warehouse component of the Hadoop ecosystem. It makes it easy to store, query, and analyze data held in HDFS or HBase by converting SQL statements into MapReduce jobs, which suits it to analysis of very large data sets. It also provides a set of tools for extracting, transforming, and loading (ETL) data.
1. Hive CLI
The Hive CLI is a client provided by the Hive service for managing Hive tables. The basic operations are as follows:
- Open the Hive CLI
On any UHadoop master node, or on a client node where Hive is installed and deployed, execute hive:
[root@uhadoop-******-master1 ~]#hive
- Create a Hive table
hive> create table test_hive (id int, name string);
- Insert data
hive> insert into test_hive values (1,'test_ul'),(2,'test_hive');
- Read data
hive> select * from test_hive;
- Count the number of rows
hive> select count(*) from test_hive;
- Directly execute SQL commands from the command line
hive -e "select * from test_hive"
2. Beeline
Hive provides a service called HiveServer2 (installed on the master2 node by default in UHadoop and listening on port 10000) that can be accessed via JDBC.
With the beeline client, you can connect remotely to the HiveServer2 service and operate on Hive data.
- Start beeline client
[root@uhadoop-******-master1 ~]# beeline
- Connect to HiveServer2
beeline> !connect jdbc:hive2://uhadoop-******-master2:10000/default;
Note:
- The username and password can be left blank by default
- uhadoop-******-master2 should be replaced with the host name or IP of your cluster's master2 node
- Data operations
These are the same as in the Hive CLI
- Execute SQL commands directly from the command line:
beeline -u "jdbc:hive2://uhadoop-******-master2:10000" -e "select * from test_hive"
3. Hive Application Development
3.1 Connecting to HiveServer2 Using Java (implementing table creation, data loading, and data display)
This example requires you to first log in to the UHadoop cluster's master1 node; the following operations are performed on the master1 node by default.
org.apache.hive.jdbc.HiveDriver is the JDBC driver class for HiveServer2, and the HiveServer2 connection URL has the form "jdbc:hive2://ip:10000/default".
- Write sample code
The sample code Hive2JdbcClient.java is as follows:
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class Hive2JdbcClient {
private static String driverName = "org.apache.hive.jdbc.HiveDriver";
/**
* @param args
* @throws SQLException
*/
public static void main(String[] args) throws SQLException {
try {
Class.forName(driverName);
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
System.exit(1);
}
//replace "hive" here with the name of the user the queries should run as
Connection con = DriverManager.getConnection("jdbc:hive2://uhadoop-******-master2:10000/default", "", "");
Statement stmt = con.createStatement();
String tableName = "testHive2DriverTable";
stmt.execute("drop table if exists " + tableName);
stmt.execute("create table " + tableName + " (key int, value string)");
// show tables
String sql = "show tables '" + tableName + "'";
System.out.println("Running: " + sql);
ResultSet res = stmt.executeQuery(sql);
if (res.next()) {
System.out.println(res.getString(1));
}
// describe table
sql = "describe " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1) + "\t" + res.getString(2));
}
// load data into table
// NOTE: "load data inpath" takes an HDFS path (use "load data local inpath" for a file local to the Hive server)
// NOTE: /user/hive/warehouse/b.txt is a ctrl-A separated file with two fields per line, uploaded by test.sh below
String filepath = "/user/hive/warehouse/b.txt";
sql = "load data inpath '" + filepath + "' into table " + tableName;
System.out.println("Running: " + sql);
stmt.execute(sql);
// select * query
sql = "select * from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
}
// regular hive query
sql = "select count(1) from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1));
}
// release JDBC resources
res.close();
stmt.close();
con.close();
}
}
Note:
- Connection con = DriverManager.getConnection("jdbc:hive2://uhadoop-******-master2:10000/default", "", "");
- uhadoop-******-master2 should be replaced with the host name or IP of your cluster's master2 node
- Compile
javac Hive2JdbcClient.java
- Run the program
The test.sh script below prepares the input file on HDFS, builds the classpath, and runs the example:
#!/bin/bash
hdfs dfs -rm /user/hive/warehouse/b.txt
echo -e '1\x01foo' > /tmp/b.txt
echo -e '2\x01bar' >> /tmp/b.txt
hdfs dfs -put /tmp/b.txt /user/hive/warehouse/
HADOOP_HOME=/home/hadoop
HIVE_HOME=/home/hadoop/hive   # adjust if Hive is installed elsewhere
CLASSPATH=.:$HIVE_HOME/conf
for i in ${HADOOP_HOME}/share/hadoop/mapreduce/lib/hadoop-*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
for i in ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
for i in ${HADOOP_HOME}/share/hadoop/common/lib/hadoop-*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
for i in ${HADOOP_HOME}/share/hadoop/common/hadoop-*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
for i in ${HIVE_HOME}/lib/*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
java -cp $CLASSPATH Hive2JdbcClient
3.2 Connecting to HiveServer2 Using Python (implementing table creation, data loading, and data display)
The process of using the Python client with HiveServer2 is as follows:
- Download pyhs2
git clone https://github.com/BradRuderman/pyhs2.git
- Install dependencies
yum install gcc-c++ cyrus-sasl-* python-devel
- Install setuptools
wget -q http://peak.telecommunity.com/dist/ez_setup.py
python ez_setup.py
If the above installation fails, manually download the setuptools-0.6c11.tar.gz package and install it.
- Compile and install pyhs2
Enter the pyhs2 directory and install
cd pyhs2
python setup.py build
python setup.py install
- Write sample code
The sample code below is example.py from the pyhs2 repository:
import pyhs2
with pyhs2.connect(host='uhadoop-******-master2',
                   port=10000,
                   authMechanism="PLAIN",
                   user='root',
                   password='test',
                   database='default') as conn:
    with conn.cursor() as cur:
        # Show databases
        print cur.getDatabases()
        # Execute query
        cur.execute("select * from test_hive")
        # Return column info from query
        print cur.getSchema()
        # Fetch table results
        for i in cur.fetch():
            print i
Note: uhadoop-******-master2 should be replaced with the host name or IP of your cluster's master2 node, and test_hive should be replaced with an existing Hive table name.
3.3 Reading HBase Data Through a Hive External Table
By creating an HBase external table in Hive, you can use simple SQL statements to analyze unstructured data stored in HBase. The same table can also be queried from an application over JDBC, as sketched at the end of this section.
- Open the HBase shell, create table t1, and insert some data
create 't1',{NAME => 'f1',VERSIONS => 2}
put 't1','rowkey001','f1:col1','value01'
put 't1','rowkey001','f1:col2','value02'
put 't1','rowkey001','f1:colf','value03'
scan 't1'
The contents of table t1 are now as follows:
hbase(main):013:0> scan 't1'
ROW COLUMN+CELL
rowkey001 column=f1:col1, timestamp=1481075364575, value=value01
rowkey001 column=f1:col2, timestamp=1481075364607, value=value02
rowkey001 column=f1:colf, timestamp=1481075364641, value=value03
- Open the Hive CLI and create an external table
hive> CREATE EXTERNAL TABLE t_hive_hbase(
> rowkey string,
> cf map<STRING,STRING>
> )
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:")
> TBLPROPERTIES ("hbase.table.name" = "t1");
- Use a SQL statement to read the HBase data; the result is as follows:
hive> select * from t_hive_hbase;
OK
rowkey001 {"col1":"value01","col2":"value02","colf":"value03"}
4. Common Enhanced Features
4.1 Support for Custom UDF Functions
Below are two methods for writing UDF functions: the first uses Eclipse, and the second compiles the code directly on the UHadoop cluster's master node with a shell script.
Writing a custom Hive UDF function in Eclipse
1 Create a new Java project in Eclipse: hiveudf
2 Add a folder named lib to the hiveudf project and copy the required jar packages into it.
These four jars can be found at the following locations on uhadoop-******-master1:
/home/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.6.0-cdh5.4.9.jar
/home/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.6.0-cdh5.4.9.jar
/home/hadoop/share/hadoop/common/hadoop-common-2.6.0-cdh5.4.9.jar
/home/hadoop/hive/lib/hive-exec-1.2.1.jar
3 Add all jars in the lib directory to the build path
4 Create a UDFLower class that extends Hive's UDF class; leave the package of the new UDFLower class blank
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class UDFLower extends UDF{
public Text evaluate(final Text s){
if (null == s){
return null;
}
return new Text(s.toString().toLowerCase());
}
}
5 Package UDFLower.jar
- Execute Export
- Choose JAR file
- Modify the export directory name and click Finish; UDFLower.jar is ready
- Upload it to the uhadoop master1 node over SSH (for example with scp)
If master1 is already bound to an external network IP, the jar can be uploaded directly via that IP. If not, transfer it through a machine that has an external IP.
Writing a custom Hive UDF function on Linux
- Create the code
Log in to the uhadoop-******-master1 node, go to the /data directory, and create the UDF source file:
cd /data
mkdir hive_udf
cd hive_udf
touch UDFLower.java
The UDFLower.java code is as follows:
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class UDFLower extends UDF{
public Text evaluate(final Text s){
if (null == s){
return null;
}
return new Text(s.toString().toLowerCase());
}
}
- Create a compile.sh file in the same directory:
#!/bin/bash
rm -f UDFLower.jar
rm -f UDFLower*.class
export HADOOP_MAPRED_HOME=/home/hadoop
HIVE_HOME=/home/hadoop/hive   # adjust if Hive is installed elsewhere
CLASSPATH=.:$HIVE_HOME/conf
for i in /home/hadoop/share/hadoop/common/*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
for i in /home/hadoop/hive/lib/*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
javac -cp $CLASSPATH UDFLower.java
jar cvf UDFLower.jar UDFLower.class
Executing sh compile.sh generates UDFLower.jar in the same directory.
Using the generated UDFLower.jar in Hive
- Enter the directory where UDFLower.jar is located and upload it to HDFS:
hadoop fs -put UDFLower.jar /tmp/
- Prepare the data
The test file test.txt has the following content:
HELLO WORLD HEHE
Upload it to HDFS:
hadoop fs -put test.txt /tmp/
- Create the relevant data table in the Hive CLI and load the data
hive> create table test_hive_udf (name string);
hive> load data inpath '/tmp/test.txt' into table test_hive_udf;
- Create a temporary function
hive> create temporary function my_lower as 'UDFLower';
- Use the custom UDF (a JDBC-based variant is sketched after these commands)
hive> select name from test_hive_udf;
hive> select my_lower(name) from test_hive_udf;
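When the UDF is called from an application through HiveServer2 rather than the Hive CLI, one option is to register it as a permanent function that references the jar already uploaded to HDFS (CREATE FUNCTION ... USING JAR, available in Hive 0.13 and later). The sketch below is a minimal JDBC example under that assumption; the function name my_lower2 and the class name UdfFromJdbc are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class UdfFromJdbc {
    public static void main(String[] args) throws Exception {
        // same HiveServer2 driver and URL as in section 3.1
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://uhadoop-******-master2:10000/default", "", "");
        Statement stmt = con.createStatement();
        // register a permanent function backed by the jar uploaded to HDFS above
        stmt.execute("drop function if exists my_lower2");
        stmt.execute("create function my_lower2 as 'UDFLower' "
                + "using jar 'hdfs:///tmp/UDFLower.jar'");
        // apply the UDF to the test table
        ResultSet res = stmt.executeQuery("select my_lower2(name) from test_hive_udf");
        while (res.next()) {
            System.out.println(res.getString(1));
        }
        res.close();
        stmt.close();
        con.close();
    }
}
Drop the function with drop function my_lower2 once it is no longer needed.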
4.2 Support for JSON-Format Data
- Data preparation
Upload the test.json data to HDFS.
The test.json data is as follows:
{"id":1234544353,"text":"test_hive_json"}
Upload data to HDFS
hdfs dfs -put test.json /tmp/test.json
- Load the dependency jar
Hive depends on hive-json-serde.jar to parse JSON-format data. If you use beeline, you do not need to load the jar with add jar; if you are using the Hive CLI, add the jar as follows:
hive> add jar $HIVE_HOME/lib/hive-json-serde.jar;
Note that Hive 1.2.1 does not ship with this jar; if it is missing on your cluster, a built-in alternative that needs no extra jar is sketched at the end of this section.
- Create a Hive table
hive> CREATE TABLE test_json(id BIGINT, text STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' STORED AS TEXTFILE;
- Load data
hive> LOAD DATA INPATH "/tmp/test.json" OVERWRITE INTO TABLE test_json;
- Run query
If the following is returned, the JSON file was parsed successfully:
hive> select * from test_json;
OK
1234544353 test_hive_json
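If hive-json-serde.jar is not available on your cluster (see the note above), the built-in get_json_object function can parse the same records without any extra jar. The sketch below is a minimal JDBC example under that assumption; it expects the raw JSON file to be present at /tmp/test.json (re-upload it if it was already consumed by the LOAD above), and the table name test_json_raw and class name HiveJsonBuiltin are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJsonBuiltin {
    public static void main(String[] args) throws Exception {
        // same HiveServer2 driver and URL as in section 3.1
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://uhadoop-******-master2:10000/default", "", "");
        Statement stmt = con.createStatement();
        // store each JSON record as a plain string line
        stmt.execute("drop table if exists test_json_raw");
        stmt.execute("create table test_json_raw (line string) stored as textfile");
        stmt.execute("load data inpath '/tmp/test.json' into table test_json_raw");
        // extract fields with the built-in get_json_object function
        ResultSet res = stmt.executeQuery(
                "select get_json_object(line, '$.id'), get_json_object(line, '$.text') from test_json_raw");
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }
        res.close();
        stmt.close();
        con.close();
    }
}
get_json_object returns the extracted values as strings, so cast them if numeric types are required.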
4.3 Using Regular-Expression Matching
- Data preparation
Upload the test data nginx_log to HDFS.
The nginx_log data is as follows:
180.173.250.74 168.34.34
Upload to HDFS
hdfs dfs -put nginx_log /tmp/nginx_log
- Load the dependency jar
Hive's regular-expression matching depends on hive-contrib.jar. If you use beeline, you do not need to load the jar with add jar; if you are using the Hive CLI, add the jar as follows:
hive> add jar $HIVE_HOME/lib/hive-contrib.jar;
- Create a test table
hive> CREATE TABLE logs(
host STRING,
identity STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*)",
"output.format.string" = "%1$s %2$s"
)STORED AS TEXTFILE;
Note: When creating the table, you need to specify ROW FORMAT SERDE together with the input.regex and output.format.string SERDEPROPERTIES.
- Load data
hive> LOAD DATA INPATH "/tmp/nginx_log" INTO TABLE logs;
- Query the data
If the following is returned, the regular expression was applied successfully:
hive> select * from logs;
OK
180.173.250.74 168.34.34