Sqoop Development Guide
Sqoop is a tool for efficiently transferring bulk data between Hadoop and relational databases. It can import data from a relational database (such as MySQL, Oracle, or Postgres) into Hadoop's HDFS, and it can also export data from HDFS back into a relational database.
Note:
- Before Sqoop can import from or export to MySQL, make sure the MySQL data is accessible to remote users; otherwise permission errors will be reported (see the grant example after this note).
- For detailed usage, please refer to the official Sqoop documentation.
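As a hedged illustration of the remote-access note above, the following MySQL 5.x statements grant an example account remote access; the user name test, password test, and database hehe are placeholders that match the examples later in this guide and should be adjusted to your own security policy.
# Run on the MySQL server; account and database names are examples only
mysql -e "GRANT ALL PRIVILEGES ON hehe.* TO 'test'@'%' IDENTIFIED BY 'test'"
mysql -e "FLUSH PRIVILEGES"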
1. Basic Operations
1.1 Sqoop Installation Configuration
In UHadoop, Sqoop is installed by default together with Oozie. If you selected Oozie when creating the cluster, Sqoop is installed on the uhadoop-*****-master2 node. If you need to install Sqoop separately, follow the steps below; otherwise, you can skip this section.
a. Script Installation in UHadoop
Taking the installation on the master1 node as an example.
- Execute the following command as the root user on the master1 node; Sqoop will be installed in the /home/hadoop/ directory by default.
For cdh5.4.9:
sh /home/hadoop/.versions/umrAgent/script/install.sh sqoop 1.4.5 cdh5.4.9
For cdh5.13.3:
sh /home/hadoop/.versions/umrAgent/script/install.sh sqoop 1.4.7 cdh5.13.3
After the installation finishes, run source ~/.bashrc
b. Tar Package Installation (example for cdh5.4.9; cdh5.13.3 can use the script installation described above)
The following example downloads the tar package and installs it on a UHadoop node. If you are installing on a non-UHadoop node, adjust the related paths accordingly.
- Download sqoop-1.4.5-cdh5.4.9.tar.gz from http://archive.cloudera.com/cdh5/cdh/5/ and extract it to /home/hadoop/.versions/
wget "http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.5-cdh5.4.9.tar.gz"
tar -zvxf sqoop-1.4.5-cdh5.4.9.tar.gz -C /home/hadoop/.versions/
- Create a soft link as the hadoop user
ln -s /home/hadoop/.versions/sqoop-1.4.5-cdh5.4.9/ /home/hadoop/sqoop
- Add environment variables (vim ~/.bashrc):
# sqoop
export SQOOP_HOME=$HADOOP_HOME/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
Run source ~/.bashrc
- Configure Sqoop environment variables
cp $SQOOP_HOME/conf/sqoop-env-template.sh $SQOOP_HOME/conf/sqoop-env.sh
Modify the following parameters:
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/home/hadoop
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/home/hadoop
#set the path to where bin/hbase is available
export HBASE_HOME=/home/hadoop/hbase
#Set the path to where bin/hive is available
export HIVE_HOME=/home/hadoop/hive
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*
#Set the path to where the zookeeper config dir is
export ZOOCFGDIR=/home/hadoop/zookeeper/conf
- Copy related dependencies
cd /home/hadoop/sqoop
cp /home/hadoop/hive/lib/mysql-connector-java-*.jar ./lib/
cp -rf /home/hadoop/share/hadoop/mapreduce/* ./lib/
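To confirm that the installation works, a quick smoke test such as the following can be run under the hadoop user (the MySQL address and credentials below are placeholders):
# Print the installed Sqoop version
sqoop version
# Optionally, list databases on a reachable MySQL instance
sqoop list-databases --connect jdbc:mysql://10.10.50.79:3306 --username test --password test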
1.2 Import from MySQL to HDFS
- The MySQL instance used in this example is as follows:
• IP address: 10.10.50.79
• Database: hehe
• Username: test, Password: test
The table information is as follows:
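The table structure and contents can be viewed with a query such as the following, using the connection details above:
mysql -h 10.10.50.79 -utest -ptest hehe -e "desc t_hadoop_version; select * from t_hadoop_version;"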
Execute Statements
sqoop import --connect jdbc:mysql://10.10.50.79/hehe --username test --password test --table t_hadoop_version --target-dir /tmp/sqoop-import-hdfs
Note:
- 10.10.50.79: the IP address of the MySQL server
- --username: the user name for accessing the database
- --password: the password for accessing the database
- --table t_hadoop_version: the name of the source table
- --target-dir: the target HDFS directory for the imported data
View Results
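The imported files can be checked in HDFS, for example (the target directory matches the command above):
hadoop fs -ls /tmp/sqoop-import-hdfs
hadoop fs -cat /tmp/sqoop-import-hdfs/part-m-*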
1.3 Import from MySQL to Hive
- The MySQL instance and table are the same as in section 1.2.
Execute Statements
sqoop import --connect jdbc:mysql://10.10.50.79/hehe --username test --password test --table t_hadoop_version --warehouse-dir /user/hive/warehouse --hive-import --create-hive-table
Note:
- 10.10.50.79: the IP address of the MySQL server
- --username test: the user name for accessing the database
- --password test: the password for accessing the database
- --table t_hadoop_version: the name of the source table
- --warehouse-dir: the HDFS directory of the Hive warehouse
View Results
The table t_hadoop_version from MySQL has been imported into Hive.
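The import can be verified from the Hive CLI, for example:
hive -e "show tables; select * from t_hadoop_version limit 10;"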
1.4 Export from Hive to MySQL
The data in the Hive table is as follows:
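The data to be exported can be viewed through Hive or directly from the table's HDFS directory (the default warehouse path is assumed here):
hive -e "select * from t_hadoop_version;"
hadoop fs -cat /user/hive/warehouse/t_hadoop_version/*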
Execute Statements
sqoop export --connect jdbc:mysql://10.10.50.79/hehe --username test --password test --table t_hadoop_version --export-dir /user/hive/warehouse/t_hadoop_version --fields-terminated-by "\t"
Note:
- 10.10.50.79: the IP address of the MySQL server
- --username test: the user name for accessing the database
- --password test: the password for accessing the database
- --table t_hadoop_version: the name of the target table in MySQL
- --export-dir: the HDFS directory of the data to be exported
- --fields-terminated-by: the field delimiter of the exported data
View Results
The exported data now appears in the MySQL database, matching the data in the Hive table.
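Note that sqoop export requires the target table to already exist in MySQL. The exported rows can be checked with a query such as:
mysql -h 10.10.50.79 -utest -ptest hehe -e "select * from t_hadoop_version;"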
2. Using Sqoop in UHadoop
In this example, Sqoop is run on the master2 node of the UHadoop cluster to incrementally import data from a MySQL database on the internal network (UDB or a self-built database) into Hive on the UHadoop cluster.
Note: Sqoop is already deployed on Master1 in existing clusters; log in there to test.
When importing incremental data from MySQL into Hive, the rows to be imported must be identified from the value of a chosen column in the source data.
In this example, a test database, table, and data are first created in MySQL. The timestamp column t of the source table is used as the check column: its values are compared with the last timestamp already imported into the Hive database to determine which rows need to be imported incrementally.
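As a sketch of how the starting point of the increment can be determined, the maximum value of the check column already present in Hive can be queried and used as the --last-value (the database and table names follow the example created below):
hive -e "select max(t) from data_demo1.data_import;"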
2.1 Create the Database and Table in MySQL
Create the source database (data_demo) and table (data_export) in MySQL
create database data_demo;
create table data_demo.data_export (i int primary key, t timestamp(3), cn VARCHAR(64));
2.2 Create the Database and Table in Hive
create database data_demo1;
create table data_demo1.data_import (i int , t VARCHAR(64), cn VARCHAR(64));
2.3 Generate Data
Create a script (gene_data.sh) to generate test data and insert it into the data_demo database, then run the script.
touch gene_data.sh
The script content is as follows:
#!/bin/bash
# Upsert a row with a small key (0-9); repeated runs update its timestamp
i=$(($RANDOM % 10))
up="insert into data_export values ('$i' , current_timestamp() , '中文\n') ON DUPLICATE KEY UPDATE t=current_timestamp()"
echo "$up"
mysql data_demo -e "$up"
# Insert a new row with a random key; the value contains Chinese text and a newline escape
i=$RANDOM
insert="insert into data_export values ('$i' , current_timestamp() , '中文\n');"
echo "$insert"
mysql data_demo -e "$insert"
Execute the script to insert data
bash gene_data.sh
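The generated rows can be checked before the import, for example:
mysql data_demo -e "select * from data_export order by t desc;"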
2.4 Add Permissions
Before the cluster can pull data from MySQL, the MySQL database must grant access permissions to the nodes of the UHadoop (Hive) cluster. The permissions can be added by running the following script.
Create a script file
touch grant.sh
Edit the script file content as follows:
#!/bin/bash
# Example account and database; change these as needed
USER_NAME=hello
USER_PASSWD=word
DATA_BASE=data_demo
# Create the user and grant privileges for every UHadoop node listed in /etc/hosts
for h in `grep uhadoop /etc/hosts | awk '{print $2}'`
do
    mysql -e "CREATE USER '$USER_NAME'@'$h' IDENTIFIED BY '$USER_PASSWD'"
    echo "CREATE USER '$USER_NAME'@'$h' IDENTIFIED BY '$USER_PASSWD'"
    mysql -e "GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON $DATA_BASE.* TO '$USER_NAME'@'$h'"
    echo "GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON $DATA_BASE.* TO '$USER_NAME'@'$h'"
done
mysql -e "FLUSH PRIVILEGES"
Execute the script to modify permissions
bash grant.sh
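Whether the accounts were created can be checked on the MySQL server, for example:
mysql -e "select user, host from mysql.user where user='hello';"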
2.5 Import Data
Note: In this example the MySQL address is "10.10.115.1"; modify $USER and $PASSWORD as needed, and set "--last-value" according to the actual time of the test data generated in MySQL.
sqoop import --connect jdbc:mysql://10.10.115.1:3306/data_demo --username $USER --password $PASSWORD --table data_export --warehouse-dir /user/hive/warehouse/tmp --hive-import --hive-table data_demo1.data_import --check-column t --incremental lastmodified --last-value '2016-11-15 17:22:24.496' --merge-key i --hive-drop-import-delims --hive-overwrite -m 1
Refer to the Sqoop user guide for a detailed explanation of the parameters:
http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
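After the import completes, the incrementally imported rows can be checked in Hive, for example:
hive -e "select * from data_demo1.data_import order by t desc limit 10;"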
Possible problems encountered during the test:
- The import fails on a task node or core node of the UHadoop cluster because of missing MySQL permissions.
Grant SELECT permission on the MySQL database to all nodes in the cluster.
- Only one MapReduce map task runs during the import.
The number of map tasks can be set with the -m parameter; the command above sets it to 1.