Sqoop Development Guide
Sqoop is a tool for efficiently transferring bulk data between Hadoop and relational databases. It can import data from a relational database (such as MySQL, Oracle, or Postgres) into Hadoop's HDFS, and it can also export data from HDFS back into a relational database.
Note:
- Before Sqoop can import from or export to MySQL, make sure the MySQL data is accessible to remote users; otherwise permission errors will be reported (see the grant example after this note).
- For detailed usage, please refer to the official Sqoop documentation.
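As a hedged illustration of the remote-access note above, the following MySQL 5.x statements grant an example account remote access; the user name test, password test, and database hehe are placeholders that match the examples later in this guide and should be adjusted to your own security policy.
# Run on the MySQL server; account and database names are examples only
mysql -e "GRANT ALL PRIVILEGES ON hehe.* TO 'test'@'%' IDENTIFIED BY 'test'"
mysql -e "FLUSH PRIVILEGES"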
1. Basic Operations
1.1 Sqoop Installation Configuration
In UHadoop, Sqoop is installed by default together with Oozie. If you selected Oozie when creating the cluster, Sqoop is installed on the uhadoop-*****-master2 node. If you need to install Sqoop separately, follow the steps below; otherwise, you can skip this section.
a. Script Installation in UHadoop
Taking the installation on the master1 node as an example.
- Execute the following command as the root user on the master1 node; Sqoop will be installed in the /home/hadoop/ directory by default.
For cdh5.4.9:
sh /home/hadoop/.versions/umrAgent/script/install.sh sqoop 1.4.5 cdh5.4.9
For cdh5.13.3:
sh /home/hadoop/.versions/umrAgent/script/install.sh sqoop 1.4.7 cdh5.13.3
After the installation finishes, run source ~/.bashrc
b. Tar Package Installation (example for cdh5.4.9; cdh5.13.3 can use the script installation described above)
The following example downloads the tar package and installs it on a UHadoop node. If you are installing on a non-UHadoop node, adjust the related paths accordingly.
- Download sqoop-1.4.5-cdh5.4.9.tar.gz from http://archive.cloudera.com/cdh5/cdh/5/ and extract it to /home/hadoop/.versions/
wget "http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.5-cdh5.4.9.tar.gz"
tar -zvxf sqoop-1.4.5-cdh5.4.9.tar.gz -C /home/hadoop/.versions/
- Create a soft link as the hadoop user
ln -s /home/hadoop/.versions/sqoop-1.4.5-cdh5.4.9/ /home/hadoop/sqoop
- Add environment variables (vim ~/.bashrc):
# sqoop
export SQOOP_HOME=$HADOOP_HOME/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
Run source ~/.bashrc
- Configure Sqoop environment variables
cp $SQOOP_HOME/conf/sqoop-env-template.sh $SQOOP_HOME/conf/sqoop-env.sh
Modify the following parameters:
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/home/hadoop
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/home/hadoop
#set the path to where bin/hbase is available
export HBASE_HOME=/home/hadoop/hbase
#Set the path to where bin/hive is available
export HIVE_HOME=/home/hadoop/hive
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*
#Set the path to where the zookeeper config dir is
export ZOOCFGDIR=/home/hadoop/zookeeper/conf
- Copy related dependencies
cd /home/hadoop/sqoop
cp /home/hadoop/hive/lib/mysql-connector-java-*.jar ./lib/
cp -rf /home/hadoop/share/hadoop/mapreduce/* ./lib/
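To confirm that the installation works, a quick smoke test such as the following can be run under the hadoop user (the MySQL address and credentials below are placeholders):
# Print the installed Sqoop version
sqoop version
# Optionally, list databases on a reachable MySQL instance
sqoop list-databases --connect jdbc:mysql://10.10.50.79:3306 --username test --password test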
1.2 Import from MySQL to HDFS
- The MySQL instance used in this example is as follows:
• IP address: 10.10.50.79
• Database: hehe
• Username: test, Password: test
The table information is as follows:
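The table structure and contents can be viewed with a query such as the following, using the connection details above:
mysql -h 10.10.50.79 -utest -ptest hehe -e "desc t_hadoop_version; select * from t_hadoop_version;"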
Execute Statements
sqoop import --connect jdbc:mysql://10.10.50.79/hehe --username test --password test --table t_hadoop_version --target-dir /tmp/sqoop-import-hdfs
Note:
- 10.10.50.79: the IP address of the MySQL server
- --username: the user name for accessing the database
- --password: the password for accessing the database
- --table t_hadoop_version: the name of the source table
- --target-dir: the target HDFS directory for the imported data
View Results
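The imported files can be checked in HDFS, for example (the target directory matches the command above):
hadoop fs -ls /tmp/sqoop-import-hdfs
hadoop fs -cat /tmp/sqoop-import-hdfs/part-m-*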
1.3 Import from MySQL to Hive
- The MySQL instance and table are the same as in section 1.2.
Execute Statements
sqoop import --connect jdbc:mysql://10.10.50.79/hehe --username test --password test --table t_hadoop_version --warehouse-dir /user/hive/warehouse --hive-import --create-hive-table
Note:
- 10.10.50.79: the IP address of the MySQL server
- --username test: the user name for accessing the database
- --password test: the password for accessing the database
- --table t_hadoop_version: the name of the source table
- --warehouse-dir: the HDFS directory of the Hive warehouse
View Results
The table t_hadoop_version from MySQL has been imported into Hive.
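The import can be verified from the Hive CLI, for example:
hive -e "show tables; select * from t_hadoop_version limit 10;"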
1.4 Export from Hive to MySQL
The data in the Hive table is as follows:
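The data to be exported can be viewed through Hive or directly from the table's HDFS directory (the default warehouse path is assumed here):
hive -e "select * from t_hadoop_version;"
hadoop fs -cat /user/hive/warehouse/t_hadoop_version/*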
Execute Statements
sqoop export --connect jdbc:mysql://10.10.50.79/hehe --username test --password test --table t_hadoop_version --export-dir /user/hive/warehouse/t_hadoop_version --fields-terminated-by "\t"
Note:
- 10.10.50.79: the IP address of the MySQL server
- --username test: the user name for accessing the database
- --password test: the password for accessing the database
- --table t_hadoop_version: the name of the target table in MySQL
- --export-dir: the HDFS directory of the data to be exported
- --fields-terminated-by: the field delimiter of the exported data
View Results
The exported data now appears in the MySQL database, matching the data in the Hive table.
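Note that sqoop export requires the target table to already exist in MySQL. The exported rows can be checked with a query such as:
mysql -h 10.10.50.79 -utest -ptest hehe -e "select * from t_hadoop_version;"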
2. Using Sqoop in UHadoop
In this example, Sqoop is run on the master2 node of the UHadoop cluster to incrementally import data from a MySQL database on the internal network (UDB or a self-built database) into Hive on the UHadoop cluster.
Note: Sqoop is already deployed on Master1 in existing clusters; log in there to test.
When importing incremental data from MySQL into Hive, the rows to be imported must be identified from the value of a chosen column in the source data.
In this example, a test database, table, and data are first created in MySQL. The timestamp column t of the source table is used as the check column: its values are compared with the last timestamp already imported into the Hive database to determine which rows need to be imported incrementally.
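As a sketch of how the starting point of the increment can be determined, the maximum value of the check column already present in Hive can be queried and used as the --last-value (the database and table names follow the example created below):
hive -e "select max(t) from data_demo1.data_import;"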
2.1 Create the Database and Table in MySQL
Create the source database (data_demo) and table (data_export) in MySQL
create database data_demo;
create table data_demo.data_export (i int primary key, t timestamp(3), cn VARCHAR(64));
2.2 Create the Database and Table in Hive
create database data_demo1;
create table data_demo1.data_import (i int , t VARCHAR(64), cn VARCHAR(64));
2.3 Generate Data
Create a script (gene_data.sh) to generate test data and insert it into the data_demo database, then run the script.
touch gene_data.sh
The script content is as follows:
#!/bin/bash
# Upsert a row with a small key (0-9); repeated runs update its timestamp
i=$(($RANDOM % 10))
up="insert into data_export values ('$i' , current_timestamp() , '中文\n') ON DUPLICATE KEY UPDATE t=current_timestamp()"
echo "$up"
mysql data_demo -e "$up"
# Insert a new row with a random key; the value contains Chinese text and a newline escape
i=$RANDOM
insert="insert into data_export values ('$i' , current_timestamp() , '中文\n');"
echo "$insert"
mysql data_demo -e "$insert"
Execute the script to insert data
bash gene_data.sh
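The generated rows can be checked before the import, for example:
mysql data_demo -e "select * from data_export order by t desc;"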
2.4 Add Permissions
Before the cluster can pull data from MySQL, the MySQL database must grant access permissions to the nodes of the UHadoop (Hive) cluster. The permissions can be added by running the following script.
Create a script file
touch grant.sh
Edit the script file content as follows:
#!/bin/bash
# Example account and database; change these as needed
USER_NAME=hello
USER_PASSWD=word
DATA_BASE=data_demo
# Create the user and grant privileges for every UHadoop node listed in /etc/hosts
for h in `grep uhadoop /etc/hosts | awk '{print $2}'`
do
    mysql -e "CREATE USER '$USER_NAME'@'$h' IDENTIFIED BY '$USER_PASSWD'"
    echo "CREATE USER '$USER_NAME'@'$h' IDENTIFIED BY '$USER_PASSWD'"
    mysql -e "GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON $DATA_BASE.* TO '$USER_NAME'@'$h'"
    echo "GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON $DATA_BASE.* TO '$USER_NAME'@'$h'"
done
mysql -e "FLUSH PRIVILEGES"
Execute the script to modify permissions
bash grant.sh
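Whether the accounts were created can be checked on the MySQL server, for example:
mysql -e "select user, host from mysql.user where user='hello';"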
2.5 Import Data
Note: In this example the MySQL address is "10.10.115.1"; modify $USER and $PASSWORD as needed, and set "--last-value" according to the actual time of the test data generated in MySQL.
sqoop import --connect jdbc:mysql://10.10.115.1:3306/data_demo --username $USER --password $PASSWORD --table data_export --warehouse-dir /user/hive/warehouse/tmp --hive-import --hive-table data_demo1.data_import --check-column t --incremental lastmodified --last-value '2016-11-15 17:22:24.496' --merge-key i --hive-drop-import-delims --hive-overwrite -m 1
Refer to the Sqoop user guide for a detailed explanation of the parameters:
http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
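After the import completes, the incrementally imported rows can be checked in Hive, for example:
hive -e "select * from data_demo1.data_import order by t desc limit 10;"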
Possible problems encountered during the test:
- The import fails on a task node or core node of the UHadoop cluster because of missing MySQL permissions.
Grant SELECT permission on the MySQL database to all nodes in the cluster.
- Only one MapReduce map task runs during the import.
The number of map tasks can be set with the -m parameter; the command above sets it to 1.