
Spark

How to Use Spark-sql to Access Hive Tables?

If spark-sql cannot access Hive tables, check that the following steps have been completed. Any step that is missing must be carried out on the uhadoop cluster and the Spark client:

- Copy /home/hadoop/hive/conf/hive-site.xml and /home/hadoop/hbase/conf/hbase-site.xml to /home/hadoop/spark/conf/;

- Copy /home/hadoop/hive/lib/hive-serde-*-cdh*.jar to /home/hadoop/spark/lib/;

- Add the following configuration in /home/hadoop/spark/conf/spark-env.sh:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/hadoop/lib/native
    export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/hadoop/lib/native
    export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/spark/lib/*:/home/hadoop/hive/lib/*:/home/hadoop/hbase/lib/*
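
The steps above can be sketched as a small verification script. The paths are the uhadoop defaults quoted in this FAQ and may differ on your cluster; the `check` helper is only for illustration:

```shell
#!/bin/bash
# Sketch: verify the prerequisites listed above on a Spark client.
# Paths are the uhadoop defaults from this FAQ; adjust for your layout.
SPARK_CONF="${SPARK_CONF:-/home/hadoop/spark/conf}"
SPARK_LIB="${SPARK_LIB:-/home/hadoop/spark/lib}"

check() {
  # $1: a path or glob that should match at least one file
  if compgen -G "$1" > /dev/null; then
    echo "OK       $1"
  else
    echo "MISSING  $1"
  fi
}

check "$SPARK_CONF/hive-site.xml"
check "$SPARK_CONF/hbase-site.xml"
check "$SPARK_LIB/hive-serde-*-cdh*.jar"

# spark-env.sh should export the native-library and classpath settings.
if grep -q 'lib/native' "$SPARK_CONF/spark-env.sh" 2>/dev/null; then
  echo "OK       spark-env.sh exports"
else
  echo "MISSING  spark-env.sh exports"
fi
```

Any line reported as MISSING points at the step that still needs to be done.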

What to Do if java.sql.SQLException Is Thrown When Using Spark to Connect to the Database?

If the following error occurs when using Spark to connect to the database:

    Exception in thread "main" java.sql.SQLException: No suitable driver found for jdbc:mysql://

This is because the MySQL connector package on the cluster is version 5.1.17 (/home/hadoop/hive/lib/mysql-connector-java-5.1.17.jar), while the database the user connects to may be a newer version. Download the latest connector package from the MySQL official website and replace it.

Example:

    spark-submit --class users_day_activity --master yarn-client --jars /root/hive/lib/mysql-connector-java-5.6.jar --executor-memory 2g --num-executors 5

Other databases or services may hit the same problem; replace the corresponding driver package in the same way.
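
As a sketch, the example above can be wrapped in a script that picks the newest connector jar it finds and prints the resulting spark-submit command. JAR_DIR, the application class name, and the fallback jar path are illustrative, not part of the official procedure:

```shell
#!/bin/bash
# Sketch: locate the newest MySQL connector jar and build the
# spark-submit line from this FAQ around it.
JAR_DIR="${JAR_DIR:-/home/hadoop/hive/lib}"

# Pick the highest-versioned connector jar found; fall back to a
# placeholder path so the printed command still shows the shape.
DRIVER_JAR=$(ls "$JAR_DIR"/mysql-connector-java-*.jar 2>/dev/null | sort -V | tail -n 1)
DRIVER_JAR="${DRIVER_JAR:-/root/hive/lib/mysql-connector-java.jar}"

# --jars ships the driver to the executors; --driver-class-path makes
# it visible to the driver JVM as well.
CMD="spark-submit --class users_day_activity --master yarn-client \
--jars $DRIVER_JAR --driver-class-path $DRIVER_JAR \
--executor-memory 2g --num-executors 5"
echo "$CMD"
```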

How to View Spark’s Running Task Logs?

A small number of existing (stock) clusters do not have Spark’s history service configured, so the logs of Spark tasks cannot be viewed. The service needs to be configured manually.

The specific method is as follows:

1. Configure the job history service for Spark

Modify the configuration file /home/hadoop/spark/conf/spark-defaults.conf

    spark.history.ui.port 18080
    spark.eventLog.dir hdfs://Ucluster/var/log/spark
    spark.eventLog.enabled  true
    spark.yarn.historyServer.address uhadoop-XXXXXX-master2:18080
    spark.history.fs.logDirectory hdfs://Ucluster/var/log/spark

Note: 1. The configuration must be modified on all nodes of the cluster, including the client that submits tasks; 2. uhadoop-XXXXXX must be replaced with the cluster ID.
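
The five properties above must end up in spark-defaults.conf on every node. A minimal sketch of an idempotent append, demonstrated on a temporary file (on a real node, point CONF at /home/hadoop/spark/conf/spark-defaults.conf and set your cluster ID):

```shell
#!/bin/bash
# Sketch: append the history-server settings if not already present.
# Demonstrated on a temp file; on a real node use
# /home/hadoop/spark/conf/spark-defaults.conf instead.
CLUSTER_ID="uhadoop-XXXXXX"   # replace with your cluster ID
CONF="$(mktemp)"              # stand-in for spark-defaults.conf

add_setting() {
  # $1 = key, $2 = value; append only when the key is absent
  grep -q "^$1 " "$CONF" || echo "$1 $2" >> "$CONF"
}

add_setting spark.history.ui.port 18080
add_setting spark.eventLog.dir hdfs://Ucluster/var/log/spark
add_setting spark.eventLog.enabled true
add_setting spark.yarn.historyServer.address "${CLUSTER_ID}-master2:18080"
add_setting spark.history.fs.logDirectory hdfs://Ucluster/var/log/spark

cat "$CONF"
```

Because the helper checks for an existing key first, re-running the script on a node that was already configured makes no changes.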

2. Configure the log URL for NodeManager

On all NodeManager nodes, add the following configuration to /home/hadoop/conf/yarn-site.xml:

     <property>
      <name>yarn.log.server.url</name>
      <value>http://uhadoop-XXXXXX-master2:19888/jobhistory/logs</value>
     </property>

Set the address to the hostname of the cluster’s master2 node.

After modifying the configuration, restart the NodeManager service.

3. Create the log directory

On all NodeManager nodes, execute the following commands as the root user:

    su -s /bin/bash hadoop -c 'hdfs dfs -mkdir hdfs://Ucluster/var/log/spark'
    su -s /bin/bash hadoop -c 'hdfs dfs -chmod 777  hdfs://Ucluster/var/log/spark'

4. Start the service

On the master2 node, execute the /home/hadoop/spark/sbin/start-history-server.sh script as the hadoop user.
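
Once the script has run, a quick way to confirm the service came up (a sanity check, not part of the official procedure; port 18080 is the spark.history.ui.port configured above):

```shell
#!/bin/bash
# Sketch: check whether the Spark HistoryServer JVM is running on this
# node (run it on master2 after start-history-server.sh).
if pgrep -f HistoryServer > /dev/null; then
  STATUS="running"
else
  STATUS="not running"
fi
echo "Spark history server: $STATUS"
```

If it reports running, the web UI should be reachable at http://uhadoop-XXXXXX-master2:18080 as configured in spark-defaults.conf.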