Python Development Guide

If you use PySpark for data analysis or machine learning, you need to install a number of Python dependency packages on the cluster. This guide describes how to install several commonly used packages. To download and install other packages, refer to the PyPI website.

Because some of these packages do not support Python 2.6, all of the installation steps below assume Python 2.7. We recommend upgrading Python on the cluster to version 2.7.
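
Before installing anything, it can help to confirm which interpreter the `python` command actually resolves to on each node. The following is a minimal sketch of such a check; the helper name `meets_minimum` and the constant `MINIMUM_VERSION` are illustrative and not part of any cluster tooling.

```python
import sys

# Minimum interpreter version assumed by this guide.
MINIMUM_VERSION = (2, 7)


def meets_minimum(version_info=None, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if meets_minimum():
        print("Python %s is new enough" % current)
    else:
        print("Python %s is too old; upgrade to 2.7 first" % current)
```

Run the script with the same `python` command that is later used for `python setup.py install`, so that the check and the installation refer to the same interpreter.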

1. NumPy

NumPy is a scientific computing package for Python. It stores and processes large matrices far more efficiently than Python's built-in nested lists.
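
As a rough illustration of the difference, the sketch below multiplies the same two small matrices once with nested lists and once with NumPy. It assumes NumPy is already installed; the helper `matmul_nested` is written for this example only.

```python
import numpy as np


def matmul_nested(a, b):
    """Multiply two matrices stored as nested lists (pure Python)."""
    columns_of_b = list(zip(*b))
    return [
        [sum(x * y for x, y in zip(row, column)) for column in columns_of_b]
        for row in a
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# Pure-Python version: every multiplication runs in the interpreter.
nested_result = matmul_nested(a, b)

# NumPy version: the same product is computed in compiled code.
numpy_result = np.dot(np.array(a), np.array(b))

print(nested_result)
print(numpy_result.tolist())
```

Both calls produce the same matrix. The speed advantage of NumPy only becomes noticeable on much larger inputs than this 2x2 example.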

The latest version of NumPy can be found by searching on the PyPI website.

Taking numpy-1.12.0 as an example, install it as follows:

unzip numpy-1.12.0.zip
cd numpy-1.12.0
python setup.py install

2. SciPy

SciPy is a Python toolkit for scientific and engineering computing.

The latest version of SciPy can be found by searching on the PyPI website.

NumPy must be installed before installing SciPy.

Taking scipy-0.18.1 as an example, install it as follows:

tar zxf scipy-0.18.1.tar.gz
cd scipy-0.18.1
python setup.py install

3. Scikit-Learn

Scikit-Learn is a machine learning toolkit built on top of SciPy.

The latest version of Scikit-Learn can be found by searching on the PyPI website.

NumPy and SciPy must be installed before installing Scikit-Learn.

Taking scikit-learn-0.18.1 as an example, install it as follows:

tar zxf scikit-learn-0.18.1.tar.gz
cd scikit-learn-0.18.1
python setup.py install
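
Because Scikit-Learn depends on NumPy and SciPy being installed first, it may be worth checking for them before running the build. The following is a minimal sketch of such a check; the function `missing_modules` and the list `SKLEARN_PREREQUISITES` are illustrative names, not part of any of these packages.

```python
import importlib


def missing_modules(module_names):
    """Return the names from `module_names` that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Import names of the prerequisites this guide lists for Scikit-Learn.
SKLEARN_PREREQUISITES = ["numpy", "scipy"]

if __name__ == "__main__":
    missing = missing_modules(SKLEARN_PREREQUISITES)
    if missing:
        print("Install these first: %s" % ", ".join(missing))
    else:
        print("Prerequisites found; Scikit-Learn can be installed")
```

The same helper can be reused for SciPy by passing `["numpy"]`.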

4. SymPy

SymPy is a Python library for symbolic mathematics, which can be used for the symbolic derivation of mathematical formulas.

The latest version of SymPy can be found by searching on the PyPI website.

Taking sympy-1.0 as an example, install it as follows:

tar zxf sympy-1.0.tar.gz
cd sympy-1.0
python setup.py install

5. Pandas

Pandas (Python Data Analysis Library) is a NumPy-based tool for data analysis tasks.

The latest version of Pandas can be found by searching on the PyPI website.

Taking pandas-0.19.2 as an example, install it as follows:

tar zxf pandas-0.19.2.tar.gz
cd pandas-0.19.2
python setup.py install

6. Matplotlib

Matplotlib is a widely used plotting library for Python. It provides a full set of command-style APIs similar to MATLAB's and is well suited to interactive plotting.

The latest version of Matplotlib can be found by searching on the PyPI website.

Taking matplotlib-2.0.0 as an example, install it as follows:

yum install libpng-devel libpng -y
tar zxf matplotlib-2.0.0.tar.gz
cd matplotlib-2.0.0
python setup.py install

7. MySQLdb

MySQLdb is a Python interface for connecting to MySQL.

MySQLdb is published on PyPI under the name MySQL-python. The latest version can be found by searching for that name on the PyPI website.

Taking MySQL-python-1.2.5 as an example, install it as follows:

yum install python-pip python-devel mysql-devel zlib-devel openssl-devel -y
unzip MySQL-python-1.2.5.zip
cd MySQL-python-1.2.5
python setup.py install
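
Once the packages are installed, a quick way to confirm the result is to try importing each one and print its version. The sketch below does this for the seven packages covered in this guide. Note that two of them are imported under a different name than their PyPI package (`sklearn` for scikit-learn, `MySQLdb` for MySQL-python). The names `PACKAGES`, `installed_version`, and `report` are illustrative.

```python
import importlib

# (PyPI package name, import name) for each package in this guide.
PACKAGES = [
    ("numpy", "numpy"),
    ("scipy", "scipy"),
    ("scikit-learn", "sklearn"),
    ("sympy", "sympy"),
    ("pandas", "pandas"),
    ("matplotlib", "matplotlib"),
    ("MySQL-python", "MySQLdb"),
]


def installed_version(import_name):
    """Return the module's version string, "unknown" if it exposes none,
    or None if the module cannot be imported."""
    try:
        module = importlib.import_module(import_name)
    except ImportError:
        return None
    return str(getattr(module, "__version__", "unknown"))


def report(packages=PACKAGES):
    """Build one status line per package."""
    lines = []
    for package_name, import_name in packages:
        version = installed_version(import_name)
        status = "not installed" if version is None else version
        lines.append("%-14s %s" % (package_name, status))
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Running this on every node of the cluster helps catch nodes where an installation step was skipped, since PySpark tasks need the packages wherever they execute.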