PySpark - Installing Python libraries in Cluster

Hi,

While working on a PySpark assignment on the cluster, I came across a requirement to install some Python 3 dependency libraries. May I know how to install libraries on the cluster? If I install one, will it be available across the cluster on all executors?

Library to install: itertools

Hi @senthiljdpm,

Are you running PySpark from the command line or from Jupyter? Also, which version are you using?

@abhinavsingh,

I use Spark 2.3.1, and I use a Jupyter notebook for coding.

@abhinavsingh,

Please reply if you have any inputs on this.

Hi @senthiljdpm,

We do not have Spark 2.3.1 on the cluster. I am wondering how you are accessing it on the lab, then.

@abhinavsingh,

I looked at the list of Spark installations and found that 2.3.1 exists, so I am accessing it through Jupyter using the following commands. Please correct me if I am looking at something else.

export SPARK_HOME="/usr/spark-2.3.1/"
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH
export PATH=/usr/local/anaconda/bin:$PATH
jupyter notebook --no-browser --ip 0.0.0.0 --port 8888

@abhinavsingh
Any comments?

@senthiljdpm,

“itertools” is a default Python library (part of the standard library), so it should already be available with PySpark. Can you please try the command below in PySpark 2.3.1 and let us know the response:

import itertools
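
For reference, a quick check along these lines (the sample values are made up for illustration) should confirm that itertools works out of the box, since it ships with every standard Python installation:

import itertools

# itertools is part of the Python standard library, so no cluster-wide install is needed.
pairs = list(itertools.combinations([1, 2, 3], 2))
print(pairs)  # [(1, 2), (1, 3), (2, 3)]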

@raviteja,

My question was more about installing Python dependency libraries on the Spark cluster. I am not sure how the itertools library will help me do this. Can you explain a bit more, please?

@senthiljdpm,

The Python ‘itertools’ module provides tools for iterating over lists, dicts and other data structures; since it is part of the standard library, it does not need to be installed.
To install third-party Python dependencies for the Spark cluster, install them on one node first, and if everything works, install them manually on all the other nodes and then launch your PySpark programs.
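
For pure-Python third-party packages, besides installing them manually on every node as described above, Spark can also ship a zipped package to the executors at runtime. Below is a minimal sketch, assuming a hypothetical deps.zip containing the package (the path and file name are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ship-deps-demo").getOrCreate()

# Distribute the zipped package to every executor; Spark adds it to the
# workers' sys.path so its modules can be imported inside UDFs and RDD functions.
spark.sparkContext.addPyFile("/path/to/deps.zip")  # hypothetical path

The same effect can be achieved at submit time with spark-submit --py-files deps.zip. For packages with native extensions (e.g. numpy or pandas), installing them on each node with pip or conda, as described above, is still the usual approach.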