PySpark - Installing Python libraries in Cluster

Hi,

While working on a PySpark assignment on the cluster, I came across a requirement to install some Python 3 dependency libraries. May I know how to install libraries on the cluster? If I install one, will it be available across the cluster on all executors?

Library to install: itertools

Hi @senthiljdpm,

Are you running PySpark from the command line or from Jupyter? Also, which version are you using?

@abhinavsingh,

I use Spark 2.3.1, and I use a Jupyter notebook for coding.

@abhinavsingh,

Please reply if you have any inputs on this.

Hi @senthiljdpm,

We do not have Spark 2.3.1 on the cluster. I am wondering how you are accessing it on the lab, then.

@abhinavsingh,

I looked at the list of Spark installations and found that 2.3.1 exists, so I am accessing it through Jupyter using the following commands. Please correct me if I am looking at something else.

export SPARK_HOME="/usr/spark-2.3.1/"
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH
export PATH=/usr/local/anaconda/bin:$PATH
jupyter notebook --no-browser --ip 0.0.0.0 --port 8888

@abhinavsingh
Any comments?

@senthiljdpm,

“itertools” is a default Python library (part of the standard library), so it should already be available with PySpark. Can you please try the command below in PySpark 2.3.1 and let us know the response:

import itertools
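
For reference, a quick check along these lines (the sample values are made up for illustration) should confirm that itertools works out of the box, since it ships with every standard Python installation:

import itertools

# itertools is part of the Python standard library, so no cluster-wide install is needed.
pairs = list(itertools.combinations([1, 2, 3], 2))
print(pairs)  # [(1, 2), (1, 3), (2, 3)]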

@raviteja,

My question was more about installing Python dependency libraries on the Spark cluster. I am not sure how the itertools library will help me do this. Can you explain a bit more, please?

@senthiljdpm,

The Python ‘itertools’ module provides tools for iterating over lists, dicts and other data structures; since it is part of the standard library, it does not need to be installed.
To install third-party Python dependencies for the Spark cluster, install them on one node first, and if everything works, install them manually on all the other nodes and then launch your PySpark programs.
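
For pure-Python third-party packages, besides installing them manually on every node as described above, Spark can also ship a zipped package to the executors at runtime. Below is a minimal sketch, assuming a hypothetical deps.zip containing the package (the path and file name are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ship-deps-demo").getOrCreate()

# Distribute the zipped package to every executor; Spark adds it to the
# workers' sys.path so its modules can be imported inside UDFs and RDD functions.
spark.sparkContext.addPyFile("/path/to/deps.zip")  # hypothetical path

The same effect can be achieved at submit time with spark-submit --py-files deps.zip. For packages with native extensions (e.g. numpy or pandas), installing them on each node with pip or conda, as described above, is still the usual approach.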