
I am trying to get my pyspark application to run on a yarn cluster. The application uses certain libraries which require python3. However, the yarn cluster uses python2 and does not have python3 installed. Is there a way to package my pyspark application with python3 and all the core python3 libraries?

I have been roughly following these steps to create the virtual environment:

virtualenv -p /usr/bin/python3 venv/
source venv/bin/activate
pip install -r requirements.txt
venv-pack -o environment.tar.gz
/usr/bin/spark-submit \
  --master yarn \
  --executor-cores 1 \
  --num-executors 15 \
  --queue wesp_dev \
  --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-DENVIRONMENT=pt" \
  --conf "spark.executor.extraJavaOptions=-DENVIRONMENT=pt" \
  --name "EndpointAttackDetectionEngine" \
  --class com.telus.endpointAttackDetectionEngine.AppMain \
  --keytab $KEY_TAB \
  --principal $PRINCIPAL \
  --driver-memory=4G \
  --archives environment.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.executorEnv.LD_LIBRARY_PATH=./environment/lib/ \
  --conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=./environment/lib/ \
  test.py
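
For the record, even a test.py as small as the following sketch (not necessarily my real application, it just reports which interpreter the driver and executors use) is enough to exercise the packaged environment:

import sys

from pyspark.sql import SparkSession

# test.py sketch (illustrative only): report which Python the driver and executors run
spark = SparkSession.builder.appName("python3-env-check").getOrCreate()
sc = spark.sparkContext

print("driver python  :", sys.version)

def executor_python(_):
    # Runs inside an executor task and reports its interpreter.
    import sys
    yield "{} ({})".format(sys.executable, sys.version.split()[0])

print("executor python:", sc.parallelize(range(2), 2).mapPartitions(executor_python).distinct().collect())

spark.stop()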

However, I ran into a number of issues when I created the virtual environment following the steps above:

  1. I noticed that the python interpreter in venv/bin/python is symlinked to /usr/bin/python. I had to manually delete the symlink and copy the actual python interpreter over, because the cluster would not have python3 at /usr/bin/python.
  2. libpython3.6m.so.1.0 was missing, and the pyspark application was initially failing because of that. I manually copied it into venv/lib/ and pointed spark.executorEnv.LD_LIBRARY_PATH=./environment/lib/ and spark.yarn.appMasterEnv.LD_LIBRARY_PATH=./environment/lib/ at it in spark-submit.
  3. Now I am stuck on Fatal Python error: Py_Initialize: Unable to get the locale encoding ModuleNotFoundError: No module named 'encodings' when I run the pyspark application on the yarn cluster. I manually copied encodings and the other core python3 modules over from /usr/lib64/python3.6, but that doesn't fix the problem.

All this leads me to believe there is something I am missing when I package my environment with venv. There has to be a better way of doing this.
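
For what it's worth, the failure can be reproduced locally, without a round trip to YARN, by unpacking the archive into a fresh directory and running the bundled interpreter from there (rough sketch, run with any system python3):

# check_archive.py - rough sketch: unpack environment.tar.gz somewhere else and
# run the bundled interpreter, which fails the same way if the environment is
# not self-contained
import subprocess
import sys
import tarfile
import tempfile

ARCHIVE = "environment.tar.gz"  # the output of venv-pack above

with tempfile.TemporaryDirectory() as tmp:
    with tarfile.open(ARCHIVE) as tar:
        tar.extractall(tmp)
    # Import a couple of stdlib modules with the bundled interpreter; a failure
    # here means the archive will also fail on the cluster.
    proc = subprocess.run(
        [tmp + "/bin/python", "-c", "import encodings, ssl, json; print('ok')"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    print(proc.stdout.decode())
    sys.exit(proc.returncode)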

  • A better tool for packaging standalone python programs is pex. medium.com/criteo-labs/… Commented Feb 12, 2020 at 8:50
  • If you have to package the interpreter as well, then you will need more advanced tools. I believe BeeWare's briefcase could do such a thing, and there are probably others (pyinstaller, etc.) Commented Feb 12, 2020 at 10:36
  • Hello, any answer for this? I have the same problem. Commented Feb 14, 2020 at 15:55

1 Answer


I've recently faced the same situation.

When you want to work with pyspark and Python3 on a cluster that does not have Python3 installed, you can indeed follow the strategy of using a virtualenv.

However, you have to pick the tool for building the virtualenv carefully. As you noticed, the usual virtualenv command does not fully package the needed Python3 files.
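
You can see what goes wrong by running a small check with the environment's own interpreter (for example ./venv/bin/python check.py, a sketch with illustrative names): with a plain virtualenv the interpreter is typically a symlink and the standard library resolves to files outside the environment, which is exactly why 'encodings' disappears once the archive lands on a machine without Python3.

# check.py - run with the environment's own interpreter (illustrative sketch)
import encodings
import os
import sys

print("interpreter :", sys.executable)
print("is symlink  :", os.path.islink(sys.executable))
# If this resolves to a path outside the environment (e.g. /usr/lib64/python3.6),
# the packed archive will not contain the standard library.
print("encodings   :", os.path.realpath(encodings.__file__))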

With a conda-based Python3 virtualenv, I've successfully managed to use pyspark, in both yarn cluster and client modes, on a Python2.7 yarn cluster.

This is my step-by-step guide:

  1. Create a docker centos 7 container
docker pull centos:centos7
docker run --entrypoint "/bin/sh" -it <image_id_centos:centos7>
  2. Create the virtualenv inside the centos7 container created above
yum install bzip2
curl -O https://repo.anaconda.com/archive/Anaconda3-5.3.1-Linux-x86_64.sh
bash Anaconda3-5.3.1-Linux-x86_64.sh
source ~/.bashrc
conda update conda
conda create -n myvenv python=3.6 anaconda
conda activate myvenv
conda install -n myvenv pip
pip install -r requirements.txt
cd /root/anaconda3/envs/myvenv
tar zcvf myvenv.tar.gz *
  3. Extract your myvenv virtualenv from the docker container
docker cp <container-id>:/root/anaconda3/envs/myvenv/myvenv.tar.gz ~/Downloads
  4. Use the new virtualenv in the cluster.

Do not forget to tag the archive as #ENV in --archives and to point the spark.pyspark.python and spark.pyspark.driver.python conf options at that tag.

${SPARK_HOME}/bin/spark-submit \
--master ${CLUSTER_MANAGER} \
--deploy-mode ${DEPLOY_MODE} \
--num-executors ${NUM_EXECUTORS} \
--driver-memory ${DRIVER_MEMORY} \
--executor-memory ${EXECUTOR_MEMORY} \
--executor-cores ${EXECUTOR_CORES} \
--archives ./myvenv.tar.gz#ENV \
--conf spark.pyspark.python=./ENV/bin/python \
--conf spark.pyspark.driver.python=./ENV/bin/python \
--py-files my_project.zip \
main.py "$@"
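
main.py and my_project.zip above are placeholders for your own code. To confirm that the driver and the executors really pick up the shipped conda environment, the entry point can start with a quick check along these lines (a sketch; numpy just stands in for whatever is listed in your requirements.txt):

# main.py sketch: confirm driver and executors run from the shipped conda env
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("conda-env-check").getOrCreate()

def worker_env(_):
    # Runs inside an executor: report its interpreter and a library installed
    # from requirements.txt (numpy is only an example).
    import sys
    import numpy
    yield "python {} / numpy {}".format(sys.version.split()[0], numpy.__version__)

workers = spark.sparkContext.parallelize(range(4), 4).mapPartitions(worker_env).distinct().collect()
print("driver  : python", sys.version.split()[0])
print("workers :", workers)

spark.stop()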