I am trying to get my PySpark application to run on a YARN cluster. The application uses certain libraries that require Python 3. However, the YARN cluster uses Python 2 and does not have Python 3 installed. Is there a way to package my PySpark application with Python 3 and all the core Python 3 libraries?
I have been roughly following these steps to create a virtual environment and submit the job:
virtualenv -p /usr/bin/python3 venv/
source venv/bin/activate
pip install -r requirements.txt
venv-pack -o environment.tar.gz
/usr/bin/spark-submit --master yarn --executor-cores 1 --num-executors 15 --queue wesp_dev --deploy-mode cluster --conf "spark.driver.extraJavaOptions=-DENVIRONMENT=pt" --conf "spark.executor.extraJavaOptions=-DENVIRONMENT=pt" --name "EndpointAttackDetectionEngine" --class com.telus.endpointAttackDetectionEngine.AppMain --keytab $KEY_TAB --principal $PRINCIPAL --driver-memory=4G --archives environment.tar.gz#environment --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.LD_LIBRARY_PATH=./environment/lib/ --conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=./environment/lib/ test.py
However, I ran into a number of issues when I created the virtual environment following the steps above:
- I noticed that the Python interpreter in venv/bin/python is symlinked to /usr/bin/python. Because the cluster does not have Python 3 at /usr/bin/python, I had to manually delete the symlinks and copy the Python interpreter over instead.
- libpython3.6m.so.1.0 was missing, and the PySpark application was failing initially because of that. I manually copied it over to venv/lib/ and specified spark.executorEnv.LD_LIBRARY_PATH=./environment/lib/ and spark.yarn.appMasterEnv.LD_LIBRARY_PATH=./environment/lib/ in spark-submit.
- Now I am stuck on the following error when I run the PySpark application in the YARN cluster:
Fatal Python error: Py_Initialize: Unable to get the locale encoding
ModuleNotFoundError: No module named 'encodings'
I manually copied encodings and other core modules for Python 3 over from /usr/lib64/python3.6, but that doesn't fix the problem.
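To make the issues above easier to check, here is a minimal sketch of a script that could be run against the environment directory before packing it (or against the unpacked archive). It only reports whether bin/python is a symlink and whether an encodings package exists under lib/python*/. The function name inspect_env and the report keys are hypothetical names made up for illustration, not part of venv-pack or Spark, and it assumes the layout venv/bin and venv/lib (it does not look in lib64).

```python
import os
import sys


def inspect_env(env_dir):
    """Collect a few facts about a virtual environment directory.

    Hypothetical helper for illustration only; assumes a POSIX layout
    with bin/python and lib/pythonX.Y/ inside env_dir.
    """
    python_path = os.path.join(env_dir, "bin", "python")
    is_link = os.path.islink(python_path)
    report = {
        # lexists is True even for a symlink whose target is missing
        "python_exists": os.path.lexists(python_path),
        "python_is_symlink": is_link,
        "symlink_target": os.readlink(python_path) if is_link else None,
        "encodings_dirs": [],
    }

    # Look for the standard-library 'encodings' package under lib/python*/
    lib_dir = os.path.join(env_dir, "lib")
    if os.path.isdir(lib_dir):
        for name in sorted(os.listdir(lib_dir)):
            candidate = os.path.join(lib_dir, name, "encodings")
            if name.startswith("python") and os.path.isdir(candidate):
                report["encodings_dirs"].append(candidate)
    return report


if __name__ == "__main__":
    # Default to the 'venv' directory used in the steps above
    env = sys.argv[1] if len(sys.argv) > 1 else "venv"
    for key, value in inspect_env(env).items():
        print(f"{key}: {value}")
```

If python_is_symlink is true, or encodings_dirs is empty, the packed archive presumably still depends on a Python 3 installation existing on the cluster nodes, which would be consistent with the errors I am seeing.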
All this leads me to believe there is something I am missing when I package my environment with venv. There has to be a better way of doing this.