The same PySpark code works on r7a instances but not on r7g or r8g on an EMR cluster (7.5).
I build the Python environment with conda and use it in PySpark:
conda create -n pyspark python=3.9 --show-channel-urls --channel=conda-forge --override-channels
conda init bash
python -m pip install conda-pack # installed separately from req.txt because no hash is given for it
conda run -n pyspark python -m pip install -r req.txt
conda pack -n pyspark --output ./pulse-spark-deployment.tar.gz
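One sanity check worth doing before upload (a sketch; the archive path comes from the `conda pack` command above, the extraction directory is my own choice) is to inspect the packed interpreter with `file`:

```shell
# Sketch: verify the packed interpreter's architecture before uploading.
# The archive path matches the conda-pack output above; /tmp/envcheck is
# an arbitrary scratch directory.
if [ -f ./pulse-spark-deployment.tar.gz ]; then
    mkdir -p /tmp/envcheck
    tar -xzf ./pulse-spark-deployment.tar.gz -C /tmp/envcheck
    file /tmp/envcheck/bin/python
    # An x86_64 build reports "x86-64"; a Graviton-compatible build
    # reports "ARM aarch64".
else
    echo "archive not found; run conda pack first"
fi
```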
The environment is used with the following command line (all on one line; split here for readability):
bash -c "
PYSPARK_PYTHON=./environment/bin/python
PYTHONPATH=./app
spark-submit
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.yarn.appMasterEnv.PYTHONPATH=./app
--master yarn
--deploy-mode cluster
--packages
org.apache.spark:spark-avro_2.12:3.5.2,
org.apache.hadoop:hadoop-aws:3.4.0,
org.apache.spark:spark-hadoop-cloud_2.12:3.5.2
--archives
s3://<bucket>/spark/spark-deployment.tar.gz#environment,
s3://<bucket>/spark/spark.zip#app
s3://<bucket>/spark/script.py
"
It works perfectly if I use r7a instances; it fails if I use Graviton (r7g or r8g).
The errors I get from YARN are:
User application exited with 126
and
./environment/bin/python: ./environment/bin/python: cannot execute binary file
This is typical of an executable built for the wrong architecture, but adding --platform linux-aarch64 to the conda create line does not change anything.
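For reference, this is the cross-build variant I have in mind; the environment name is hypothetical, and the caveat in the comments is my own assumption rather than a verified fix:

```shell
# Sketch of a cross-architecture conda build (assumes a recent conda).
# CONDA_SUBDIR steers the solver to the linux-aarch64 package index;
# newer conda versions also accept the two-token flag
# "--platform linux-aarch64" (a space, not a hyphen).
CONDA_SUBDIR=linux-aarch64 conda create -n pyspark-arm python=3.9 \
    --show-channel-urls --channel=conda-forge --override-channels
# Caveat: this affects conda packages only. A "pip install -r req.txt"
# run inside the environment still downloads wheels for the machine pip
# runs on, so pip dependencies need a native aarch64 build host or a
# separate cross-install step.
```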
What could be going wrong here?