I am running an EMR cluster created with the following statement:
$ aws emr create-cluster \
--name "my_cluster" \
--log-uri "s3n://somebucket/" \
--release-label "emr-6.8.0" \
--service-role "arn:aws:iam::XXXXXXXXXX:role/EMR_DefaultRole" \
--ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","EmrManagedMasterSecurityGroup":"sg-xxxxxxxx","EmrManagedSlaveSecurityGroup":"sg-xxxxxxxx","KeyName":"some_key","AdditionalMasterSecurityGroups":[],"AdditionalSlaveSecurityGroups":[],"ServiceAccessSecurityGroup":"sg-xxxxxxxx","SubnetId":"subnet-xxxxxxxx"}' \
--applications Name=Spark Name=Zeppelin \
--configurations '[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]' \
--instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","Name":"Core","InstanceType":"r6g.xlarge","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32},"VolumesPerInstance":2}]},"Configurations":[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]},{"InstanceCount":1,"InstanceGroupType":"MASTER","Name":"Primary","InstanceType":"r6g.xlarge","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32},"VolumesPerInstance":2}]},"Configurations":[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]}]' \
--bootstrap-actions '[{"Args":[],"Name":"install python package","Path":"s3://something/bootstrap/bootstrap-script.sh"}]' \
--scale-down-behavior "TERMINATE_AT_TASK_COMPLETION" \
--auto-termination-policy '{"IdleTimeout":3600}' \
--step-concurrency-level "3" \
--os-release-label "2.0.20230418.0" \
--region "us-east-1"
My bootstrap script (bootstrap-script.sh):
#!/bin/bash
echo -e 'Installing Boto3... \n'
which pip3
which python3
pip3 install -U boto3 botocore --user
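Since the bootstrap uses pip's --user flag, the packages land in the invoking user's per-user site directory rather than system-wide. Where that directory is for a given interpreter can be checked with the standard site module (a generic Python diagnostic, nothing EMR-specific):

```python
# Show this interpreter's executable and the directories it uses for
# "pip install --user" versus system-wide installs.
import site
import sys

user_site = site.getusersitepackages()   # where "pip install --user" puts packages
global_sites = site.getsitepackages()    # system-wide site-packages dirs

print("interpreter:", sys.executable)
print("user site-packages:", user_site)
print("global site-packages:", global_sites)
```

Running this under each python3 on a node shows whether two interpreter paths actually share the same package directories.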
Once the cluster is up, I add this step:
$ spark-submit --deploy-mode cluster s3://something/py-spark/simple.py
simple.py is just this:
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Simple test') \
    .getOrCreate()
spark.stop()
The step fails with:
ModuleNotFoundError: No module named 'boto3'
I logged in to the master node as hadoop and ran:
$ pip3 freeze
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.3
boto==2.49.0
click==8.1.3
docutils==0.14
jmespath==1.0.1
joblib==1.1.0
lockfile==0.11.0
lxml==4.9.1
mysqlclient==1.4.2
nltk==3.7
nose==1.3.4
numpy==1.20.0
py-dateutil==2.2
pystache==0.5.4
python-daemon==2.2.3
python37-sagemaker-pyspark==1.4.2
pytz==2022.2.1
PyYAML==5.4.1
regex==2021.11.10
simplejson==3.2.0
six==1.13.0
tqdm==4.64.0
windmill==1.6
Yet my bootstrap logs show:
Installing Boto3...
/usr/bin/pip3
/usr/bin/python3
Collecting boto3
Downloading boto3-1.26.133-py3-none-any.whl (135 kB)
Collecting botocore
Downloading botocore-1.29.133-py3-none-any.whl (10.7 MB)
Collecting s3transfer<0.7.0,>=0.6.0
Downloading s3transfer-0.6.1-py3-none-any.whl (79 kB)
Requirement already satisfied, skipping upgrade: jmespath<2.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3) (1.0.1)
Collecting urllib3<1.27,>=1.25.4
Downloading urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
Collecting python-dateutil<3.0.0,>=2.1
Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Requirement already satisfied, skipping upgrade: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore) (1.13.0)
Installing collected packages: urllib3, python-dateutil, botocore, s3transfer, boto3
Successfully installed boto3-1.26.133 botocore-1.29.133 python-dateutil-2.8.2 s3transfer-0.6.1 urllib3-1.26.15
The log looks the same on all my nodes.
So on the master, as hadoop, I ran:
$ which python3
/bin/python3
Then, just to verify my bootstrap actually did something:
$ /usr/bin/pip3 freeze
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.3
boto==2.49.0
boto3==1.26.133
botocore==1.29.133
click==8.1.3
docutils==0.14
So the python3 I updated in the bootstrap action (/usr/bin/python3) is not the one the hadoop user gets by default (/bin/python3).
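A quick way to compare the two interpreters is to ask each one where it runs from and where (if anywhere) it would import boto3; the snippet below can be run under both /usr/bin/python3 and /bin/python3:

```python
# Report which interpreter is running and where it would import boto3
# from; run under each python3 path and compare the output.
import importlib.util
import sys

spec = importlib.util.find_spec("boto3")

print("interpreter:", sys.executable)
print("boto3 location:", spec.origin if spec else "not importable")
```

If the two paths print different locations (or one prints "not importable"), they are resolving imports from different installations.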
Yet I tried to make sure PySpark uses the right Python via my EMR configuration:
{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}
But PYSPARK_PYTHON doesn't seem to be set on any of the nodes when I log in, and I don't understand why.
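For what it's worth, spark-env.sh is sourced by Spark's launch scripts, not by login shells, so the export would not appear in an SSH session even if the classification was applied. The generated file can be inspected directly (path assumes a stock EMR layout):

```shell
# PYSPARK_PYTHON from the spark-env classification lands in Spark's
# environment file, which login shells do not source; check the file itself.
if [ -f /etc/spark/conf/spark-env.sh ]; then
  grep PYSPARK_PYTHON /etc/spark/conf/spark-env.sh
else
  echo "no spark-env.sh on this machine"
fi
```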
I am looking for the correct steps to get "import boto3" working from my PySpark script in cluster mode (I do not want to modify simple.py).
Update: it seems to work in client mode:
$ spark-submit --deploy-mode client s3://something/py-spark/simple.py
But for production, of course, I need it to run in cluster mode...
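For reference, in cluster mode the driver runs inside a YARN container rather than in the hadoop login environment on the master, which may explain the difference. Spark on YARN does expose standard properties for setting environment variables on the application master and executors; a sketch (untested on this cluster) of pinning the interpreter at submit time:

```shell
# Sketch: set PYSPARK_PYTHON for the YARN application master (the driver,
# in cluster mode) and for the executors, per submission.
spark-submit --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 \
  --conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python3 \
  s3://something/py-spark/simple.py
```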