
I am running an EMR cluster created with the following statement:

$ aws emr create-cluster \
 --name "my_cluster" \
 --log-uri "s3n://somebucket/" \
 --release-label "emr-6.8.0" \
 --service-role "arn:aws:iam::XXXXXXXXXX:role/EMR_DefaultRole" \
 --ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","EmrManagedMasterSecurityGroup":"sg-xxxxxxxx","EmrManagedSlaveSecurityGroup":"sg-xxxxxxxx","KeyName":"some_key","AdditionalMasterSecurityGroups":[],"AdditionalSlaveSecurityGroups":[],"ServiceAccessSecurityGroup":"sg-xxxxxxxx","SubnetId":"subnet-xxxxxxxx"}' \
 --applications Name=Spark Name=Zeppelin \
 --configurations '[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]' \
 --instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","Name":"Core","InstanceType":"r6g.xlarge","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32},"VolumesPerInstance":2}]},"Configurations":[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]},{"InstanceCount":1,"InstanceGroupType":"MASTER","Name":"Primary","InstanceType":"r6g.xlarge","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32},"VolumesPerInstance":2}]},"Configurations":[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]}]' \
 --bootstrap-actions '[{"Args":[],"Name":"install python package","Path":"s3://something/bootstrap/bootstrap-script.sh"}]' \
 --scale-down-behavior "TERMINATE_AT_TASK_COMPLETION" \
 --auto-termination-policy '{"IdleTimeout":3600}' \
 --step-concurrency-level "3" \
 --os-release-label "2.0.20230418.0" \
 --region "us-east-1"

My bootstrap script (bootstrap-script.sh):

#!/bin/bash

echo -e 'Installing Boto3... \n'
which pip3
which python3
pip3 install -U boto3 botocore --user
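One thing worth noting: `pip3 install --user` places packages under the invoking user's home directory (`/home/hadoop/.local` when a bootstrap action runs as hadoop). A variant sketch (my assumption, not part of the original script) installs site-wide instead, so the packages land in a system site-packages directory readable by every OS user, including the one YARN containers run as:

```bash
#!/bin/bash
# Hypothetical variant bootstrap: install site-wide with sudo instead of
# --user, so boto3/botocore end up in a system site-packages directory
# visible to all users, not only hadoop.
set -euxo pipefail
sudo /usr/bin/python3 -m pip install -U boto3 botocore
```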

Once the cluster is up, I add this step:

$ spark-submit --deploy-mode cluster s3://something/py-spark/simple.py

simple.py is just this:

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Simple test') \
    .getOrCreate()

spark.stop()

My step fails, with:

ModuleNotFoundError: No module named 'boto3'
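A quick way to see which interpreter the failing driver actually uses is to submit a small diagnostic script the same way (a hypothetical helper, separate from simple.py; `importlib.util.find_spec` reports where, or whether, boto3 resolves without raising):

```python
# Hypothetical diagnostic script: print which interpreter runs this
# process and where boto3 resolves on its module search path.
import importlib.util
import sys

print("interpreter:", sys.executable)
spec = importlib.util.find_spec("boto3")  # None if boto3 is not importable
print("boto3:", spec.origin if spec else "NOT FOUND")
```

Submitted with `spark-submit --deploy-mode cluster`, its stdout (in the YARN container logs) shows the driver-side interpreter path.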

I logged in to the master node as hadoop and ran:

$ pip3 freeze
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.3
boto==2.49.0
click==8.1.3
docutils==0.14
jmespath==1.0.1
joblib==1.1.0
lockfile==0.11.0
lxml==4.9.1
mysqlclient==1.4.2
nltk==3.7
nose==1.3.4
numpy==1.20.0
py-dateutil==2.2
pystache==0.5.4
python-daemon==2.2.3
python37-sagemaker-pyspark==1.4.2
pytz==2022.2.1
PyYAML==5.4.1
regex==2021.11.10
simplejson==3.2.0
six==1.13.0
tqdm==4.64.0
windmill==1.6

Yet, in my bootstrap logs:

Installing Boto3... 

/usr/bin/pip3
/usr/bin/python3
Collecting boto3
  Downloading boto3-1.26.133-py3-none-any.whl (135 kB)
Collecting botocore
  Downloading botocore-1.29.133-py3-none-any.whl (10.7 MB)
Collecting s3transfer<0.7.0,>=0.6.0
  Downloading s3transfer-0.6.1-py3-none-any.whl (79 kB)
Requirement already satisfied, skipping upgrade: jmespath<2.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3) (1.0.1)
Collecting urllib3<1.27,>=1.25.4
  Downloading urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
Collecting python-dateutil<3.0.0,>=2.1
  Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Requirement already satisfied, skipping upgrade: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore) (1.13.0)
Installing collected packages: urllib3, python-dateutil, botocore, s3transfer, boto3
Successfully installed boto3-1.26.133 botocore-1.29.133 python-dateutil-2.8.2 s3transfer-0.6.1 urllib3-1.26.15

And the log looks the same for all my nodes.

So on the master, as hadoop, I ran:

$ which python3
/bin/python3

Then, just to verify my bootstrap actually did something:

$ /usr/bin/pip3 freeze
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.3
boto==2.49.0
boto3==1.26.133
botocore==1.29.133
click==8.1.3
docutils==0.14

So the python3 I updated in the bootstrap action (/usr/bin/python3) is not the same one hadoop uses by default.
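To confirm whether /bin/python3 and /usr/bin/python3 are really different interpreters (on Amazon Linux 2, /bin is typically a symlink to /usr/bin, so they are usually the same binary), a quick check can be run as hadoop (a hypothetical diagnostic, not from the original post):

```shell
# Hypothetical check: resolve which python3 is on PATH, follow any
# symlinks to the real binary, and list the site-packages it scans.
PY="$(command -v python3)"
echo "python3 on PATH: $PY -> $(readlink -f "$PY")"
"$PY" -c 'import site; print(site.getsitepackages())'
```

If both paths resolve to the same binary, the difference is not the interpreter itself but which site directories (system vs. per-user `~/.local`) are visible to the user running the process.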

Yet I tried to make sure PySpark uses the right Python in my EMR configuration:

{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]}

But PYSPARK_PYTHON doesn't seem to be set on any of the nodes when I log in. I do not understand why.

I am looking for the correct steps to follow to get "import boto3" working from my PySpark script (I do not want to make changes to simple.py).


Update: it seems to work in client mode:

$ spark-submit --deploy-mode client s3://something/py-spark/simple.py

But of course, I want to run it in production, in cluster mode...
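This difference is consistent with where the driver runs: in client mode it runs on the master node as hadoop (which can see packages installed with `--user` under /home/hadoop/.local), while in cluster mode it runs inside a YARN container under a different user. A per-job override (a sketch with paths assumed, untested here) would pin the interpreter explicitly at submit time:

```bash
# Hypothetical per-job override: force the cluster-mode driver and
# executors onto a specific interpreter via Spark conf.
spark-submit \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 \
  --conf spark.pyspark.python=/usr/bin/python3 \
  s3://something/py-spark/simple.py
```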

1 Answer

While this may not directly answer your question, I find the EMR CLI an easier way to package dependencies (imagine you need more than just boto3) and submit steps to EMR (Serverless or on EC2).

Referencing the examples (Python build system), you should have the following folder structure after running emr init:

project_name
├── Dockerfile
├── simple.py
└── pyproject.toml

Next, edit pyproject.toml so its [project] table lists the dependencies (name and version below are placeholders):

[project]
name = "project_name"
version = "0.0.1"
dependencies = [
    'boto3==1.26.133'
]

Then, package the zip file:

emr package --entry-point simple.py 

Then, deploy to S3:

emr deploy \
    --entry-point simple.py \
    --s3-code-uri s3://xxxxx

Finally, submit the step:

emr run \
    --entry-point simple.py \
    --cluster-id xxx \
    --s3-code-uri s3://xxxxx

2 Comments

I tried it and I get an error! emr run --entry-point simple.py --cluster-id j-3LEV5XXXX --s3-code-uri s3://something/tmp/ ends up with: "RuntimeError: --show-stdout is not compatible with projects that make use of --archives."
Can you try a few things? 1/ add the --wait argument; 2/ if that fails, downgrade to emr-cli 0.0.8.
