I am running an EMR cluster created with the following statement:
$ aws emr create-cluster \
--name "my_cluster" \
--log-uri "s3n://somebucket/" \
--release-label "emr-6.8.0" \
--service-role "arn:aws:iam::XXXXXXXXXX:role/EMR_DefaultRole" \
--ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","EmrManagedMasterSecurityGroup":"sg-xxxxxxxx","EmrManagedSlaveSecurityGroup":"sg-xxxxxxxx","KeyName":"some_key","AdditionalMasterSecurityGroups":[],"AdditionalSlaveSecurityGroups":[],"ServiceAccessSecurityGroup":"sg-xxxxxxxx","SubnetId":"subnet-xxxxxxxx"}' \
--applications Name=Spark Name=Zeppelin \
--configurations '[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]' \
--instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","Name":"Core","InstanceType":"r6g.xlarge","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32},"VolumesPerInstance":2}]},"Configurations":[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]},{"InstanceCount":1,"InstanceGroupType":"MASTER","Name":"Primary","InstanceType":"r6g.xlarge","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32},"VolumesPerInstance":2}]},"Configurations":[{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}]}]' \
--bootstrap-actions '[{"Args":[],"Name":"install python package","Path":"s3://something/bootstrap/bootstrap-script.sh"}]' \
--scale-down-behavior "TERMINATE_AT_TASK_COMPLETION" \
--auto-termination-policy '{"IdleTimeout":3600}' \
--step-concurrency-level "3" \
--os-release-label "2.0.20230418.0" \
--region "us-east-1"
My bootstrap script (bootstrap-script.sh):
#!/bin/bash
echo -e 'Installing Boto3... \n'
which pip3
which python3
pip3 install -U boto3 botocore --user
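Since the bootstrap uses pip's --user flag, the packages land in the invoking user's per-user site directory rather than system-wide. Where that directory is for a given interpreter can be checked with the standard site module (a generic Python diagnostic, nothing EMR-specific):

```python
# Show this interpreter's executable and the directories it uses for
# "pip install --user" versus system-wide installs.
import site
import sys

user_site = site.getusersitepackages()   # where "pip install --user" puts packages
global_sites = site.getsitepackages()    # system-wide site-packages dirs

print("interpreter:", sys.executable)
print("user site-packages:", user_site)
print("global site-packages:", global_sites)
```

Running this under each python3 on a node shows whether two interpreter paths actually share the same package directories.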
Once the cluster is up, I add this step:
$ spark-submit --deploy-mode cluster s3://something/py-spark/simple.py
simple.py is just this:
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Simple test') \
    .getOrCreate()
spark.stop()
The step fails with:
ModuleNotFoundError: No module named 'boto3'
I logged in to the master node as hadoop and ran:
$ pip3 freeze
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.3
boto==2.49.0
click==8.1.3
docutils==0.14
jmespath==1.0.1
joblib==1.1.0
lockfile==0.11.0
lxml==4.9.1
mysqlclient==1.4.2
nltk==3.7
nose==1.3.4
numpy==1.20.0
py-dateutil==2.2
pystache==0.5.4
python-daemon==2.2.3
python37-sagemaker-pyspark==1.4.2
pytz==2022.2.1
PyYAML==5.4.1
regex==2021.11.10
simplejson==3.2.0
six==1.13.0
tqdm==4.64.0
windmill==1.6
Yet my bootstrap logs show:
Installing Boto3...
/usr/bin/pip3
/usr/bin/python3
Collecting boto3
Downloading boto3-1.26.133-py3-none-any.whl (135 kB)
Collecting botocore
Downloading botocore-1.29.133-py3-none-any.whl (10.7 MB)
Collecting s3transfer<0.7.0,>=0.6.0
Downloading s3transfer-0.6.1-py3-none-any.whl (79 kB)
Requirement already satisfied, skipping upgrade: jmespath<2.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3) (1.0.1)
Collecting urllib3<1.27,>=1.25.4
Downloading urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
Collecting python-dateutil<3.0.0,>=2.1
Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Requirement already satisfied, skipping upgrade: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore) (1.13.0)
Installing collected packages: urllib3, python-dateutil, botocore, s3transfer, boto3
Successfully installed boto3-1.26.133 botocore-1.29.133 python-dateutil-2.8.2 s3transfer-0.6.1 urllib3-1.26.15
The log looks the same on all my nodes.
So on the master, as hadoop, I ran:
$ which python3
/bin/python3
Then, just to verify my bootstrap actually did something:
$ /usr/bin/pip3 freeze
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.3
boto==2.49.0
boto3==1.26.133
botocore==1.29.133
click==8.1.3
docutils==0.14
So the python3 I updated in the bootstrap action (/usr/bin/python3) is not the one the hadoop user gets by default (/bin/python3).
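A quick way to compare the two interpreters is to ask each one where it runs from and where (if anywhere) it would import boto3; the snippet below can be run under both /usr/bin/python3 and /bin/python3:

```python
# Report which interpreter is running and where it would import boto3
# from; run under each python3 path and compare the output.
import importlib.util
import sys

spec = importlib.util.find_spec("boto3")

print("interpreter:", sys.executable)
print("boto3 location:", spec.origin if spec else "not importable")
```

If the two paths print different locations (or one prints "not importable"), they are resolving imports from different installations.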
Yet I tried to make sure PySpark uses the right Python via my EMR configuration:
{"Classification":"spark-env","Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3"}}],"Properties":{}}
But PYSPARK_PYTHON doesn't seem to be set on any of the nodes when I log in, and I don't understand why.
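For what it's worth, spark-env.sh is sourced by Spark's launch scripts, not by login shells, so the export would not appear in an SSH session even if the classification was applied. The generated file can be inspected directly (path assumes a stock EMR layout):

```shell
# PYSPARK_PYTHON from the spark-env classification lands in Spark's
# environment file, which login shells do not source; check the file itself.
if [ -f /etc/spark/conf/spark-env.sh ]; then
  grep PYSPARK_PYTHON /etc/spark/conf/spark-env.sh
else
  echo "no spark-env.sh on this machine"
fi
```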
I am looking for the correct steps to get "import boto3" working from my PySpark script in cluster mode (I do not want to modify simple.py).
Update: it seems to work in client mode:
$ spark-submit --deploy-mode client s3://something/py-spark/simple.py
But for production, of course, I need it to run in cluster mode...
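For reference, in cluster mode the driver runs inside a YARN container rather than in the hadoop login environment on the master, which may explain the difference. Spark on YARN does expose standard properties for setting environment variables on the application master and executors; a sketch (untested on this cluster) of pinning the interpreter at submit time:

```shell
# Sketch: set PYSPARK_PYTHON for the YARN application master (the driver,
# in cluster mode) and for the executors, per submission.
spark-submit --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 \
  --conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python3 \
  s3://something/py-spark/simple.py
```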