
I'm new to Python and trying to launch my PySpark project on Spark on AWS EMR. The project lives on AWS S3 and has several Python files, like this:

/folder1
 - main.py
/utils
 - utils1.py
 - utils2.py

I use the following command:

spark-submit --py-files s3://bucket/utils s3://bucket/folder1/main.py

But I get the error:

Traceback (most recent call last):
  File "/mnt/tmp/spark-1e38eb59-3ddd-4deb-8529-eace7465b6ce/main.py", line 15, in <module>
    from utils.utils1 import foo
ModuleNotFoundError: No module named 'utils'

What do I have to fix in my command? I know that I can pack my project into a zip file, but for now I need to do it without packing; however, I'd be grateful if you told me both solutions.

UPD:

The EMR cluster's controller log says that the launch command looks like this:

hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --packages org.apache.spark:spark-avro_2.12:3.1.1 --driver-memory 100G --conf spark.driver.maxResultSize=100G --conf spark.hadoop.fs.s3.maxRetries=20 --conf spark.sql.sources.partitionOverwriteMode=dynamic --py-files s3://bucket/dir1/dir2/utils.zip --master yarn s3://bucket/dir1/dir2/dir3/main.py --args

But now I get the following error: java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/cluster-id/dir1/dir2/utils.zip does not exist

What's wrong?

4 Comments

  • "Now I need to do it without packing" ... why? --py-files supports .egg and .zip files, not folders, AFAIK. Commented Oct 8, 2021 at 16:07
  • Because we just started the project and don't have CI yet. But if it's bad practice in Python, I'll put everything in a zip file. Commented Oct 8, 2021 at 16:12
  • You don't need "CI". You could make a simple shell script that runs zip and spark-submit together. Commented Oct 8, 2021 at 16:19
  • Also, note that EMR support should be able to provide you with "best practices" for how to submit your code. Commented Oct 8, 2021 at 16:20

1 Answer


Although it is not recommended (see the rest of this answer for a better option), if you do not want to zip your files you can pass the individual utils files to --py-files as a comma-separated list, placed before the actual entry-point file, instead of passing the utils folder:

'Args': ['spark-submit',
         '--py-files',
         '{your_s3_path_here}/utils/utils1.py,{your_s3_path_here}/utils/utils2.py',
         'main.py']
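That Args list typically sits inside an EMR step definition. Here is a minimal sketch, assuming submission through boto3's add_job_flow_steps; the bucket name and cluster ID below are placeholders, not values from the question:

```python
# Sketch of an EMR step passing individual utils files via --py-files.
# "your-bucket" and the job flow ID are hypothetical placeholders.
s3_prefix = "s3://your-bucket"

step = {
    "Name": "run-main",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--py-files",
            f"{s3_prefix}/utils/utils1.py,{s3_prefix}/utils/utils2.py",
            f"{s3_prefix}/folder1/main.py",
        ],
    },
}

# To actually submit (requires boto3 and AWS credentials):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

Note that the value after --py-files is a single comma-separated string, not multiple arguments.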

Better: zip the utils folder

You can zip utils and include it like this:

First, create an empty __init__.py file at the root level of utils (i.e. utils/__init__.py).

Then, from the parent directory of utils, make a zip of it (for example utils.zip), so that the utils/ folder itself sits at the root of the archive.
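The zip step above can also be scripted with Python's standard zipfile module. A sketch, assuming the layout from the question (the demo below builds a throwaway copy of that layout in a temporary directory just to show the resulting archive structure):

```python
import os
import tempfile
import zipfile

def zip_package(package_dir: str, zip_path: str) -> None:
    """Zip package_dir so the package folder sits at the archive root
    (utils/__init__.py, utils/utils1.py, ... inside the zip) -- this is
    what lets `from utils.utils1 import foo` work on the executors."""
    parent = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, parent))

# Demo with a throwaway copy of the question's layout.
with tempfile.TemporaryDirectory() as tmp:
    pkg = os.path.join(tmp, "utils")
    os.makedirs(pkg)
    for fname in ("__init__.py", "utils1.py", "utils2.py"):
        open(os.path.join(pkg, fname), "w").close()
    zip_path = os.path.join(tmp, "utils.zip")
    zip_package(pkg, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        names = sorted(zf.namelist())
print(names)  # entries are prefixed with utils/, as required
```

If the entries appeared without the utils/ prefix (i.e. the zip was made from inside the folder), the import would fail with the same ModuleNotFoundError as in the question.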

For submission, you can add this zip as:

'Args': ['spark-submit',
         '--py-files',
         '{your_s3_path_here}/utils.zip',
         'main.py']

This assumes you have __init__.py, utils1.py, and utils2.py in utils.zip.

Note: you might also need to add this zip to the SparkContext with sc.addPyFile("utils.zip") before the imports below.

You can now use them as

from utils.utils1 import *
from utils.utils2 import *

8 Comments

What do I have to specify here: sc.addPyFile("utils.zip")? Should I specify the full path on S3, for example sc.addPyFile("/d1/d2/utils.zip"), or is just the file name enough?
You won't need the full path, just utils.zip
I'm still getting java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/s-3P1Q1C80AK1FP/utils.zip does not exist
Did you submit utils via '--py-files' '{your_s3_path_here}/utils.zip' as shown in the answer?
Do you have access to cluster launching script or bootstrap script?
