
I'm new to Python and trying to launch my PySpark project on Spark on AWS EMR. The project lives on AWS S3 and has several Python files, like this:

/folder1
 - main.py
/utils
 - utils1.py
 - utils2.py

I use the following command:

spark-submit --py-files s3://bucket/utils s3://bucket/folder1/main.py

But I get the error:

Traceback (most recent call last):
  File "/mnt/tmp/spark-1e38eb59-3ddd-4deb-8529-eace7465b6ce/main.py", line 15, in <module>
    from utils.utils1 import foo
ModuleNotFoundError: No module named 'utils'

What do I have to fix in my command? I know that I can pack my project into a zip file, but for now I need to do it without packing; however, I'd be grateful if you told me both solutions.

UPD:

The EMR cluster's controller log says that the launch command looks like this:

hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --packages org.apache.spark:spark-avro_2.12:3.1.1 --driver-memory 100G --conf spark.driver.maxResultSize=100G --conf spark.hadoop.fs.s3.maxRetries=20 --conf spark.sql.sources.partitionOverwriteMode=dynamic --py-files s3://bucket/dir1/dir2/utils.zip --master yarn s3://bucket/dir1/dir2/dir3/main.py --args

But now I get the following error: java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/cluster-id/dir1/dir2/utils.zip does not exist

What's wrong?

4 Comments

  • "Now I need to do it without packing" ... why? --py-files supports .egg and .zip files, not folders, AFAIK. Commented Oct 8, 2021 at 16:07
  • Because we just started the project and don't have CI yet. But if it's bad practice in Python, I'll put everything in a zip file. Commented Oct 8, 2021 at 16:12
  • You don't need "CI". You could make a simple shell script that runs zip and spark-submit together. Commented Oct 8, 2021 at 16:19
  • Also, note that EMR support should be able to provide you with "best practices" for how to submit your code. Commented Oct 8, 2021 at 16:20

1 Answer


Although it is not recommended (see the rest of this answer for a better option), if you do not want to zip your files you can pass the individual utils files to --py-files as a comma-separated list, placed before the actual entry-point file, instead of passing the utils folder:

'Args': ['spark-submit',
         '--py-files',
         '{your_s3_path_here}/utils/utils1.py,{your_s3_path_here}/utils/utils2.py',
         'main.py']
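That Args list typically sits inside an EMR step definition. Here is a minimal sketch, assuming submission through boto3's add_job_flow_steps; the bucket name and cluster ID below are placeholders, not values from the question:

```python
# Sketch of an EMR step passing individual utils files via --py-files.
# "your-bucket" and the job flow ID are hypothetical placeholders.
s3_prefix = "s3://your-bucket"

step = {
    "Name": "run-main",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--py-files",
            f"{s3_prefix}/utils/utils1.py,{s3_prefix}/utils/utils2.py",
            f"{s3_prefix}/folder1/main.py",
        ],
    },
}

# To actually submit (requires boto3 and AWS credentials):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

Note that the value after --py-files is a single comma-separated string, not multiple arguments.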

Better: zip the utils folder

You can zip utils and include it like this:

First, create an empty __init__.py file at the root level of utils (i.e. utils/__init__.py).

Then, from the parent directory of utils, make a zip of it (for example utils.zip), so that the utils/ folder itself sits at the root of the archive.
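The zip step above can also be scripted with Python's standard zipfile module. A sketch, assuming the layout from the question (the demo below builds a throwaway copy of that layout in a temporary directory just to show the resulting archive structure):

```python
import os
import tempfile
import zipfile

def zip_package(package_dir: str, zip_path: str) -> None:
    """Zip package_dir so the package folder sits at the archive root
    (utils/__init__.py, utils/utils1.py, ... inside the zip) -- this is
    what lets `from utils.utils1 import foo` work on the executors."""
    parent = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, parent))

# Demo with a throwaway copy of the question's layout.
with tempfile.TemporaryDirectory() as tmp:
    pkg = os.path.join(tmp, "utils")
    os.makedirs(pkg)
    for fname in ("__init__.py", "utils1.py", "utils2.py"):
        open(os.path.join(pkg, fname), "w").close()
    zip_path = os.path.join(tmp, "utils.zip")
    zip_package(pkg, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        names = sorted(zf.namelist())
print(names)  # entries are prefixed with utils/, as required
```

If the entries appeared without the utils/ prefix (i.e. the zip was made from inside the folder), the import would fail with the same ModuleNotFoundError as in the question.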

For submission, you can add this zip as:

'Args': ['spark-submit',
         '--py-files',
         '{your_s3_path_here}/utils.zip',
         'main.py']

This assumes you have __init__.py, utils1.py, and utils2.py in utils.zip.

Note: you might also need to add this zip to the SparkContext with sc.addPyFile("utils.zip") before the imports below.

You can now use them as

from utils.utils1 import *
from utils.utils2 import *

8 Comments

What do I have to specify here: sc.addPyFile("utils.zip")? Should I specify the full path on S3, for example sc.addPyFile("/d1/d2/utils.zip"), or is just the file name enough?
You won't need the full path, just utils.zip
I'm still getting java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/s-3P1Q1C80AK1FP/utils.zip does not exist
Did you submit utils via '--py-files' '{your_s3_path_here}/utils.zip' as shown in the answer?
Do you have access to cluster launching script or bootstrap script?
