I'm new to Python and trying to launch my PySpark project on Spark on AWS EMR. The project lives on AWS S3 and has several Python files, like this:
/folder1
  - main.py
/utils
  - utils1.py
  - utils2.py
I use the following command:
spark-submit --py-files s3://bucket/utils s3://bucket/folder1/main.py
But I get the error:
Traceback (most recent call last):
File "/mnt/tmp/spark-1e38eb59-3ddd-4deb-8529-eace7465b6ce/main.py", line 15, in <module>
from utils.utils1 import foo
ModuleNotFoundError: No module named 'utils'
What do I have to fix in my command? I know that I can pack my project into a zip file, but for now I need to do it without packing; still, I'd be grateful if you could show both solutions.
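For the packing approach, here is a minimal sketch (the helper name make_py_files_zip and all paths are illustrative, not part of any Spark API) that zips the utils package so the package directory itself sits at the root of the archive, which is the layout --py-files expects. Adding an empty utils/__init__.py first makes the zipped package import cleanly:

```python
import os
import zipfile

def make_py_files_zip(package_dir: str, zip_path: str) -> None:
    """Zip a Python package so the package directory itself
    (e.g. "utils/...") sits at the root of the archive."""
    base = os.path.abspath(package_dir)
    parent = os.path.dirname(base)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(base):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    # arcname keeps the "utils/" prefix inside the zip,
                    # so "from utils.utils1 import foo" resolves later
                    zf.write(full, os.path.relpath(full, parent))

# make_py_files_zip("utils", "utils.zip")
# upload utils.zip to S3, then:
#   spark-submit --py-files s3://bucket/utils.zip s3://bucket/folder1/main.py
```

Spark distributes the archive to the driver and executors and puts it on sys.path, so the import in main.py works unchanged.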
UPD:
The EMR cluster's controller log shows that the launch command looks like this:
hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --packages org.apache.spark:spark-avro_2.12:3.1.1 --driver-memory 100G --conf spark.driver.maxResultSize=100G --conf spark.hadoop.fs.s3.maxRetries=20 --conf spark.sql.sources.partitionOverwriteMode=dynamic --py-files s3://bucket/dir1/dir2/utils.zip --master yarn s3://bucket/dir1/dir2/dir3/main.py --args
But now I get the following error:
java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/cluster-id/dir1/dir2/utils.zip does not exist
What's wrong?
Comment: --py-files supports egg and zip files, not folders, AFAIK.
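Since --py-files takes egg/zip archives rather than bare folders, the mechanism behind it can be demonstrated in plain Python: once a zip is on sys.path, zipimport resolves packages straight out of the archive, which is exactly what Spark arranges on the driver and executors. A self-contained sketch with a hypothetical utils.zip built in a temp directory:

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny utils.zip like the one --py-files would ship.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "utils.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("utils/__init__.py", "")
    zf.writestr("utils/utils1.py", "def foo():\n    return 'hello from zip'\n")

# Spark adds the shipped archive to sys.path; Python's zipimport
# then loads the package from inside the zip, no unpacking needed.
sys.path.insert(0, zip_path)
from utils.utils1 import foo
print(foo())  # hello from zip
```

A plain folder passed to --py-files has no such import hook, which is why the directory form fails while the zipped form works.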