So, here is my problem. I have a PySpark job stored in DBFS, since I don't have access to Databricks Repos due to org policy, and I also cannot create a new cluster when setting up a spark-submit job, again due to org policy. Is there any way I can execute the PySpark job and pass parameters to it?
1 Answer
Unfortunately, the Spark Submit task requires a new cluster. Depending on how your PySpark job is packaged, you can try the following (see the task type dropdown):
- Use the Python script task - it lets you pick a Python file stored in DBFS
- Use the Python wheel task - if your code is packaged as a wheel file
Both of these task types support execution on an existing interactive cluster, but that will cost you more.
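If it helps, here is a rough sketch of what the Python script task looks like when created through the Jobs API 2.1 (the same thing the task type dropdown does in the UI). The workspace URL, token, cluster ID, script path, and parameters below are all placeholders, so adjust them to your environment; the parameters list shows up in your script as sys.argv.

```python
# Minimal sketch (placeholders, not a tested job definition): create a job whose
# single task is a "Python script" task (spark_python_task) that reads the script
# from DBFS and runs it on an EXISTING interactive cluster.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder

job_spec = {
    "name": "pyspark-job-from-dbfs",
    "tasks": [
        {
            "task_key": "run_script",
            # Reuse an already-running interactive cluster instead of a new job cluster
            "existing_cluster_id": "<interactive-cluster-id>",       # placeholder
            "spark_python_task": {
                "python_file": "dbfs:/<my folder>/<myfilename>",     # script stored in DBFS
                # These arrive in the script as sys.argv[1:]
                "parameters": ["--input", "dbfs:/<my folder>/input", "--run-date", "2023-01-01"],
            },
        }
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```

A Python wheel task would look much the same, just with a python_wheel_task block (package_name / entry_point) in place of spark_python_task.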
6 Comments
DexterMe
Hi Alex, thanks for your reply. The cluster that Spark Submit creates is a temporary cluster? Does it cost less than an interactive cluster?
Alex Ott
Yes, when you create a temporary cluster for a job, it's usually almost 2 times cheaper than an interactive cluster (depending on the tier - standard vs. premium).
DexterMe
Thanks. So I tried reading the file from DBFS, but when I read it in my PySpark code in a notebook I get an error saying no such file or directory, followed by something like /local_disk0/spark-/userfiles-../dbfs:/<my folder>/<myfilename>. Why is it not finding the file?
Alex Ott
It depends on how you do it…
DexterMe
Actually, the script I'm debugging adds a file to an HDFS Cloudera nameservice path, and I'm passing a DBFS file path to the function, which is where it's giving the error.
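Just a guess from the error text, but that usually happens when a dbfs:/ URI is handed to an API that expects a local filesystem path: the string then gets resolved relative to the job's working directory (the /local_disk0/.../userFiles-... folder in your error). DBFS is also mounted on the driver at /dbfs, so local-path APIs need that form instead. A small illustration with made-up paths:

```python
# Illustration only (assumes the standard Databricks /dbfs FUSE mount,
# not the exact code from this thread): the same DBFS file addressed two ways.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined for you on Databricks

dbfs_uri = "dbfs:/my_folder/myfile.csv"      # placeholder path, dbfs:/ scheme
local_mount = "/dbfs/my_folder/myfile.csv"   # same file via the driver-local mount

# Works: Spark (and dbutils.fs) understand the dbfs:/ scheme
df = spark.read.csv(dbfs_uri, header=True)

# Works: plain Python I/O, or anything expecting a local path, must use /dbfs/...
with open(local_mount) as f:
    first_line = f.readline()

# Fails with "No such file or directory": a local-file API given the dbfs:/ URI
# treats it as a relative path under the job's working directory, producing
# something like /local_disk0/spark-.../userFiles-.../dbfs:/...
# with open(dbfs_uri) as f:
#     ...
```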