
I have been fighting with this the whole day. I am able to install and use a package (graphframes) with the Spark shell or a connected Jupyter notebook, but I would like to move it to my Kubernetes-based Spark environment with spark-submit. My Spark version is 3.0.1. I downloaded the latest available .jar file (graphframes-0.8.1-spark3.0-s_2.12.jar) from spark-packages and put it into the jars folder. I use a variation of the standard Spark Dockerfile to build my images. My spark-submit command looks like this:

$SPARK_HOME/bin/spark-submit \
--master k8s://https://kubernetes.docker.internal:6443 \
--deploy-mode cluster \
--conf spark.executor.instances=$2 \
--conf spark.kubernetes.container.image=myimage.io/repositorypath \
--packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
--jars "local:///opt/spark/jars/graphframes-0.8.1-spark3.0-s_2.12.jar" \
path/to/my/script/script.py

But it ends with an error:

Ivy Default Cache set to: /opt/spark/.ivy2/cache
The jars for the packages stored in: /opt/spark/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e833e157-44f5-4055-81a4-3ab524176ef5;1.0
    confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-e833e157-44f5-4055-81a4-3ab524176ef5-1.0.xml (No such file or directory)

Here are my logs, just in case:

(base) konstantinigin@Konstantins-MBP spark-3.0.1-bin-hadoop3.2 % kubectl logs scalableapp-py-7669dd784bd59f67-driver
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ sort -t_ -k4 -n
+ grep SPARK_JAVA_OPT_
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' 3 == 2 ']'
+ '[' 3 == 3 ']'
++ python3 -V
+ pyv3='Python 3.7.3'
+ export PYTHON_VERSION=3.7.3
+ PYTHON_VERSION=3.7.3
+ export PYSPARK_PYTHON=python3
+ PYSPARK_PYTHON=python3
+ export PYSPARK_DRIVER_PYTHON=python3
+ PYSPARK_DRIVER_PYTHON=python3
+ '[' -n '' ']'
+ '[' -z ']'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.1.2.145 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///opt/spark/data/ScalableApp.py --number_of_executors 2 --dataset USAir --links 100
Ivy Default Cache set to: /opt/spark/.ivy2/cache
The jars for the packages stored in: /opt/spark/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e833e157-44f5-4055-81a4-3ab524176ef5;1.0
    confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-e833e157-44f5-4055-81a4-3ab524176ef5-1.0.xml (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
    at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:70)
    at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:62)
    at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:563)
    at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:176)
    at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:245)
    at org.apache.ivy.Ivy.resolve(Ivy.java:523)
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1387)
    at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Has anyone seen something similar? Maybe you have an idea of what I am doing wrong here?

4 Answers


Adding this configuration to spark-submit worked for me:

spark-submit \
 --conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" \
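
For the cluster-mode submit from the question, that would look roughly like this (a sketch reusing the image, package coordinate and script path from above; the only addition is pointing the Ivy cache and home at a directory the driver container can write to):

$SPARK_HOME/bin/spark-submit \
  --master k8s://https://kubernetes.docker.internal:6443 \
  --deploy-mode cluster \
  --conf spark.executor.instances=$2 \
  --conf spark.kubernetes.container.image=myimage.io/repositorypath \
  --conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" \
  --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
  path/to/my/script/script.py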

It seems to be a known Spark issue that is being resolved:

https://github.com/apache/spark/pull/32397
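
Until that fix is available, the practical workaround is the one from the answer above: point Ivy at a directory the driver container can actually write to. Spark also exposes this as the spark.jars.ivy setting (a sketch, not from the thread; the /tmp path is just an example):

spark-submit \
  --conf spark.jars.ivy=/tmp/.ivy2 \
  ...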


I managed to solve a similar problem where I wasn't able to download the hadoop-azure jars with the --packages flag. It is definitely a workaround, but it works.

I modified the PySpark Docker container by changing the entrypoint to:

ENTRYPOINT [ "/opt/entrypoint.sh" ]

Now I was able to run the container without it exiting immediately:

docker run -td <docker_image_id>

And could open a shell inside it:

docker exec -it <docker_container_id> /bin/bash

At this point I could submit the Spark job inside the container with the --packages flag:

$SPARK_HOME/bin/spark-submit \
  --master local[*] \
  --deploy-mode client \
  --name spark-python \
  --packages org.apache.hadoop:hadoop-azure:3.2.0 \
  --conf spark.hadoop.fs.azure.account.auth.type.user.dfs.core.windows.net=SharedKey \
  --conf spark.hadoop.fs.azure.account.key.user.dfs.core.windows.net=xxx \
  --files "abfss://[email protected]/config.yml" \
  --py-files "abfss://[email protected]/jobs.zip" \
  "abfss://[email protected]/main.py"

Spark then downloaded the required dependencies, saved them under /root/.ivy2 in the container, and executed the job successfully.

I copied the whole folder from the container onto the host machine:

sudo docker cp <docker_container_id>:/root/.ivy2/ /opt/spark/.ivy2/

And modified the Dockerfile again to copy the folder into the image:

COPY .ivy2 /root/.ivy2

Finally I could submit the job to Kubernetes with this newly built image, and everything runs as expected.
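
Put together, the Dockerfile side of this workaround could look roughly like the following (a sketch under my own assumptions: the base image tag is hypothetical, and the copied .ivy2 folder is assumed to sit next to the Dockerfile in the build context):

# Hypothetical base image; use whatever your Spark image build produces
FROM myrepo/spark-py:3.0.1

# Pre-populated Ivy cache copied out of the throwaway container
COPY .ivy2 /root/.ivy2

# Standard entrypoint shipped with the Spark Docker files
ENTRYPOINT [ "/opt/entrypoint.sh" ]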

1 Comment

This trick is so good. I am not working with k8s, but it helps me rebuild my Docker image without needing to download dependencies every time I create a new container and run a Spark job.

Okay, I solved my issue. Not sure whether it is going to work for other packages, but it lets me run graphframes in the mentioned setup:

  1. Download the latest .jar file from spark-packages.
  2. Remove the version part of its name, leaving only the package name. In my case it was:
mv ./graphframes-0.8.1-spark3.0-s_2.12.jar ./graphframes.jar
  3. Unpack it using the jar command:
# Extract jar contents
jar xf graphframes.jar

Now here comes the important part. I put all the packages I use in one dependencies folder that I later submit to Kubernetes in zipped form. The logic behind this folder is explained in another question of mine, which I again answered myself; see here. Here I copy the graphframes folder from the contents extracted in the previous step into that dependencies folder:

  4. Copy the graphframes folder from the contents extracted before to your dependencies folder:
cp -r ./graphframes $SPARK_HOME/path/to/your/dependencies
  5. Add the original .jar file to the jars folder inside your $SPARK_HOME.
  6. Include --jars in your spark-submit command, pointing at the new .jar file:
$SPARK_HOME/bin/spark-submit \
--master k8s://https://kubernetes.docker.internal:6443 \
--deploy-mode cluster \
--conf spark.executor.instances=$2 \
--conf spark.kubernetes.container.image=docker.io/path/to/your/image \
--jars "local:///opt/spark/jars/graphframes.jar" \ ...
  7. Include your dependencies as described here (a combined sketch of the whole sequence follows below).
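
The whole sequence as shell commands, just to make the steps concrete (a sketch; the dependencies path, the zip step and the idea of shipping the zipped folder via --py-files are my own assumptions, not something fixed by the steps above):

# 1-3: rename and unpack the jar downloaded from spark-packages
mv ./graphframes-0.8.1-spark3.0-s_2.12.jar ./graphframes.jar
jar xf graphframes.jar

# 4: copy the extracted graphframes Python package into the dependencies folder
cp -r ./graphframes $SPARK_HOME/path/to/your/dependencies

# 5: keep the original jar on the JVM classpath inside the image
cp ./graphframes.jar $SPARK_HOME/jars/

# 6-7: zip the dependencies folder and submit, pointing --jars at the jar
#      baked into the image and --py-files at the zipped Python dependencies
(cd $SPARK_HOME/path/to/your && zip -r dependencies.zip dependencies)
$SPARK_HOME/bin/spark-submit \
  --master k8s://https://kubernetes.docker.internal:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=docker.io/path/to/your/image \
  --jars "local:///opt/spark/jars/graphframes.jar" \
  --py-files "path/to/your/dependencies.zip" \
  path/to/your/script.py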

I am in a hurry right now, but in the near future I will edit this post and add a link to a short Medium article about handling dependencies in PySpark. I hope it will be useful to someone :)

4 Comments

Did you ever find a solution that allows the use of the --packages flag? This bug is currently affecting me as well.
--packages did not work for me. I believe there is a problem with Ivy, the package manager Spark uses.
How would you handle this situation for a bunch of dependencies? My PySpark job relies on the org.apache.hadoop:hadoop-azure:3.2.0 package, which has a dozen dependencies. I cannot supply all of them manually. The weird thing is also that it works in local mode... so the packages have to be there already somewhere.
I also don't completely understand step 7. Do you supply the zipped jars as --py-files?
