
I am trying to submit a PySpark job whose files are on ADLS Gen2 to Azure Kubernetes Service (AKS) and get the following exception:

Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2595)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3269)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.spark.deploy.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:191)
    at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:147)
    at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:145)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:145)
    at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$6(SparkSubmit.scala:365)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:365)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593)
    ... 27 more

My spark-submit looks like this:

$SPARK_HOME/bin/spark-submit \
  --master k8s://https://XXX \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.kubernetes.file.upload.path=file:///tmp \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=XXX \
  --conf spark.hadoop.fs.azure.account.auth.type.XXX.dfs.core.windows.net=SharedKey \
  --conf spark.hadoop.fs.azure.account.key.XXX.dfs.core.windows.net=XXX \
  --py-files abfss://[email protected]/py-files/ml_pipeline-0.0.1-py3.8.egg \
  abfss://[email protected]/py-files/main_kubernetes.py

The job runs just fine on my VM and also loads data from ADLS Gen2 without problems. In the post java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found it is recommended to download the package and add it to the spark/jars folder. But I don't know where to download it, or why it has to be included in the first place when everything works fine locally.
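
For reference, that recommendation amounts to something like the sketch below. The version used here is an assumption and must match the Hadoop version your Spark build ships with; note also that hadoop-azure pulls in transitive dependencies of its own, which is why resolving it via --packages is often easier than copying jars by hand.

# Sketch: fetch hadoop-azure from Maven Central into Spark's jars folder.
# 3.2.0 is an assumed version; it must match your Spark build's Hadoop version.
HADOOP_AZURE_VERSION=3.2.0
wget -P $SPARK_HOME/jars \
  https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/$HADOOP_AZURE_VERSION/hadoop-azure-$HADOOP_AZURE_VERSION.jar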

EDIT: I managed to include the jars in the Docker container. If I open a shell in that container and run the job there, it works fine and loads the files from ADLS. But if I submit the job to Kubernetes, it throws the same exception as before. Please, can someone help?

Spark 3.1.1, Python 3.8.5, Ubuntu 18.04

  • Please try running with spark-submit --packages org.apache.hadoop:hadoop-azure:3.2.0. It will download the package from Maven. Commented Jun 3, 2021 at 6:41
  • Hi Jim, thank you for the comment. Unfortunately that doesn't work either. I get the following exception: Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-373409f0-dc3b-40f1-a8a1-307e365b16a1-1.0.xml (No such file or directory) Commented Jun 3, 2021 at 8:42
  • please refer to stackoverflow.com/questions/66722861/… Commented Jun 3, 2021 at 8:47
  • I tried to manually include the jars but ran into dependency issues. How do I know which version of hadoop-azure to use, and how can I download the jar including all its dependencies? Commented Jun 3, 2021 at 9:53
  • The version should be the same as your Hadoop version; see the sketch below for one way to check it. Commented Jun 3, 2021 at 12:56
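
For a standard Spark distribution, one way to find that Hadoop version is to look at the bundled hadoop-* jars, whose filenames carry the version (a sketch assuming the default layout under $SPARK_HOME):

# The bundled Hadoop jars carry the version in their filename.
ls $SPARK_HOME/jars/hadoop-common-*.jar
# e.g. hadoop-common-3.2.0.jar -> use org.apache.hadoop:hadoop-azure:3.2.0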

1 Answer


So I managed to fix my problem. It is definitely a workaround, but it works.

I modified the PySpark Docker image by changing its entrypoint to:

ENTRYPOINT [ "/opt/entrypoint.sh" ]

Now I was able to run the container without it exiting immediately:

docker run -td <docker_image_id>

And could open a shell in it:

docker exec -it <docker_container_id> /bin/bash

At this point I could submit the Spark job inside the container with the --packages flag:

$SPARK_HOME/bin/spark-submit \
  --master local[*] \
  --deploy-mode client \
  --name spark-python \
  --packages org.apache.hadoop:hadoop-azure:3.2.0 \
  --conf spark.hadoop.fs.azure.account.auth.type.user.dfs.core.windows.net=SharedKey \
  --conf spark.hadoop.fs.azure.account.key.user.dfs.core.windows.net=xxx \
  --files "abfss://[email protected]/config.yml" \
  --py-files "abfss://[email protected]/jobs.zip" \
  "abfss://[email protected]/main.py"

Spark then downloaded the required dependencies, saved them under /root/.ivy2 in the container, and executed the job successfully.
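
To see what was resolved before copying it out, you can inspect that cache (a quick check assuming the default layout that Spark's --packages handling uses):

# Resolved jars land in .ivy2/jars; the Ivy metadata cache lives in .ivy2/cache.
ls /root/.ivy2/jars
ls /root/.ivy2/cache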

I copied the whole folder from the container onto the host machine:

sudo docker cp <docker_container_id>:/root/.ivy2/ /opt/spark/.ivy2/

And modified the Dockerfile again to copy the folder into the image:

COPY .ivy2 /root/.ivy2

Finally I could submit the job to Kubernetes with this newly built image, and everything ran as expected.
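
For completeness, the answer does not show the final cluster-mode command, but it would essentially be the command from the question run against the rebuilt image with --packages added, so that Spark resolves hadoop-azure from the pre-populated Ivy cache (a sketch; the XXX placeholders are as in the question):

$SPARK_HOME/bin/spark-submit \
  --master k8s://https://XXX \
  --deploy-mode cluster \
  --name spark-pi \
  --packages org.apache.hadoop:hadoop-azure:3.2.0 \
  --conf spark.kubernetes.file.upload.path=file:///tmp \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=XXX \
  --conf spark.hadoop.fs.azure.account.auth.type.XXX.dfs.core.windows.net=SharedKey \
  --conf spark.hadoop.fs.azure.account.key.XXX.dfs.core.windows.net=XXX \
  --py-files abfss://[email protected]/py-files/ml_pipeline-0.0.1-py3.8.egg \
  abfss://[email protected]/py-files/main_kubernetes.py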
