
I am new to the world of Spark and Kubernetes. I built a Spark Docker image from the official Spark 3.0.1 distribution (bundled with Hadoop 3.2) using the docker-image-tool.sh utility.

I have also created another Docker image for a Jupyter notebook and am trying to run Spark on Kubernetes in client mode. I first run my Jupyter notebook as a pod, do a port forward using kubectl, and access the notebook UI from my system at localhost:8888. All seems to be working fine, and I am able to run commands successfully from the notebook.
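The port forward itself is just the following (the service name and namespace here are from my setup):

# forward local port 8888 to the notebook service in the spark namespace
kubectl -n spark port-forward svc/my-notebook-deployment 8888:8888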

Now I am trying to access Azure Data Lake Gen2 from my notebook using the Hadoop ABFS connector. I am setting up the Spark context as below.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Create the Spark config for our Kubernetes-based cluster manager
sparkConf = SparkConf()
sparkConf.setMaster("k8s://https://kubernetes.default.svc.cluster.local:443")
sparkConf.setAppName("spark")
sparkConf.set("spark.kubernetes.container.image", "<<my_repo>>/spark-py:latest")
sparkConf.set("spark.kubernetes.namespace", "spark")
sparkConf.set("spark.executor.instances", "3")
sparkConf.set("spark.executor.cores", "2")
sparkConf.set("spark.driver.memory", "512m")
sparkConf.set("spark.executor.memory", "512m")
sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark")
# Client mode: executors must be able to reach the driver running in this pod
sparkConf.set("spark.driver.port", "29413")
sparkConf.set("spark.driver.host", "my-notebook-deployment.spark.svc.cluster.local")

# ADLS Gen2 authentication for the ABFS connector
sparkConf.set("fs.azure.account.auth.type", "SharedKey")
sparkConf.set("fs.azure.account.key.<<storage_account_name>>.dfs.core.windows.net", "<<account_key>>")

spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()

Then I run the command below to read a CSV file present at the ADLS location:

df = spark.read.csv("abfss://<<container>>@<<storage_account>>.dfs.core.windows.net/")

On running it I get the error:

Py4JJavaError: An error occurred while calling o443.csv. : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found

After some research, I found that I would have to explicitly include the hadoop-azure jar for the appropriate classes to be available. I downloaded the jar from here, put it in the /spark-3.0.1-bin-hadoop3.2/jars folder, and built the image again.
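Roughly, the steps I followed were as below (the hadoop-azure version is my pick to match the bundled Hadoop 3.2; adjust as needed):

cd spark-3.0.1-bin-hadoop3.2
# fetch hadoop-azure from Maven Central into the distribution's jars folder
curl -L -o jars/hadoop-azure-3.2.0.jar \
  https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.2.0/hadoop-azure-3.2.0.jar
# rebuild the PySpark image with the extra jar baked in
./bin/docker-image-tool.sh -r <<my_repo>> -t latest \
  -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build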

Unfortunately, I am still getting the same error. I manually verified that the jar file is indeed present in the Docker image and contains the class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem.
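I checked along these lines (assuming unzip is available in the image; any jar/zip listing tool would do):

# confirm the jar made it into the image
docker run --rm <<my_repo>>/spark-py:latest ls /opt/spark/jars | grep -i hadoop-azure
# confirm the class is inside the jar (assumes unzip exists in the image)
docker run --rm <<my_repo>>/spark-py:latest sh -c \
  'unzip -l /opt/spark/jars/hadoop-azure-*.jar | grep SecureAzureBlobFileSystem'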

I looked at the entrypoint.sh present in the spark-3.0.1-bin-hadoop3.2/kubernetes/dockerfiles/spark folder, which is the entry point of our Spark Docker image. It adds all the jars present in the spark-3.0.1-bin-hadoop3.2/jars/ folder to the classpath.

# If HADOOP_HOME is set and SPARK_DIST_CLASSPATH is not set, set it here so Hadoop jars are available to the executor.
# It does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations of this value from elsewhere e.g. Docker/K8s.
if [ -n "${HADOOP_HOME}"  ] && [ -z "${SPARK_DIST_CLASSPATH}"  ]; then
  export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
fi

if ! [ -z ${HADOOP_CONF_DIR+x} ]; then
  SPARK_CLASSPATH="$HADOOP_CONF_DIR:$SPARK_CLASSPATH";
fi

According to my understanding, Spark should be able to find the class on its classpath without any additional setJars configuration.

Can someone please guide me on how to resolve this? I might be missing something quite basic here.

  • I found this useful, and it resolved my issue locally after I put the hadoop-azure and azure-storage jars in my Spark install location at C:\Spark\jar\. Commented Jan 30, 2023 at 6:19

2 Answers


Applying the solution provided here...

How do we specify maven dependencies in pyspark

We can start a Spark session and include the required jar from Maven:

from pyspark.sql import SparkSession

# Pull the hadoop-azure connector (and its transitive dependencies)
# from Maven when the session starts
spark = SparkSession.builder.master("local[*]") \
        .config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.3.1') \
        .getOrCreate()
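The same idea should apply to the Kubernetes config in the question; a minimal sketch, assuming the hadoop-azure version should match the Hadoop build Spark ships with (3.2 for spark-3.0.1-bin-hadoop3.2):

# added to the sparkConf from the question, before getOrCreate();
# the version here is an assumption and should match the bundled Hadoop
sparkConf.set("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.2.0")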



Looks like I needed to add the hadoop-azure package to the Docker image that runs the Jupyter notebook and acts as the Spark driver. It's working as expected after doing that.
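Concretely, it came down to something like this during the notebook image build (a sketch; the version and the $SPARK_HOME path are specific to my setup):

# run while building the Jupyter/driver image; $SPARK_HOME points at the
# Spark install inside that image, version chosen to match Hadoop 3.2
curl -L -o "$SPARK_HOME/jars/hadoop-azure-3.2.0.jar" \
  https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.2.0/hadoop-azure-3.2.0.jar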

1 Comment

Hello Ali, I am facing the same problem right now. Did you only include the hadoop-azure package in the jars folder? And did you have to modify the name of the package or take any additional steps? Unfortunately it doesn't work for me.
