I am running a PySpark application (v2.4.0) on Kubernetes. My Spark application depends on the numpy and tensorflow modules. Please suggest a way to add these dependencies to the Spark executors.

I have checked the documentation: we can include remote dependencies using --py-files, --jars, etc., but nothing is mentioned about library dependencies.

1 Answer

I found a way to add library dependencies to Spark applications on K8s, so I am sharing it here.

Add the installation commands for the required dependencies to the Dockerfile and rebuild the Spark image. When the Spark job is submitted, the new containers are instantiated with those dependencies already installed. An example build-and-submit sequence is sketched after the Dockerfile below.

Dockerfile (/{spark_folder_path}/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile) contents:

RUN apk add --no-cache python && \
    apk add --no-cache python3 && \
    python -m ensurepip && \
    python3 -m ensurepip && \
    # Remove ensurepip since it adds no functionality once pip is
    # installed on the image, and it takes up about 1.6MB
    rm -r /usr/lib/python*/ensurepip && \
    pip install --upgrade pip setuptools && \
    # Install python3 packages by using pip3
    pip install numpy && \
    # Remove the .cache to save space
    rm -r /root/.cache
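
For completeness, here is a minimal sketch of rebuilding the image and pointing spark-submit at it. The registry host my-registry.example.com, the image tag v2.4.0-numpy, the app name, and the example application path are placeholders; the docker-image-tool.sh script, the spark-py image name, and the spark.kubernetes.container.image setting are part of the standard Spark 2.4 Kubernetes workflow.

# Rebuild and push the Spark images (run from the Spark distribution root);
# the registry and tag below are placeholders
./bin/docker-image-tool.sh -r my-registry.example.com/spark -t v2.4.0-numpy build
./bin/docker-image-tool.sh -r my-registry.example.com/spark -t v2.4.0-numpy push

# Submit the PySpark job against the rebuilt Python image
./bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name my-pyspark-app \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image=my-registry.example.com/spark/spark-py:v2.4.0-numpy \
    local:///opt/spark/examples/src/main/python/pi.py

If tensorflow is also needed, it can in principle be added to the same pip install line in the Dockerfile, but note that tensorflow's pip wheels target glibc, so a Debian/Ubuntu-based Spark image may be required instead of the Alpine-based one shown above.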