
I want to build a Docker image for PySpark 3.0.1 with Hadoop 3.2.x. In the Dockerfile, if I use pip install pyspark==3.0.1, it installs PySpark 3.0 but the bundled Hadoop is 2.7. Is there a way to achieve this, or an example of a Dockerfile for the same?
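For reference, one way to confirm which Hadoop client libraries a pip-installed PySpark bundles is to ask Hadoop's VersionInfo class through the Py4J gateway. This goes through the private _jvm attribute, so treat it as a quick check rather than a stable API:

# Quick check of the Hadoop client version bundled with a pip-installed PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("hadoop-version-check").getOrCreate()
# VersionInfo reports the Hadoop client libraries Spark was built against;
# for pip install pyspark==3.0.1 this prints a 2.7.x version.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
spark.stop()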

  • pip install pyspark doesn't "install hadoop" with the ability to actually run a Hadoop cluster. You should also use separate Docker services for running Spark and Hadoop anyway. Commented Jan 15, 2021 at 17:41
  • @Kapil did you find a solution? OneCricketeer if I want to use it with AWS S3 (which depends on hadoop) - I want to use the latest hadoop version... Commented Feb 19, 2021 at 12:58
  • @ItayB added an answer with the Dockerfile. Commented Feb 25, 2021 at 10:02

1 Answer


I was able to create a Docker image with PySpark 3.0 and Hadoop 3.2 using the Dockerfile below. Please note that COPY app.py /app/app.py is only there to copy in the code that you want to run.

FROM python:3.6-alpine3.10

ARG SPARK_VERSION=3.0.1
ARG HADOOP_VERSION_SHORT=3.2
ARG HADOOP_VERSION=3.2.0
ARG AWS_SDK_VERSION=1.11.375

RUN apk add --no-cache bash openjdk8-jre && \
  apk add --no-cache libc6-compat && \
  ln -s /lib/libc.musl-x86_64.so.1 /lib/ld-linux-x86-64.so.2 && \
  pip install findspark

# Download and extract Spark
RUN wget -qO- https://www-eu.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}.tgz | tar zx -C /opt && \
    mv /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT} /opt/spark

# To read IAM role for Fargate
RUN echo spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper > /opt/spark/conf/spark-defaults.conf

# Add hadoop-aws and aws-sdk
RUN wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -P /opt/spark/jars/ && \
    wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar -P /opt/spark/jars/

ENV SPARK_HOME=/opt/spark
ENV PATH="/opt/spark/bin:${PATH}"
ENV PYSPARK_PYTHON=python3
ENV PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.10.9-src.zip:${PYTHONPATH}"

RUN mkdir -p $SPARK_HOME/conf
RUN echo "SPARK_LOCAL_IP=127.0.0.1" > $SPARK_HOME/conf/spark-env.sh

# Copy python script for batch
COPY app.py /app/app.py
# Define default command
CMD ["/bin/bash"]
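For completeness, here is a minimal sketch of what app.py could look like when run inside this image. The bucket and object key are placeholders, and findspark is pointed at the /opt/spark install baked into the image:

# app.py - minimal batch job sketch; the s3a path below is a placeholder.
import findspark
findspark.init("/opt/spark")  # make the Spark install in the image importable

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-batch-example").getOrCreate()

# The s3a:// scheme is provided by the hadoop-aws jar added above; credentials
# come from the task role via the provider set in spark-defaults.conf.
df = spark.read.csv("s3a://example-bucket/input/data.csv", header=True)
df.show(10)

spark.stop()

The image tag below is arbitrary; build and run it with something like:

docker build -t pyspark-hadoop3 .
docker run --rm pyspark-hadoop3 python3 /app/app.py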

1 Comment

Useful references for the versions used above: the Python base image (hub.docker.com/_/python), HADOOP_VERSION / HADOOP_VERSION_SHORT and Spark releases (downloads.apache.org/spark), and AWS_SDK_VERSION (repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle).
