
Solution:

Add the following environment variables to the container:

export PYSPARK_PYTHON=/usr/bin/python3.9

export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9


I am trying to create a Spark container and spark-submit a PySpark script.

I am able to create the container, but running the PySpark script throws the following error:

Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory

Questions:

  1. Any idea why this error is occurring?
  2. Do I need to install Python separately, or does it come bundled with the Spark download?
  3. Do I need to install PySpark separately, or does it come bundled with the Spark download?
  4. What is preferable for the Python installation: download it and put it under /opt/python, or use apt-get?
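
For context on question 1: spark-submit launches the interpreter named by PYSPARK_PYTHON and falls back to the literal command name "python" when it is unset. A minimal shell sketch of that fallback (not part of the original post):

```shell
# spark-submit resolves the worker interpreter from PYSPARK_PYTHON
# and falls back to the literal command name "python" when unset.
interp="${PYSPARK_PYTHON:-python}"
echo "spark-submit would run: $interp"
# If that name is not on PATH, Java's ProcessBuilder raises
# "error=2, No such file or directory" -- the exception shown above.
command -v "$interp" >/dev/null || echo "'$interp' not found on PATH"
```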

PySpark script:

from pyspark import SparkContext

sc = SparkContext("local", "count app")
words = sc.parallelize(
    ["scala",
     "java",
     "hadoop",
     "spark",
     "akka",
     "spark vs hadoop",
     "pyspark",
     "pyspark and spark"]
)
counts = words.count()
print("Number of elements in RDD -> %i" % counts)

output of spark-submit:

newuser@c1f28230da16:~$ spark-submit count.py

WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/02/01 19:58:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
    at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:319)
    at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:250)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
    ... 15 more
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

output of printenv:

newuser@c1f28230da16:~$ printenv

HOME=/home/newuser
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
PYTHONPATH=:/opt/spark/python:/opt/spark/python/lib/py4j-0.10.4-src.zip
TERM=xterm
SHLVL=1
SPARK_HOME=/opt/spark
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/java/bin:/opt/spark/bin
_=/usr/bin/printenv

myspark Dockerfile:

ARG JDK_PACKAGE=openjdk-14.0.2_linux-x64_bin.tar.gz
ARG SPARK_HOME=/opt/spark
ARG SPARK_PACKAGE=spark-3.0.1-bin-hadoop3.2.tgz


#MAINTAINER [email protected]
#LABEL maintainer="[email protected]"


############################################
###  Install openjava
############################################

# Base image stage 1
FROM ubuntu AS jdk

ARG JAVA_HOME
ARG JDK_PACKAGE

WORKDIR /opt/

## download open java
#ADD https://download.java.net/java/GA/jdk14.0.2/205943a0976c4ed48cb16f1043c5c647/12/GPL/$JDK_PACKAGE /
#ADD $JDK_PACKAGE /
COPY $JDK_PACKAGE .

RUN mkdir -p $JAVA_HOME/ && \
    tar -zxf $JDK_PACKAGE --strip-components 1  -C $JAVA_HOME  && \
    rm -f $JDK_PACKAGE


############################################
###  Install spark search
############################################

# Base image stage 2
FROM ubuntu AS spark

#ARG JAVA_HOME
ARG SPARK_HOME
ARG SPARK_PACKAGE

WORKDIR /opt/

## download spark
COPY $SPARK_PACKAGE .

RUN mkdir -p $SPARK_HOME/  && \
    tar -zxf $SPARK_PACKAGE --strip-components 1  -C $SPARK_HOME  && \
    rm -f $SPARK_PACKAGE

# Mount elasticsearch.yml config
### ADD config/elasticsearch.yml /opt/elasticsearch/config/elasticsearch.yml

############################################
###  final
############################################

FROM ubuntu AS finalbuild

ARG JAVA_HOME
ARG SPARK_HOME
ARG SPARK_PACKAGE

WORKDIR /opt/

# get artifacts from previous stages
COPY --from=jdk $JAVA_HOME $JAVA_HOME
COPY --from=spark $SPARK_HOME $SPARK_HOME

# Setup JAVA_HOME; this is useful for the docker command line
ENV JAVA_HOME $JAVA_HOME
ENV SPARK_HOME $SPARK_HOME

# setup paths
ENV PATH $PATH:$JAVA_HOME/bin
ENV PATH $PATH:$SPARK_HOME/bin
ENV PYTHONPATH $PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip




# Expose ports
# EXPOSE 9200
# EXPOSE 9300

# Define mountable directories.
#VOLUME ["/data"]


## give permission to entire setup directory
RUN useradd newuser --create-home --shell /bin/bash  && \
    echo 'newuser:newpassword' | chpasswd && \
    chown -R newuser $SPARK_HOME $JAVA_HOME  && \
    chown -R newuser:newuser /home/newuser && \
    chmod 755 /home/newuser
    #chown -R newuser:newuser /home/newuser
    #chown -R newuser /home/newuser  && \

# Install Python
RUN apt-get update && \
    apt-get install -yq curl  && \
    apt-get install -yq vim  && \
    apt-get install -yq  python3.9



## Install PySpark and Numpy
#RUN \
#    pip install --upgrade pip && \
#    pip install numpy && \
#    pip install pyspark
#

USER newuser

WORKDIR /home/newuser

# RUN  chown -R newuser /home/newuser
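
A possible in-image fix worth noting (a sketch, not from the original post): Debian/Ubuntu's python3.9 package installs /usr/bin/python3.9 but no plain "python" command, so either a symlink or the PySpark interpreter variables could be added near the end of the final stage. The paths assume the apt-get install step above succeeded:

```dockerfile
# Sketch: make the interpreter discoverable at image-build time.
# Assumes /usr/bin/python3.9 exists after the apt-get install step above.
RUN ln -s /usr/bin/python3.9 /usr/local/bin/python
ENV PYSPARK_PYTHON=/usr/bin/python3.9
ENV PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9
```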
  • Looks like you have been able to answer your own question? If so, you can provide the solution as an answer. See stackoverflow.com/help/self-answer Commented Feb 2, 2021 at 10:54
  • Done, need to wait another 14 hours before selecting it as the working solution. Commented Feb 3, 2021 at 5:57

1 Answer


I added the following environment variables to the container and it works:

export PYSPARK_PYTHON=/usr/bin/python3.9

export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9
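
Note that plain exports only last for the current shell session; to persist them across container sessions they can be appended to the user's ~/.bashrc (or set with ENV in the Dockerfile). A minimal sketch of the shell-profile variant, assuming bash is the container user's login shell:

```shell
# Persist the interpreter settings for future interactive shells.
# Assumes bash is the login shell of the container user.
echo 'export PYSPARK_PYTHON=/usr/bin/python3.9' >> ~/.bashrc
echo 'export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9' >> ~/.bashrc
```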
