
I followed these instructions and installed Apache Spark (PySpark) 2.3.1 on my machine, which has the following specifications:

  • Ubuntu 18.04
  • JDK 10
  • Python 3.6

When I create a SparkSession, either indirectly by launching pyspark from the shell or directly by creating a session in my app with:

spark = pyspark.sql.SparkSession.builder.appName('test').getOrCreate()

I get the following exception:

Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
    at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
....
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
    at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3107)
    at java.base/java.lang.String.substring(String.java:1873)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
    ... 22 more
Traceback (most recent call last):
  File "/home/welshamy/tools/anaconda3/lib/python3.6/site-packages/pyspark/python/pyspark/shell.py", line 38, in <module>
    SparkContext._ensure_initialized()
  File "/home/welshamy/tools/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 292, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/home/welshamy/tools/anaconda3/lib/python3.6/site-packages/pyspark/java_gateway.py", line 93, in launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

If I'm using a Jupyter notebook, I also get this exception in the notebook:

Exception: Java gateway process exited before sending the driver its port number

All the solutions I found and followed [1, 2, 3] point toward environment variable definitions, but none of them worked for me.

2 Answers


PySpark 2.3.1 does not support JDK 9 or later. The Hadoop code bundled with Spark 2.3 parses the java.version system property assuming the old 1.x.y format, so the two-character version string "10" reported by JDK 10 causes the StringIndexOutOfBoundsException (begin 0, end 3, length 2) seen in your trace. Install JDK 8 and set the JAVA_HOME environment variable to point to it.

If you are using Ubuntu (or *nix):

  1. Install JDK 8

    sudo apt-get install openjdk-8-jdk
    
  2. Add the following line to your ~/.bashrc file:

    export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
    

Under Windows, install JDK 8 and set JAVA_HOME to its installation directory.
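In either case, JAVA_HOME must be visible to the process that launches PySpark (reload your shell with source ~/.bashrc, or open a new terminal). You can also set it from Python before the first session is created, since that is when the Java gateway (the JVM) is started. A minimal sketch, assuming the default path of Ubuntu's openjdk-8-jdk package (adjust for your platform):

    import os
    import pyspark

    # Assumption: default install location of Ubuntu's openjdk-8-jdk package.
    # This must run before the first SparkSession/SparkContext is created,
    # because the Java gateway inherits the environment at launch time.
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

    spark = pyspark.sql.SparkSession.builder.appName('test').getOrCreate()
    print(spark.version)  # e.g. 2.3.1, with no "Java gateway" error
    spark.stop()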




For macOS, I had to:

  • Install Java 8, e.g.

    brew install --cask adoptopenjdk/openjdk/adoptopenjdk8

  • Add JAVA_HOME to my ~/.zshrc (either of these lines works; the second resolves the path automatically):

    export JAVA_HOME='/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home'
    export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)

  • Remove any other JDK installs (NONJDK8.jdk stands for whatever non-Java-8 directory you find):

    ls /Library/Java/JavaVirtualMachines
    sudo rm -rf NONJDK8.jdk

The third step is important! It did not work until I removed the other-versioned JDKs.
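To confirm that the Java 8 JVM is the one PySpark will pick up, a quick sanity check from Python (a sketch; java and /usr/libexec/java_home are standard macOS tools, and the expected output assumes the adoptopenjdk8 install above):

    import os
    import subprocess

    # Should print the path of the 1.8 JDK configured in ~/.zshrc.
    print(os.environ.get("JAVA_HOME"))

    # 'java -version' prints the active JVM version (to stderr);
    # it should report 1.8.x, not 9, 10, or 11.
    subprocess.run(["java", "-version"])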

