
I am running PySpark 3.0.1 for Hadoop 2.7 in a Zeppelin notebook. In general all is well; however, when I execute df.explain() on a DataFrame I get this error:

Fail to execute line 3: df.explain()
Traceback (most recent call last):
  File "/tmp/1610595392738-0/zeppelin_python.py", line 158, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 3, in <module>
  File "/usr/local/spark/python/pyspark/sql/dataframe.py", line 356, in explain
    print(self._sc._jvm.PythonSQLUtils.explainString(self._jdf.queryExecution(), explain_mode))
TypeError: 'JavaPackage' object is not callable

Has anyone come across and resolved this error before in the context of explain?

My spark/jars folder contents:

activation-1.1.1.jar
aircompressor-0.10.jar
algebra_2.12-2.0.0-M2.jar
alluxio-2.4.1-client.jar
antlr4-runtime-4.7.1.jar
antlr-runtime-3.5.2.jar
aopalliance-1.0.jar
aopalliance-repackaged-2.6.1.jar
apacheds-i18n-2.0.0-M15.jar
apacheds-kerberos-codec-2.0.0-M15.jar
api-asn1-api-1.0.0-M20.jar
api-util-1.0.0-M20.jar
arpack_combined_all-0.1.jar
arrow-format-0.15.1.jar
arrow-memory-0.15.1.jar
arrow-vector-0.15.1.jar
audience-annotations-0.5.0.jar
automaton-1.11-8.jar
avro-1.8.2.jar
avro-ipc-1.8.2.jar
avro-mapred-1.8.2-hadoop2.jar
bonecp-0.8.0.RELEASE.jar
breeze_2.12-1.0.jar
breeze-macros_2.12-1.0.jar
cats-kernel_2.12-2.0.0-M4.jar
chill_2.12-0.9.5.jar
chill-java-0.9.5.jar
commons-beanutils-1.9.4.jar
commons-cli-1.2.jar
commons-codec-1.10.jar
commons-collections-3.2.2.jar
commons-compiler-3.0.16.jar
commons-compress-1.8.1.jar
commons-configuration-1.6.jar
commons-crypto-1.0.0.jar
commons-dbcp-1.4.jar
commons-digester-1.8.jar
commons-httpclient-3.1.jar
commons-io-2.4.jar
commons-lang-2.6.jar
commons-lang3-3.9.jar
commons-logging-1.1.3.jar
commons-math3-3.4.1.jar
commons-net-3.1.jar
commons-pool-1.5.4.jar
commons-text-1.6.jar
compress-lzf-1.0.3.jar
core-1.1.2.jar
curator-client-2.7.1.jar
curator-framework-2.7.1.jar
curator-recipes-2.7.1.jar
datanucleus-api-jdo-4.2.4.jar
datanucleus-core-4.1.17.jar
datanucleus-rdbms-4.1.19.jar
derby-10.12.1.1.jar
dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar
flatbuffers-java-1.9.0.jar
generex-1.0.2.jar
gson-2.2.4.jar
guava-14.0.1.jar
guice-3.0.jar
guice-servlet-3.0.jar
hadoop-annotations-2.7.4.jar
hadoop-auth-2.7.4.jar
hadoop-client-2.7.4.jar
hadoop-common-2.7.4.jar
hadoop-hdfs-2.7.4.jar
hadoop-mapreduce-client-app-2.7.4.jar
hadoop-mapreduce-client-common-2.7.4.jar
hadoop-mapreduce-client-core-2.7.4.jar
hadoop-mapreduce-client-jobclient-2.7.4.jar
hadoop-mapreduce-client-shuffle-2.7.4.jar
hadoop-yarn-api-2.7.4.jar
hadoop-yarn-client-2.7.4.jar
hadoop-yarn-common-2.7.4.jar
hadoop-yarn-server-common-2.7.4.jar
hadoop-yarn-server-web-proxy-2.7.4.jar
HikariCP-2.5.1.jar
hive-beeline-2.3.7.jar
hive-cli-2.3.7.jar
hive-common-2.3.7.jar
hive-exec-2.3.7-core.jar
hive-jdbc-2.3.7.jar
hive-llap-common-2.3.7.jar
hive-metastore-2.3.7.jar
hive-serde-1.2.1.spark2.jar
hive-serde-2.3.7.jar
hive-shims-0.23-2.3.7.jar
hive-shims-1.2.1.spark2.jar
hive-shims-2.3.7.jar
hive-shims-common-2.3.7.jar
hive-shims-scheduler-2.3.7.jar
hive-storage-api-2.7.1.jar
hive-vector-code-gen-2.3.7.jar
hk2-api-2.6.1.jar
hk2-locator-2.6.1.jar
hk2-utils-2.6.1.jar
htrace-core-3.1.0-incubating.jar
httpclient-4.5.6.jar
httpcore-4.4.12.jar
istack-commons-runtime-3.0.8.jar
ivy-2.4.0.jar
jackson-annotations-2.10.0.jar
jackson-core-2.10.0.jar
jackson-core-asl-1.9.13.jar
jackson-databind-2.10.0.jar
jackson-dataformat-yaml-2.10.0.jar
jackson-datatype-jsr310-2.10.3.jar
jackson-jaxrs-1.9.13.jar
jackson-mapper-asl-1.9.13.jar
jackson-module-jaxb-annotations-2.10.0.jar
jackson-module-paranamer-2.10.0.jar
jackson-module-scala_2.12-2.10.0.jar
jackson-xc-1.9.13.jar
jakarta.activation-api-1.2.1.jar
jakarta.annotation-api-1.3.5.jar
jakarta.inject-2.6.1.jar
jakarta.validation-api-2.0.2.jar
jakarta.ws.rs-api-2.1.6.jar
jakarta.xml.bind-api-2.3.2.jar
janino-3.0.16.jar
javassist-3.25.0-GA.jar
javax.inject-1.jar
javax.jdo-3.2.0-m3.jar
javax.servlet-api-3.1.0.jar
javolution-5.5.1.jar
jaxb-api-2.2.2.jar
jaxb-runtime-2.3.2.jar
jcl-over-slf4j-1.7.30.jar
jdo-api-3.0.1.jar
jersey-client-2.30.jar
jersey-common-2.30.jar
jersey-container-servlet-2.30.jar
jersey-container-servlet-core-2.30.jar
jersey-hk2-2.30.jar
jersey-media-jaxb-2.30.jar
jersey-server-2.30.jar
jetty-6.1.26.jar
jetty-sslengine-6.1.26.jar
jetty-util-6.1.26.jar
JLargeArrays-1.5.jar
jline-2.14.6.jar
joda-time-2.10.5.jar
jodd-core-3.5.2.jar
jpam-1.1.jar
json-1.8.jar
json4s-ast_2.12-3.6.6.jar
json4s-core_2.12-3.6.6.jar
json4s-jackson_2.12-3.6.6.jar
json4s-scalap_2.12-3.6.6.jar
jsp-api-2.1.jar
jsr305-3.0.0.jar
jta-1.1.jar
JTransforms-3.1.jar
jul-to-slf4j-1.7.30.jar
kryo-shaded-4.0.2.jar
kubernetes-client-4.9.2.jar
kubernetes-model-4.9.2.jar
kubernetes-model-common-4.9.2.jar
leveldbjni-all-1.8.jar
libfb303-0.9.3.jar
libthrift-0.12.0.jar
log4j-1.2.17.jar
logging-interceptor-3.12.6.jar
lz4-java-1.7.1.jar
machinist_2.12-0.6.8.jar
macro-compat_2.12-1.1.1.jar
mesos-1.4.0-shaded-protobuf.jar
metrics-core-4.1.1.jar
metrics-graphite-4.1.1.jar
metrics-jmx-4.1.1.jar
metrics-json-4.1.1.jar
metrics-jvm-4.1.1.jar
minlog-1.3.0.jar
netty-all-4.1.47.Final.jar
objenesis-2.5.1.jar
okhttp-3.12.6.jar
okio-1.15.0.jar
opencsv-2.3.jar
orc-core-1.5.10.jar
orc-mapreduce-1.5.10.jar
orc-shims-1.5.10.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.3.jar
paranamer-2.8.jar
parquet-column-1.10.1.jar
parquet-common-1.10.1.jar
parquet-encoding-1.10.1.jar
parquet-format-2.4.0.jar
parquet-hadoop-1.10.1.jar
parquet-jackson-1.10.1.jar
postgresql-42.2.14.jar
protobuf-java-2.5.0.jar
py4j-0.10.9.jar
pyrolite-4.30.jar
RoaringBitmap-0.7.45.jar
scala-collection-compat_2.12-2.1.1.jar
scala-compiler-2.12.10.jar
scala-library-2.12.10.jar
scala-parser-combinators_2.12-1.1.2.jar
scala-reflect-2.12.10.jar
scala-xml_2.12-1.2.0.jar
shapeless_2.12-2.3.3.jar
shims-0.7.45.jar
slf4j-api-1.7.30.jar
slf4j-log4j12-1.7.30.jar
snakeyaml-1.24.jar
snappy-java-1.1.7.5.jar
spark-catalyst_2.12-3.0.1.jar
spark-core_2.12-3.0.1.jar
spark-graphx_2.12-3.0.1.jar
spark-hive_2.12-3.0.1.jar
spark-hive-thriftserver_2.12-3.0.1.jar
spark-kubernetes_2.12-3.0.1.jar
spark-kvstore_2.12-3.0.1.jar
spark-launcher_2.12-3.0.1.jar
spark-mesos_2.12-3.0.1.jar
spark-mllib_2.12-3.0.1.jar
spark-mllib-local_2.12-3.0.1.jar
spark-network-common_2.12-3.0.1.jar
spark-network-shuffle_2.12-3.0.1.jar
spark-repl_2.12-3.0.1.jar
spark-sketch_2.12-3.0.1.jar
spark-sql_2.12-3.0.1.jar
spark-streaming_2.12-3.0.1.jar
spark-tags_2.12-3.0.1.jar
spark-tags_2.12-3.0.1-tests.jar
spark-unsafe_2.12-3.0.1.jar
spark-yarn_2.12-3.0.1.jar
spire_2.12-0.17.0-M1.jar
spire-macros_2.12-0.17.0-M1.jar
spire-platform_2.12-0.17.0-M1.jar
spire-util_2.12-0.17.0-M1.jar
ST4-4.0.4.jar
stax-api-1.0.1.jar
stax-api-1.0-2.jar
stream-2.9.6.jar
super-csv-2.2.0.jar
threeten-extra-1.5.0.jar
transaction-api-1.1.jar
univocity-parsers-2.9.0.jar
velocity-1.5.jar
xbean-asm7-shaded-4.15.jar
xercesImpl-2.12.0.jar
xml-apis-1.4.01.jar
xmlenc-0.52.jar
xz-1.5.jar
zjsonpatch-0.3.0.jar
zookeeper-3.4.14.jar
zstd-jni-1.4.4-3.jar

I gather the error is saying something might not be in my classpath, but I can't think what that might be ...

  • Show the code that you're using to create the DataFrame. Commented Jan 17, 2021 at 14:50

5 Answers


I ran into this same issue on AWS with EMR 6.2.0 (which, coincidentally, also runs Spark 3.0.1) and Jupyter notebooks. The issue appears to be related to how PySpark is initialized, specifically the py4j Java imports.

The following import is supposed to be executed while the notebook kernel is being initialized, but it seems to be skipped. You only need to run it once per session.

from py4j.java_gateway import java_import
# Make the short name PythonSQLUtils resolve on the py4j JVM view,
# which is what DataFrame.explain() uses under the hood:
java_import(spark._sc._jvm, "org.apache.spark.sql.api.python.*")

Now df.explain() works as expected.

For future reference: when you see 'JavaPackage' object is not callable, it usually means the target Java class was not found. Either the class isn't on the classpath or the expected import hasn't been run.
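
If you want to check whether a given name actually resolves (and therefore whether the java_import is still needed), you can walk the name on the py4j JVM view the same way PySpark does. A minimal sketch, where jvm_name_resolves is a hypothetical helper name:

from py4j.java_gateway import JavaClass

def jvm_name_resolves(spark, name):
    # Resolve `name` on the py4j JVM view exactly as PySpark internals do.
    # Unresolved segments come back as JavaPackage objects; a real class
    # arrives as a JavaClass. Short names like "PythonSQLUtils" only
    # resolve after a matching java_import.
    obj = spark._sc._jvm
    for part in name.split("."):
        obj = getattr(obj, part)
    return isinstance(obj, JavaClass)

# False in a session hit by this bug, True once the import has been run:
jvm_name_resolves(spark, "PythonSQLUtils")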


4 Comments

Amazing. This fixed the issue as described above for me. I have EMR 6.5 and Spark 3.1.2. Is this a bug in EMR?
This is a bug in Livy, which EMR uses under the hood; Livy has looked dead for two years. With Scala 2.12.10 and Spark 3.0.3 everything works, but newer Spark versions produce this error. I hacked together a Livy build on Scala 2.12.15 to be able to use Spark 3.3 and got the same problem; using this import fixes it.
I'm trying to implement this in emr-containers, to no avail (yet). It's a WIP between here and this source: aws.amazon.com/blogs/big-data/…
Is there any issue with calling this more than once? Any way to check whether it is necessary to call this?

For Mac/Linux users: edit the file ~/.bash_profile or ~/.zshrc to add environment variables:

export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_SUBMIT_ARGS="--master local[*]"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
# Shell assignments don't expand globs, so point at the py4j zip explicitly
# (py4j-0.10.9-src.zip matches the py4j-0.10.9.jar bundled with Spark 3.0.1):
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Then reload them: source ~/.bash_profile or source ~/.zshrc.
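
To confirm the variables took effect, here is a quick sanity check to run from a fresh shell or kernel; the expected values in the comments are assumptions based on the question's setup:

import os
import pyspark

print(os.environ.get("SPARK_HOME"))  # e.g. /usr/local/spark
print(pyspark.__version__)           # should match the jars, e.g. 3.0.1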

2 Comments

After adding PYTHONPATH, everything works! Thanks.
For Windows, there is no need to add Spark-related environment variables; PySpark comes with them.

PySpark loads Java classes through the Spark context's JVM gateway:

sc._jvm.com.package.Class

If the jar containing that class is not supplied to Zeppelin or Jupyter in its config, this error will be thrown.
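
To see the failure mode concretely: with an active SparkContext sc, an unresolvable name comes back as a py4j JavaPackage rather than a class, and calling it reproduces the exact error from the question (the class name below is hypothetical):

pkg = sc._jvm.com.example.NotOnClasspath  # resolves to a JavaPackage, not a class
pkg()  # TypeError: 'JavaPackage' object is not callable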

Running from spark-submit

If your PySpark job requires additional jars, pass them to spark-submit with the --jars option. Note that spark-submit options must come before the application script; anything after the script is treated as arguments to your application.

e.g.

spark-submit --jars abc.jar,xyz.jar pyspark-job.py
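
If you create the session in code rather than through spark-submit, the equivalent of --jars is the spark.jars config, set before the JVM starts. A minimal sketch with placeholder jar paths:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("extra-jars-example")
    # Comma-separated list, same format as the --jars CLI option:
    .config("spark.jars", "/path/to/abc.jar,/path/to/xyz.jar")
    .getOrCreate()
)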

This answer explains the mechanism of Java classes being called from PySpark.



I was having the same issue. I upgraded Spark from 3.3.1 to the latest version at the time, 3.4.1, and the problem was resolved.

Assuming macOS:

  1. Download the latest Spark version: https://spark.apache.org/downloads.html
  2. Untar the file:
     sudo tar -zxvf spark-xxxx.tgz
  3. Add SPARK_HOME to your .bash_profile/.zshrc file:
     export SPARK_HOME=/path/to/the/exported/dir
  4. Add it to PATH in the same file:
     export PATH=$PATH:$SPARK_HOME/bin
  5. Restart the terminal and IDE and rerun the app (see the check below).
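
After restarting, one way to confirm which Spark installation the Python environment actually picks up is findspark (also used in the last answer below); a minimal sketch:

import findspark

findspark.init()         # locates Spark via SPARK_HOME
print(findspark.find())  # prints the Spark installation path in use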



Hope the code below helps you, friends!

# Set the PySpark environment variables
import os
import sys
# Adjust these paths to match your local Java and Spark installations:
os.environ['JAVA_HOME'] = 'C:/Program Files/Java/jdk1.8.0_202'
os.environ['SPARK_HOME'] = "C:/spk_hadoop/spark-3.5.6-bin-hadoop3"
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON_OPTS'] = 'lab'
os.environ['SPARK_VERSION'] = '3.5'

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("PySparkExample").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Show the DataFrame
df.show()

spark.stop()
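
To tie this back to the original question, here is a minimal end-to-end check that the py4j bridge is healthy (assumes the environment variables set above; run it in a fresh session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExplainCheck").getOrCreate()
df = spark.createDataFrame([("Alice", 25)], ["Name", "Age"])
df.explain()  # prints the physical plan; a broken classpath raises TypeError here
spark.stop()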

