
I am trying to create and analyze a DataFrame in PySpark in a Jupyter Notebook.

Below is my code in the Jupyter Notebook.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
   .master("local") \
   .appName("Neural Network Model") \
   .config("spark.executor.memory", "6gb") \
   .getOrCreate()

I was able to start Spark Session.

df1 = spark.createDataFrame([('John', 56, 80)])
print(df1.dtypes)
print(df1)
print(df1.show())

I was able to create the DataFrame df1, but somehow I got an error message when I tried to use the DataFrame function df1.show():

Py4JJavaError                             Traceback (most recent call last)
      2 print(df1.dtypes)
      3 print(df1)
----> 4 print(df1.show())

Py4JJavaError: An error occurred while calling o501.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 22, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

Could you help me fix this issue? I am not sure if it is a system issue or my code.

Thanks!!!

1 Answer


df1.show() just displays the content of the DataFrame. It's a function that returns Unit (it does not return a value), so there is nothing for print() to print — in a notebook, print(df1.show()) shows the table and then prints None.

If you want to see the content of df1, you just need to call

df1.show()

without print()
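The side-effect behavior can be seen without Spark at all. Here is a minimal plain-Python sketch (the show() function below is a stand-in I wrote to mimic DataFrame.show(), not the real PySpark code): a function that prints its output and returns nothing yields None when you wrap it in print().

```python
def show():
    """Mimics DataFrame.show(): prints a table as a side effect, returns nothing."""
    print("+----+---+---+")
    print("|John| 56| 80|")
    print("+----+---+---+")

result = show()   # the table is printed here, as a side effect
print(result)     # prints: None — show() has no return value
```

This is why the extra None appears in the notebook output: it is the return value of show() being echoed by print(), not part of the table.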

This is actually the implementation of show():

def show(): Unit = show(20)

def show(numRows: Int): Unit = show(numRows, truncate = true)

def show(numRows: Int, truncate: Boolean): Unit = if (truncate) {
  println(showString(numRows, truncate = 20))
} else {
  println(showString(numRows, truncate = 0))
}

1 Comment

Thanks for your feedback. I also tried df1.show() and got the same error message. I also checked whether it happens only for .show(): when I tried df1.collect(), I got the same error. I am concerned that it is due to some system setup. I set up the Spark environment using jdk1.8.0_201 and spark-2.4.0-bin-hadoop2.7, and I integrated Spark with the Jupyter Notebook.
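Since the same error occurs for both show() and collect(), the failure happens when Spark launches Python worker processes, which points at the environment rather than the code. A commonly reported workaround for "Python worker failed to connect back" with Spark 2.4 is to point Spark at the same Python interpreter the notebook uses before creating the session. This is a sketch under the assumption that a mismatched or unfound interpreter is the cause (the paths are placeholders to adjust for your machine):

```shell
# Tell Spark which Python to use for both the driver and the workers.
# These must be set before the SparkSession / JVM starts.
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

# Then launch the notebook from the same shell so it inherits them:
jupyter notebook
```

On Windows the equivalent is `set PYSPARK_PYTHON=python` in the shell (or a system environment variable); the key point is that the interpreter on that path must be reachable by the Spark worker processes.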
