
I am trying to create and analyze a DataFrame in PySpark in a Jupyter Notebook.

Below is my code in the Jupyter Notebook.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
   .master("local") \
   .appName("Neural Network Model") \
   .config("spark.executor.memory", "6gb") \
   .getOrCreate()

I was able to start Spark Session.

df1 = spark.createDataFrame([('John', 56, 80)])
print(df1.dtypes)
print(df1)
print(df1.show())

I was able to create the DataFrame df1, but somehow I got an error message when I tried to use the DataFrame function df1.show():

Py4JJavaError                             Traceback (most recent call last)
      2 print(df1.dtypes)
      3 print(df1)
----> 4 print(df1.show())

Py4JJavaError: An error occurred while calling o501.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 22, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

Could you help me fix this issue? I am not sure if it is a system issue or my code.

Thanks!!!

1 Answer


df1.show() just displays the content of the DataFrame. It's a function that returns Unit (it does not return a value), so there is nothing for print() to print — in a notebook, print(df1.show()) shows the table and then prints None.

If you want to see the content of df1, you just need to call

df1.show()

without print()
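The side-effect behavior can be seen without Spark at all. Here is a minimal plain-Python sketch (the show() function below is a stand-in I wrote to mimic DataFrame.show(), not the real PySpark code): a function that prints its output and returns nothing yields None when you wrap it in print().

```python
def show():
    """Mimics DataFrame.show(): prints a table as a side effect, returns nothing."""
    print("+----+---+---+")
    print("|John| 56| 80|")
    print("+----+---+---+")

result = show()   # the table is printed here, as a side effect
print(result)     # prints: None — show() has no return value
```

This is why the extra None appears in the notebook output: it is the return value of show() being echoed by print(), not part of the table.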

This is actually the implementation of show():

def show(): Unit = show(20)

def show(numRows: Int): Unit = show(numRows, truncate = true)

def show(numRows: Int, truncate: Boolean): Unit = if (truncate) {
  println(showString(numRows, truncate = 20))
} else {
  println(showString(numRows, truncate = 0))
}

1 Comment

Thanks for your feedback. I also tried df1.show() and got the same error message. I also checked whether it happens only for .show(): when I tried df1.collect(), I got the same error. I am concerned that it is due to some system setup. I set up the Spark environment using jdk1.8.0_201 and spark-2.4.0-bin-hadoop2.7, and I integrated Spark with the Jupyter Notebook.
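Since the same error occurs for both show() and collect(), the failure happens when Spark launches Python worker processes, which points at the environment rather than the code. A commonly reported workaround for "Python worker failed to connect back" with Spark 2.4 is to point Spark at the same Python interpreter the notebook uses before creating the session. This is a sketch under the assumption that a mismatched or unfound interpreter is the cause (the paths are placeholders to adjust for your machine):

```shell
# Tell Spark which Python to use for both the driver and the workers.
# These must be set before the SparkSession / JVM starts.
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

# Then launch the notebook from the same shell so it inherits them:
jupyter notebook
```

On Windows the equivalent is `set PYSPARK_PYTHON=python` in the shell (or a system environment variable); the key point is that the interpreter on that path must be reachable by the Spark worker processes.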
