
Is there a way to convert a Spark DF (not RDD) to a Pandas DF?

I tried the following:

var some_df = Seq(
  ("A", "no"),
  ("B", "yes"),
  ("B", "yes"),
  ("B", "no")
).toDF("user_id", "phone_number")

Code:

%pyspark
pandas_df = some_df.toPandas()

Error:

 NameError: name 'some_df' is not defined

Any suggestions?

  • You don't declare Python variables using var. Commented Jun 21, 2018 at 0:52
  • @user3483203 Yep, I created the data frame in the notebook with the Spark and Scala interpreter, and used %pyspark while trying to convert the DF into a pandas DF (a workaround is sketched after these comments). Commented Jun 21, 2018 at 1:04
  • Why are you mixing Scala and PySpark? Just use one. Commented Jun 21, 2018 at 3:06
  • @RameshMaharjan Yep, I use Scala. But I am trying to build visualizations for the columns in the Spark DF, for which I couldn't find relevant sources. Commented Jun 21, 2018 at 3:24
  • What kind of visualizations? Commented Jun 21, 2018 at 3:31
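
Since the NameError comes from the Scala variable not being visible to the %pyspark interpreter, one possible workaround (a minimal sketch, assuming a standard Zeppelin setup where the %spark and %pyspark interpreters share the same SparkSession) is to register the Scala DataFrame as a temp view and read it back from Python:

    %pyspark
    # Assumes the Scala paragraph has already run:
    #   some_df.createOrReplaceTempView("some_df")
    # The shared SparkSession makes that view visible here.
    pandas_df = spark.table("some_df").toPandas()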

3 Answers


The following should work:

Sample DataFrame

    # Build a Spark DataFrame from an RDD of tuples (sc is the SparkContext)
    some_df = sc.parallelize([
        ("A", "no"),
        ("B", "yes"),
        ("B", "yes"),
        ("B", "no")
    ]).toDF(["user_id", "phone_number"])

Converting DataFrame to Pandas DataFrame

    pandas_df = some_df.toPandas()

4 Comments

The toDF(...) of the answer is a red herring and should be removed for clarity, IMO. It's already present in the question. That is why I've updated the answer below instead.
What does "sc" stand for in this case?
@Gabriel It's the Spark context.
Thank you for the answer. I have tried applying this to my code on PySpark 3.2.0 and I get an error that a second parameter, c, is now required for the parallelize function, based on <spark.apache.org/docs/latest/api/python/reference/api/…>. I tried adding a constant c with example_df = sc.parallelize([("A", "no"), ("B", "yes"), ("B", "yes"), ("B", "no")], c=4).toDF(["user_id", "phone_number"]) only to get another error: AttributeError: 'list' object has no attribute 'defaultParallelism' (an alternative that avoids parallelize is sketched below).
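
If sc.parallelize misbehaves in a newer environment, a minimal alternative sketch (assuming a SparkSession is available as spark, which is standard in PySpark 2.x and later) builds the DataFrame without touching the SparkContext at all:

    # Build the sample DataFrame directly from the SparkSession (spark is assumed to exist)
    some_df = spark.createDataFrame(
        [("A", "no"),
         ("B", "yes"),
         ("B", "yes"),
         ("B", "no")],
        ["user_id", "phone_number"])

    pandas_df = some_df.toPandas()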

In my case, the following conversion from a Spark DataFrame to a pandas DataFrame worked:

pandas_df = spark_df.select("*").toPandas()

4 Comments

There is no need to put select("*") on the df unless you want specific columns. It is not going to affect performance, as select is lazily evaluated and doesn't actually do anything on its own.
For some reason, the solution from @Inna was the only one that worked on my dataframe. No conversion was possible except by selecting all columns beforehand. The data type was the same as usual, but I had previously applied a UDF.
I am using this, but most of my Spark decimal columns are converted to object in pandas instead of float. I have 100+ columns. Is there a way this type casting can be modified?
You can write a function and type cast it (a sketch follows below).
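
As a rough sketch of that suggestion (the double target type is illustrative; it relies on the standard pyspark.sql.types and pyspark.sql.functions APIs, and spark_df stands in for your DataFrame):

    from pyspark.sql.functions import col
    from pyspark.sql.types import DecimalType

    # Cast every DecimalType column to double so pandas receives float64 instead of object
    casted = spark_df.select([
        col(f.name).cast("double").alias(f.name) if isinstance(f.dataType, DecimalType) else col(f.name)
        for f in spark_df.schema.fields
    ])
    pandas_df = casted.toPandas()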

Converting a Spark DataFrame to pandas can take time if you have a large DataFrame. So you can use something like the following:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pd_df = df_spark.toPandas()

I have tried this in Databricks.

3 Comments

The spark.sql.execution.arrow.enabled option is highly recommended, especially with pyspark.pandas in the upcoming Spark 3.2 release.
The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
Can you please explain why it makes this more efficient?
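
For newer Spark versions, a sketch of the same idea with the renamed config key from the deprecation notice above (the effect is the same: Arrow transfers the data in columnar batches instead of collecting it row by row):

    # Spark 3.x name of the Arrow flag (the old key still works but is deprecated)
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    pd_df = df_spark.toPandas()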
