
Is there a way to convert a Spark DF (not RDD) to a Pandas DF?

I tried the following:

var some_df = Seq(
  ("A", "no"),
  ("B", "yes"),
  ("B", "yes"),
  ("B", "no")
).toDF("user_id", "phone_number")

Code:

%pyspark
pandas_df = some_df.toPandas()

Error:

 NameError: name 'some_df' is not defined

Any suggestions?

  • You don't declare Python variables using var. Commented Jun 21, 2018 at 0:52
  • @user3483203 Yep, I created the data frame in the notebook with the Spark and Scala interpreter, and used %pyspark while trying to convert the DF into a pandas DF (a workaround is sketched after these comments). Commented Jun 21, 2018 at 1:04
  • Why are you mixing Scala and PySpark? Just use one. Commented Jun 21, 2018 at 3:06
  • @RameshMaharjan Yep, I use Scala. But I am trying to build visualizations for the columns in the Spark DF, for which I couldn't find relevant sources. Commented Jun 21, 2018 at 3:24
  • What kind of visualizations? Commented Jun 21, 2018 at 3:31
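
Since the NameError comes from the Scala variable not being visible to the %pyspark interpreter, one possible workaround (a minimal sketch, assuming a standard Zeppelin setup where the %spark and %pyspark interpreters share the same SparkSession) is to register the Scala DataFrame as a temp view and read it back from Python:

    %pyspark
    # Assumes the Scala paragraph has already run:
    #   some_df.createOrReplaceTempView("some_df")
    # The shared SparkSession makes that view visible here.
    pandas_df = spark.table("some_df").toPandas()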

3 Answers


The following should work:

Sample DataFrame

    # Build a Spark DataFrame from an RDD of tuples (sc is the SparkContext)
    some_df = sc.parallelize([
        ("A", "no"),
        ("B", "yes"),
        ("B", "yes"),
        ("B", "no")
    ]).toDF(["user_id", "phone_number"])

Converting DataFrame to Pandas DataFrame

    pandas_df = some_df.toPandas()

4 Comments

The toDF(...) of the answer is a red herring and should be removed for clarity, IMO. It's already present in the question. That is why I've updated the answer below instead.
What does "sc" stand for in this case?
@Gabriel It's the Spark context.
Thank you for the answer. I have tried applying this to my code on PySpark 3.2.0 and I get an error that a second parameter, c, is now required for the parallelize function, based on <spark.apache.org/docs/latest/api/python/reference/api/…>. I tried adding a constant c with example_df = sc.parallelize([("A", "no"), ("B", "yes"), ("B", "yes"), ("B", "no")], c=4).toDF(["user_id", "phone_number"]) only to get another error: AttributeError: 'list' object has no attribute 'defaultParallelism' (an alternative that avoids parallelize is sketched below).
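
If sc.parallelize misbehaves in a newer environment, a minimal alternative sketch (assuming a SparkSession is available as spark, which is standard in PySpark 2.x and later) builds the DataFrame without touching the SparkContext at all:

    # Build the sample DataFrame directly from the SparkSession (spark is assumed to exist)
    some_df = spark.createDataFrame(
        [("A", "no"),
         ("B", "yes"),
         ("B", "yes"),
         ("B", "no")],
        ["user_id", "phone_number"])

    pandas_df = some_df.toPandas()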

In my case, the following conversion from a Spark DataFrame to a pandas DataFrame worked:

pandas_df = spark_df.select("*").toPandas()

4 Comments

There is no need to put select("*") on the df unless you want specific columns. It is not going to affect performance, as select is lazily evaluated and doesn't actually do anything on its own.
For some reason, the solution from @Inna was the only one that worked on my dataframe. No conversion was possible except by selecting all columns beforehand. The data type was the same as usual, but I had previously applied a UDF.
I am using this, but most of my Spark decimal columns are converted to object in pandas instead of float. I have 100+ columns. Is there a way this type casting can be modified?
You can write a function and type cast it (a sketch follows below).
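
As a rough sketch of that suggestion (the double target type is illustrative; it relies on the standard pyspark.sql.types and pyspark.sql.functions APIs, and spark_df stands in for your DataFrame):

    from pyspark.sql.functions import col
    from pyspark.sql.types import DecimalType

    # Cast every DecimalType column to double so pandas receives float64 instead of object
    casted = spark_df.select([
        col(f.name).cast("double").alias(f.name) if isinstance(f.dataType, DecimalType) else col(f.name)
        for f in spark_df.schema.fields
    ])
    pandas_df = casted.toPandas()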

Converting a Spark DataFrame to pandas can take time if you have a large DataFrame. So you can use something like the following:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pd_df = df_spark.toPandas()

I have tried this in Databricks.

3 Comments

The spark.sql.execution.arrow.enabled option is highly recommended, especially with pyspark.pandas in the upcoming Spark 3.2 release.
The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
Can you please explain why it makes this more efficient?
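
For newer Spark versions, a sketch of the same idea with the renamed config key from the deprecation notice above (the effect is the same: Arrow transfers the data in columnar batches instead of collecting it row by row):

    # Spark 3.x name of the Arrow flag (the old key still works but is deprecated)
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    pd_df = df_spark.toPandas()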
