How do I create a seaborn line plot for PySpark dataframe?

Question

I have a data frame with three columns and I am trying to draw a line plot with the Seaborn library, but it throws an error saying that 'DataFrame' object has no attribute 'get'. Here is my test data frame:

Age  variable   value
31   Overall    69.76751118
31   Potential  69.76751118
31   Growth     0
34   Overall    68.91176471
34   Potential  68.91176471
34   Growth     0
28   Overall    69.05803996
28   Potential  69.05803996
28   Growth     0.24643197

This is what I am trying to do with the seaborn line plot after reading in the CSV file:

test = spark.read.csv("test.csv", inferSchema=True, header=True)
sns.lineplot(x = "Age", y = "value", hue = "variable", data = test)

And the error that I get is this

AttributeError: 'DataFrame' object has no attribute 'get'

However, when I convert the data frame to a pandas data frame and use exactly the same seaborn code, it works:

test_df = test.toPandas()
sns.lineplot(x = "Age", y = "value", hue = "variable", data = test_df)

Am I doing anything wrong with Spark data frames?

Not possible: you need to convert to a pandas DataFrame first, which is going to be expensive.
– pault, Commented Nov 1, 2018 at 16:17
Is there an alternative other than converting to a pandas DataFrame?
– upendra, Commented Nov 1, 2018 at 18:15
Any solution will require the data to be on your local machine, which involves a collect-type operation.
– pault, Commented Nov 1, 2018 at 18:18
Ah, I see. I will give it a try and see if that works. Thanks.
– upendra, Commented Nov 1, 2018 at 18:42

Maviles · Accepted Answer · 2019-05-05 12:06:20Z

10

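A Spark DataFrame and a pandas DataFrame share much of the same functionality, but they differ in where and how they hold data: a Spark DataFrame is partitioned across the cluster and evaluated lazily, while a pandas DataFrame lives entirely in the memory of the local Python process. Seaborn expects the latter.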

This step is correct:

test_df = test.toPandas()

You will always need to collect the data to the driver before you can plot it with seaborn (or even matplotlib).


3 Comments

Nandeesh Over a year ago

Be very careful when doing this, as it pulls all the data to the driver node. On a large dataset it might cause an out-of-memory (OOM) error.

Pablo Boswell Over a year ago

@Nandeesh Exactly! This makes the distributed Spark paradigm pointless... what's the solution?

Maviles Over a year ago

@PabloBoswell, the problem is that the data reduction is generally done inside the plotting library. So the solution is this: instead of downloading millions of rows and plotting a histogram locally, do the data reduction in Spark, download only the ten or so aggregated rows, and create exactly the same view with a bar plot. Although the idea is simple, I never found a Python package dedicated to it, ready for histograms, densities, etc.
