Displaying Dataframe data using a bar chart spark Pandas

Question

I have the following dataframe df.

root
 |-- id: long (nullable = false)
 |-- subject: string (nullable = true)
 |-- Marks: long (nullable = true)
 |-- year: long (nullable = true)

I want to draw a bar chart using the columns subject, marks and year. For each year, I want to see the marks scored in each subject. I cannot figure out how to use three or more columns in a bar chart. I tried the code below to map all three columns. Is this the correct way?

    barchartPandas = df.toPandas()
    barchartPandas.pivot('year', 'subject', 'marks').plot.bar(stacked=False, legend=False, figsize=(20,10))

Also, if I have a large number of subjects, my bar chart is really small: each bar is so tiny that it is very difficult to read. How can I increase the size of each bar?

Not an answer, but I think it is good to add a limit before toPandas: df.limit(1000).toPandas()
– Epsi95, Commented Aug 31, 2021 at 17:20

werner · Accepted Answer · 2021-08-31 18:47:25Z

There are two options:

1. Convert the Spark dataframe into a Pandas dataframe first, then run the pivot operation on the Pandas dataframe:
   ```
   pandas_df = df.toPandas().pivot(index='year', columns='subject', values='marks')
   ```
   This moves all the data to the Spark driver first (when calling toPandas()) and then runs the pivot only on the driver. This is only a good approach when the amount of data is small and the driver can handle the unaggregated data.
2. Execute the pivot operation on the Spark dataframe first, then collect the aggregated result to the driver:
   ```
   pandas_df = df.groupBy("year").pivot("subject").max("marks").toPandas().set_index("year")
   ```
   In this second approach only the aggregated data is sent to the driver, while the heavy lifting is done by Spark in the Spark cluster (if one is available). Unless the amount of data is really small, the second approach should perform better.
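
To make the shape of the pivoted result concrete, here is a minimal pandas-only sketch. The data values are made up for illustration; only the column names come from the question. It also shows `pivot_table` with `aggfunc="max"`, which is my assumption of the closest pandas analogue to the Spark `groupBy(...).pivot(...).max(...)` call, since a plain `pivot` raises an error when a year/subject pair appears more than once.

```python
import pandas as pd

# Illustrative data only: the values are invented, the column names follow the question.
raw = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020],
    "subject": ["math", "physics", "math", "physics"],
    "marks": [70, 65, 80, 75],
})

# Plain pivot: one row per year, one column per subject.
# Requires every (year, subject) pair to be unique.
pivoted = raw.pivot(index="year", columns="subject", values="marks")

# pivot_table with aggfunc="max" aggregates duplicates instead of failing,
# mirroring the max("marks") aggregation used on the Spark side.
aggregated = raw.pivot_table(index="year", columns="subject", values="marks", aggfunc="max")

print(pivoted)
```

With unique year/subject pairs, both calls give the same table: `year` becomes the index and each subject becomes its own column, which is the layout `plot.bar` needs to draw one group of bars per year.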

The result of both approaches is the same Pandas dataframe. This dataframe can then be displayed as described in the question:

```
pandas_df.plot.bar(stacked=False, legend=False, figsize=(20,10))
```

You can control the width of the bars using the width parameter:

```
pandas_df.plot.bar(stacked=False, legend=False, figsize=(20,10), width=.1)
```
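
As a rough sketch of how the width parameter and the figure size interact: to my understanding, pandas treats width as the total width of each group of bars (default 0.5), so values closer to 1 make the bars wider, while small values such as .1 make them thinner. With many subjects it can also help to scale the figure width with the number of bars. The sizing rule and the sample values below are hypothetical, not part of the original answer, and the legend is left on here so the subjects can be told apart.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Illustrative pivoted frame: one row per year, one column per subject (made-up values).
pandas_df = pd.DataFrame(
    {"math": [70, 80], "physics": [65, 75], "chemistry": [60, 90]},
    index=pd.Index([2019, 2020], name="year"),
)

n_bars = pandas_df.shape[0] * pandas_df.shape[1]
# Hypothetical sizing rule: about 0.4 inches per bar, but never narrower than 8 inches.
fig_width = max(8, 0.4 * n_bars)

# width=0.8 lets each group of bars fill 80% of the space reserved for its year.
ax = pandas_df.plot.bar(stacked=False, figsize=(fig_width, 6), width=0.8)
ax.set_ylabel("marks")
```

In a notebook the chart renders from the returned axes; in a script, `ax.figure.savefig(...)` writes it to a file.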
