
I have the following dataframe df.

root
 |-- id: long (nullable = false)
 |-- subject: string (nullable = true)
 |-- Marks: long (nullable = true)
 |-- year: long (nullable = true)

I want to draw a bar chart using the columns subject, marks and year. For each year, I want to see the marks scored in each subject. I am unable to figure out how to use three or more columns to draw a bar chart. I tried the code below to map all three columns. Is this the correct way?

    barchartPandas = df.toPandas()
    barchartPandas.pivot('year', 'subject', 'marks').plot.bar(stacked=False, legend=False, figsize=(20,10))

Also, if I have a large number of subjects, the bar chart is really small: each bar is so tiny that it is very difficult to read. How can I increase the size of each bar?

  • Not an answer, but I think it is good to add a limit before toPandas: df.limit(1000).toPandas() Commented Aug 31, 2021 at 17:20

1 Answer

There are two options:

  1. Convert the Spark dataframe into a Pandas dataframe first and then run the pivot operation on the Pandas dataframe:

    pandas_df = df.toPandas().pivot(index='year', columns='subject', values='Marks')

    This moves all the data to the Spark driver first (when calling toPandas()) and then reshapes it on the driver only. Column names are case-sensitive in Pandas, so the name must match the schema (Marks), and Pandas 2.0 and later accept the pivot arguments only as keywords. This is a good approach only when the amount of data is small and the driver can handle the unaggregated data.

  2. Execute the pivot operation first on the Spark dataframe and then collect the aggregated result to the driver:

    pandas_df = df.groupBy("year").pivot("subject").max("marks").toPandas().set_index("year")
    

    In this second approach only the aggregated data is sent to the driver while the heavy lifting is done by Spark in the Spark cluster (if one is available). Unless the amount of data is really small, the second approach should perform better.
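To see how the two options relate, here is a minimal pandas-only sketch. The sample rows are made up for illustration, and pivot_table with aggfunc="max" is used as a local stand-in for the Spark groupBy/pivot/max chain, not as a replacement for it.

```python
import pandas as pd

# Hypothetical sample data; column names follow the schema in the question.
sample = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020],
    "subject": ["math", "physics", "math", "physics"],
    "Marks": [70, 65, 80, 75],
})

# Option 1 style: a plain reshape, which needs unique (year, subject) pairs.
reshaped = sample.pivot(index="year", columns="subject", values="Marks")

# Pandas analogue of groupBy("year").pivot("subject").max("marks").
aggregated = sample.pivot_table(
    index="year", columns="subject", values="Marks", aggfunc="max"
)

# Add a second math mark for 2019 to create a duplicate pair.
with_dupes = pd.concat(
    [sample, sample.iloc[[0]].assign(Marks=90)], ignore_index=True
)

try:
    with_dupes.pivot(index="year", columns="subject", values="Marks")
except ValueError as err:
    # A plain pivot cannot decide which of the duplicate values to keep.
    print("pivot failed:", err)

# The aggregating variant keeps the maximum per (year, subject) pair.
deduped = with_dupes.pivot_table(
    index="year", columns="subject", values="Marks", aggfunc="max"
)
print(deduped)
```

If the data can contain several rows per year and subject, the aggregating form is the safer choice in either engine.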

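Both approaches produce the same Pandas dataframe, provided each (year, subject) pair occurs only once; with duplicates, the Pandas pivot in the first option raises an error, while the second option keeps the maximum. This dataframe can then be plotted as described in the question: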

pandas_df.plot.bar(stacked=False, legend=False, figsize=(20,10)) 

You can control the width of the bars using the width parameter. Pandas defaults to 0.5, so a larger value such as 0.9 makes each bar wider:

pandas_df.plot.bar(stacked=False, legend=False, figsize=(20,10), width=0.9)
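With many subjects, the bars in each year group share the group's width, so the figure itself may need to grow as well. A minimal sketch of one way to scale figsize with the number of bars follows; the helper name bar_figsize and its defaults (0.3 inches per bar, a minimum width of 8 inches) are assumptions for illustration, not part of Pandas or Matplotlib.

```python
def bar_figsize(n_groups, n_series, inches_per_bar=0.3, height=10, min_width=8):
    """Return a (width, height) tuple in inches for a grouped bar chart.

    n_groups is the number of x-axis groups (years) and n_series is the
    number of bars per group (subjects). All defaults are hypothetical
    starting points and will likely need tuning for a real chart.
    """
    total_bars = n_groups * n_series
    # Never go below min_width, so small charts keep a sensible size.
    width = max(min_width, total_bars * inches_per_bar)
    return (width, height)


# Example: 3 years with 40 subjects each.
print(bar_figsize(3, 40))

# Possible use with the pivoted dataframe from the answer:
# n_years, n_subjects = pandas_df.shape
# pandas_df.plot.bar(figsize=bar_figsize(n_years, n_subjects), width=0.9)
```

The pivoted dataframe has one row per year and one column per subject, so its shape supplies both arguments directly.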