19

Is there any way to plot information from a Spark dataframe without converting the dataframe to pandas?

I did some online research but can't seem to find a way. I need to automatically save these plots as .pdf, so using the built-in visualization tool from Databricks would not work.

Right now, this is what I'm doing (as an example):

import matplotlib.pyplot as plt

# df = some Spark data frame
df = df.toPandas()
df.plot()
display(plt.show())

I want to produce line graphs, histograms, bar charts and scatter plots without converting my dataframe to a pandas dataframe. Thank you!

4 Answers

19

The display function is only available in Databricks notebooks; it is not part of Spark itself.


2

If the Spark dataframe 'df' (as asked in the question) is of type 'pyspark.pandas.frame.DataFrame', then try the following:

# Plot a column of the pandas-on-Spark dataframe
df.column_name.plot.pie()

where column_name is one of the columns in 'df'.

You can check the type of 'df' with:

type(df)

There are other plotting functions as well, such as:

pyspark.pandas.DataFrame.plot.line
pyspark.pandas.DataFrame.plot.bar
pyspark.pandas.DataFrame.plot.scatter

These are documented in the Apache Spark docs: https://spark.apache.org/docs/3.2.1/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.plot.bar.html
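Since the question asks for plots saved as .pdf, here is a minimal sketch of how the pandas-on-Spark plot functions could be combined with a file export. It assumes the plotting backend is switched to matplotlib (pandas-on-Spark defaults to plotly) and that each plot call then returns a matplotlib Axes. The helper names save_axes_as_pdf and save_basic_plots, the output folder "plots" and the column arguments are made up for illustration.

```python
from pathlib import Path


def save_axes_as_pdf(ax, out_dir, name):
    """Write the figure that owns `ax` to <out_dir>/<name>.pdf and return the path."""
    out_path = Path(out_dir) / f"{name}.pdf"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    ax.get_figure().savefig(out_path, format="pdf")
    return out_path


def save_basic_plots(psdf, x_col, y_col, out_dir="plots"):
    """Save line, bar, scatter and histogram plots of a pandas-on-Spark dataframe."""
    # Imported here so the helper above carries no Spark dependency.
    import pyspark.pandas as ps

    # Assumption: with this backend, plot calls return matplotlib Axes objects.
    ps.set_option("plotting.backend", "matplotlib")

    return [
        save_axes_as_pdf(psdf.plot.line(x=x_col, y=y_col), out_dir, "line"),
        save_axes_as_pdf(psdf.plot.bar(x=x_col, y=y_col), out_dir, "bar"),
        save_axes_as_pdf(psdf.plot.scatter(x=x_col, y=y_col), out_dir, "scatter"),
        save_axes_as_pdf(psdf[y_col].plot.hist(bins=10), out_dir, "hist"),
    ]
```

Calling save_basic_plots(df, 'x_column', 'y_column') on a pandas-on-Spark dataframe would then write four PDF files; the column names here are placeholders for columns in your own data.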

If the spark dataframe 'df' is of type 'pyspark.sql.dataframe.DataFrame', then try the following:

# Import pyspark.pandas
import pyspark.pandas as ps

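# Convert pyspark.sql.dataframe.DataFrame to pyspark.pandas.frame.DataFrame.
# Index by a different column than the one being plotted: a column moved into
# the index can no longer be accessed as temp_df.column_name.
# 'index_column_name' is a placeholder for that other column.
temp_df = ps.DataFrame(df).set_index('index_column_name')

# Plot a column of the pandas-on-Spark dataframe
temp_df.column_name.plot.pie()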

Note: There may be better ways to do this. If there are, kindly suggest them in the comments.


1

Just use the display(<dataframe-name>) function with a Spark dataframe, as described in the official Visualizations documentation.

Then select the plot type and adjust its options to show a chart from the Spark dataframe directly.

If you want the same chart that your pandas dataframe plot produces, your current approach is the only way.

2 Comments

This does not seem to work for me in Jupyter notebooks. Is this answer specifically for Databricks notebooks?
Yes, it is for Databricks only.
0

Azure Databricks introduced native plotting in PySpark with Databricks Runtime 17.0.

Note: With Spark native plotting, you no longer need to convert a data frame to a pandas object to create charts in code.


Example: using PySpark plotting with df.plot.line(), df.plot.bar() and df.plot.pie().

For more details, refer to the PySpark Native Plotting documentation.
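As a rough sketch of how the native plotting calls above might be used to write .pdf files, assuming a 'pyspark.sql.dataframe.DataFrame' on a runtime that ships native plotting: the plot methods are assumed to return Plotly figures, and write_image is Plotly's static export, which needs the separate kaleido package installed. The function name save_native_plots, the output folder and the column arguments are hypothetical.

```python
from pathlib import Path


def save_native_plots(df, x_col, y_col, out_dir="plots"):
    """Create line, bar and pie charts from a Spark dataframe and save each as a PDF."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Assumption: each call returns a Plotly figure built from the Spark
    # dataframe, with no toPandas() call in user code.
    figures = {
        "line": df.plot.line(x=x_col, y=y_col),
        "bar": df.plot.bar(x=x_col, y=y_col),
        "pie": df.plot.pie(x=x_col, y=y_col),
    }

    written = []
    for name, fig in figures.items():
        target = out / f"{name}.pdf"
        # Plotly infers the PDF format from the file extension (requires kaleido).
        fig.write_image(str(target))
        written.append(target)
    return written
```

For instance, save_native_plots(df, 'category', 'total') would write line.pdf, bar.pdf and pie.pdf into the plots folder, where 'category' and 'total' stand in for columns of your own dataframe.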
