How to plot using matplotlib and pandas in pyspark environment?

Question

I have a very large PySpark dataframe. I took a sample of it and converted the sample into a pandas dataframe:

sample = heavy_pivot.sample(False, fraction = 0.2, seed = None)
sample_pd = sample.toPandas()

The dataframe looks like this:

sample_pd[['client_id', 'beer_freq']].head(10)


  client_id  beer_freq
0   1000839   0.000000
1   1002185   0.000000
2   1003366   1.000000
3   1005218   1.000000
4   1005483   1.000000
5    100964   0.434783
6    101272   0.166667
7   1017462   0.000000
8   1020561   0.000000
9   1023646   0.000000

I want to plot a histogram of the column "beer_freq":

import matplotlib.pyplot as plt
plt.switch_backend('agg')

sample_pd.hist('beer_freq', bins = 100)

The plot did not show up... It gives results like this:

 >>>array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f60f6fd0750>]], dtype=object)

It seems that I cannot use ordinary Python code with matplotlib and a pandas dataframe to plot figures in a PySpark environment.

If I call plt.show(), nothing happens.

@DavidG Yes, if I add plt.show(), nothing happens. It is so weird.
– Elsa Li, commented May 8, 2018 at 19:15

Gaurav · Accepted Answer · 2019-02-18 06:14:16Z

2

%matplotlib inline is not supported in Databricks. You can display matplotlib figures using display(). For an example, see https://docs.databricks.com/user-guide/visualizations/matplotlib-and-ggplot.html
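
As a minimal sketch of that approach (not taken from the linked page), the histogram can be drawn on an explicit figure, which is then passed to display(). The small sample_pd below is a stand-in built from the values in the question. display() only exists as a notebook builtin, so the fallback that saves a PNG to a temporary path is an assumption added here so the snippet also runs elsewhere.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen; no GUI is needed on a cluster
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the sampled pandas dataframe from the question.
sample_pd = pd.DataFrame(
    {"beer_freq": [0.0, 0.0, 1.0, 1.0, 1.0, 0.434783, 0.166667, 0.0, 0.0, 0.0]}
)

# Draw on an explicit figure so there is an object to hand to display().
fig, ax = plt.subplots()
sample_pd["beer_freq"].hist(bins=100, ax=ax)
ax.set_xlabel("beer_freq")

try:
    display(fig)  # notebook builtin (Databricks); undefined in a plain Python shell
except NameError:
    # Hypothetical fallback for non-notebook environments: write the figure to disk.
    fig.savefig(os.path.join(tempfile.gettempdir(), "beer_freq_hist.png"))
```

Holding on to fig rather than relying on the implicit current figure makes it clear which plot is being displayed.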

1 Comment

David L Over a year ago

As of DBR 6.4+, %matplotlib inline is supported, so you no longer need to call display().

David L · Accepted Answer · 2020-04-23 18:51:03Z

1

As of DBR 6.4+, you can use %matplotlib inline.

%matplotlib inline
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.hist('sepal_width', bins = 100)

Anxy · Accepted Answer · 2018-08-20 09:18:54Z

0

Try the following:

import matplotlib.pyplot as plt
%matplotlib inline

Yasi Klingler · Accepted Answer · 2019-08-22 11:57:42Z

0

It is not accessible. As Gaurav mentioned, use display() as follows:

col_df = heavy_pivot.select('beer_freq')
display(col_df)

That way, you don't need to convert it to a pandas dataframe, and the final plot looks the same. After displaying, use the plot button under the output to choose a histogram.

source:

Pan · Accepted Answer · 2021-07-18 04:14:09Z

0

It seems that you are using an IPython/Python shell in a terminal. Since you are using Spark, I guess your shell environment is running on a remote server. To plot on a remote server, you need to enable X11 forwarding. See https://adoni.github.io/2019/01/08/plot-on-server/#through-x11 for reference.

Other options:

Install Jupyter Notebook or JupyterLab, and plot in a notebook environment.
Save the plot to a file with plt.savefig('plot.png'), then download the file to your local machine.
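
The save-to-file option could look roughly like the sketch below. It assumes the non-interactive 'agg' backend from the question, which renders to files and shows no window, so plt.show() has nothing to display. The sample_pd data and the output path are placeholders chosen for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to files, opens no window
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the sampled pandas dataframe from the question.
sample_pd = pd.DataFrame(
    {"beer_freq": [0.0, 0.0, 1.0, 1.0, 1.0, 0.434783, 0.166667, 0.0, 0.0, 0.0]}
)

# DataFrame.hist returns a 2-D array of Axes (the array printed in the question);
# every Axes keeps a reference to the figure it belongs to.
axes = sample_pd.hist("beer_freq", bins=100)
fig = axes[0, 0].figure

# Placeholder output location; pick any path that is writable on the server.
out_path = os.path.join(tempfile.gettempdir(), "beer_freq_hist.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)  # release the figure's memory in long-running sessions
print(out_path)
```

The printed path is the file to copy back to the local machine, for example with scp.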
