1

I have a very large PySpark DataFrame. I took a sample and converted it to a pandas DataFrame:

sample = heavy_pivot.sample(False, fraction = 0.2, seed = None)
sample_pd = sample.toPandas()

The dataframe looks like this:

sample_pd[['client_id', 'beer_freq']].head(10)


  client_id  beer_freq
0   1000839   0.000000
1   1002185   0.000000
2   1003366   1.000000
3   1005218   1.000000
4   1005483   1.000000
5    100964   0.434783
6    101272   0.166667
7   1017462   0.000000
8   1020561   0.000000
9   1023646   0.000000

I want to plot a histogram of the "beer_freq" column:

import matplotlib.pyplot as plt
plt.switch_backend('agg')

sample_pd.hist('beer_freq', bins = 100)

The plot does not show up. Instead, I get output like this:

 >>>array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f60f6fd0750>]], dtype=object)

It seems that I cannot use ordinary matplotlib and pandas code to plot figures in a PySpark environment.

If I call plt.show(), nothing happens.

2
  • Did you call plt.show()? Commented May 8, 2018 at 19:13
  • @DavidG Yes. If I add plt.show(), nothing happens. It is so weird. Commented May 8, 2018 at 19:15

5 Answers

2

%matplotlib inline is not supported in Databricks. You can display matplotlib figures using display(). For an example, see https://docs.databricks.com/user-guide/visualizations/matplotlib-and-ggplot.html
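A minimal sketch of what that could look like, assuming a Databricks notebook where display() is available as a built-in. The data is just the first ten beer_freq values from the question, and the try/except guard is only there so the snippet also runs outside Databricks, where display() may not be defined.

```python
import matplotlib.pyplot as plt
import pandas as pd

# First ten beer_freq values from the question, standing in for the real sample.
sample_pd = pd.DataFrame(
    {"beer_freq": [0.0, 0.0, 1.0, 1.0, 1.0, 0.434783, 0.166667, 0.0, 0.0, 0.0]}
)

# Build the figure explicitly so there is a Figure object to hand to display().
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(sample_pd["beer_freq"], bins=100)
ax.set_xlabel("beer_freq")
ax.set_ylabel("count")

# Assumption: display() is the Databricks notebook built-in. It is not part of
# matplotlib or pandas, so skip the call where the name is not defined.
try:
    display(fig)
except NameError:
    pass
```

Creating the figure with plt.subplots() rather than calling sample_pd.hist() directly keeps a reference to the Figure, which is what gets passed to display().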


1 Comment

As of DBR 6.4+, %matplotlib inline is supported, so you no longer need to call display().
1

As of DBR 6.4+, you can use %matplotlib inline.

%matplotlib inline
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.hist('sepal_width', bins = 100)


0

Try the following:

import matplotlib.pyplot as plt
%matplotlib inline


0

It is not accessible. As Gaurav mentioned, use display() as follows:

col_df = heavy_pivot.select('beer_freq')
display(col_df)

This way, you don't need to convert it to a pandas DataFrame, and the final plot looks the same. Once the output is displayed, use the plot button under it to choose a histogram.



0

It seems that you are using an IPython or Python shell in a terminal. Since you are using Spark, I guess your shell is running on a remote server. To plot on a remote server, you need to enable X11 forwarding. See https://adoni.github.io/2019/01/08/plot-on-server/#through-x11 for reference.

Other options:

  1. Install Jupyter Notebook or JupyterLab and plot in a notebook environment.
  2. Save the plot to a file with plt.savefig('plot.png'), then download it to your local machine.
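The second option could be sketched roughly as follows. It uses the non-interactive Agg backend, which only renders to files and never opens a window, so plt.show() has nothing to display under it. The sample values are the first ten beer_freq rows from the question, and the output path in the temp directory is an arbitrary choice for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to files only
import matplotlib.pyplot as plt
import pandas as pd

# First ten beer_freq values from the question, standing in for the real sample.
sample_pd = pd.DataFrame(
    {"beer_freq": [0.0, 0.0, 1.0, 1.0, 1.0, 0.434783, 0.166667, 0.0, 0.0, 0.0]}
)

# DataFrame.hist returns a 2-D array of Axes, which is the array printed in the
# question. Take the single Axes and get its parent Figure.
axes = sample_pd.hist("beer_freq", bins=100)
fig = axes[0, 0].figure

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "beer_freq_hist.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)  # free the figure once it is on disk
print(out_path)
```

The saved PNG can then be copied off the server, for example with scp, and opened locally.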

