18

I am trying to make a histogram with a column from a dataframe which looks like

DataFrame[C0: int, C1: int, ...]

If I were to make a histogram with the column C1, what should I do?

Some things I have tried are

df.groupBy("C1").count().histogram()
df.C1.countByValue()

which do not work because of a mismatch in data types.

6 Answers

20

The pyspark_dist_explore package that @Chris van den Berg mentioned is quite nice. If you prefer not to add an additional dependency you can use this bit of code to plot a simple histogram.

import matplotlib.pyplot as plt
# Show histogram of the 'C1' column
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)

# This is a bit awkward but I believe this is the correct way to do it 
plt.hist(bins[:-1], bins=bins, weights=counts)

16

What worked for me is

df.groupBy("C1").count().rdd.values().histogram()

I had to convert to an RDD because I found the histogram method in the pyspark.RDD class, but not in the Spark SQL module.
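
As an aside, RDD.histogram also accepts an explicit, sorted list of bucket boundaries instead of a bucket count, so you can control the bin edges directly. A minimal sketch (the boundary values below are arbitrary, just for illustration):

counts_rdd = df.groupBy("C1").count().rdd.values()
# Pass sorted bucket boundaries instead of a bucket count
bins, counts = counts_rdd.histogram([0, 10, 50, 100, 500, 1000])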

1 Comment

Does this approach allow you to set bin size?

14

You can use the histogram_numeric Hive UDAF:

import random

from pyspark.sql import HiveContext

random.seed(323)

sqlContext = HiveContext(sc)  # sc is an existing SparkContext

n = 3  # Number of buckets
df = sqlContext.createDataFrame(
    sc.parallelize(enumerate(random.random() for _ in range(1000))),
    ["id", "v"]
)

hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))

hists.show(1, False)
## +------------------------------------------------------------------------------------+
## |histogram_numeric(v,3)                                                              |
## +------------------------------------------------------------------------------------+
## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
## +------------------------------------------------------------------------------------+

You can also extract the column of interest and use the histogram method on the RDD:

df.select("v").rdd.flatMap(lambda x: x).histogram(n)
## ([0.002028109534323752,
##  0.33410233677189705,
##  0.6661765640094703,
##  0.9982507912470436],
## [327, 326, 347])

1 Comment

It may be worth noting that histogram_numeric does not guarantee evenly-spaced bins -- this surprised me, anyway.

4

Let's say your values in C1 are between 1 and 1000 and you want a histogram with 10 bins. You can do something like df.withColumn("bins", df.C1/100).groupBy("bins").count(). If your binning is more complex, you can make a UDF for it (and at worst, you might need to analyze the column first, e.g. by using describe or some other method).
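
A minimal sketch of this approach, assuming C1 is roughly in the 1-1000 range and casting the bin to int so the grouping yields 10 discrete buckets rather than one group per distinct value:

from pyspark.sql import functions as F

binned = (
    df.withColumn("bin", (F.col("C1") / 100).cast("int"))
      .groupBy("bin")
      .count()
      .orderBy("bin")
)
binned.show()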

1 Comment

Don't forget to cast it as int, otherwise you'll get as many groups as before.

2

If you want to plot the histogram, you could use the pyspark_dist_explore package:

import matplotlib.pyplot as plt
from pyspark_dist_explore import hist, pandas_histogram

fig, ax = plt.subplots()
hist(ax, df.groupBy("C1").count().select("count"))

If you would like the data in a pandas DataFrame you could use:

pandas_df = pandas_histogram(df.groupBy("C1").count().select("count"))


-1

One easy way could be

import pandas as pd
x = df.select('symboling').toPandas()  # symboling is the column for histogram
x.plot(kind='hist')

1 Comment

This approach can be used for small datasets, otherwise it is suboptimal
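
If the DataFrame is too large to collect comfortably, a common workaround is to sample it before calling toPandas. A minimal sketch, assuming a 10% sample (without replacement) is representative enough for a quick look at the 'symboling' column used above:

import matplotlib.pyplot as plt

sampled = df.select('symboling').sample(False, 0.1, seed=42).toPandas()
sampled.plot(kind='hist')
plt.show()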
