18

I am trying to make a histogram with a column from a dataframe which looks like

DataFrame[C0: int, C1: int, ...]

If I were to make a histogram with the column C1, what should I do?

Some things I have tried are

df.groupBy("C1").count().histogram()
df.C1.countByValue()

which do not work because of a mismatch in data types.

6 Answers

20

The pyspark_dist_explore package that @Chris van den Berg mentioned is quite nice. If you prefer not to add an additional dependency you can use this bit of code to plot a simple histogram.

import matplotlib.pyplot as plt
# Show histogram of the 'C1' column
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)

# This is a bit awkward but I believe this is the correct way to do it 
plt.hist(bins[:-1], bins=bins, weights=counts)

16

What worked for me is

df.groupBy("C1").count().rdd.values().histogram()

I had to convert to an RDD because I found the histogram method in the pyspark.RDD class, but not in the Spark SQL module.
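
As an aside, RDD.histogram also accepts an explicit, sorted list of bucket boundaries instead of a bucket count, so you can control the bin edges directly. A minimal sketch (the boundary values below are arbitrary, just for illustration):

counts_rdd = df.groupBy("C1").count().rdd.values()
# Pass sorted bucket boundaries instead of a bucket count
bins, counts = counts_rdd.histogram([0, 10, 50, 100, 500, 1000])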

1 Comment

Does this approach allow you to set bin size?

14

You can use the histogram_numeric Hive UDAF:

import random

from pyspark.sql import HiveContext

random.seed(323)

sqlContext = HiveContext(sc)  # sc is an existing SparkContext

n = 3  # Number of buckets
df = sqlContext.createDataFrame(
    sc.parallelize(enumerate(random.random() for _ in range(1000))),
    ["id", "v"]
)

hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))

hists.show(1, False)
## +------------------------------------------------------------------------------------+
## |histogram_numeric(v,3)                                                              |
## +------------------------------------------------------------------------------------+
## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
## +------------------------------------------------------------------------------------+

You can also extract the column of interest and use the histogram method on the RDD:

df.select("v").rdd.flatMap(lambda x: x).histogram(n)
## ([0.002028109534323752,
##  0.33410233677189705,
##  0.6661765640094703,
##  0.9982507912470436],
## [327, 326, 347])

1 Comment

It may be worth noting that histogram_numeric does not guarantee evenly-spaced bins -- this surprised me, anyway.

4

Let's say your values in C1 are between 1 and 1000 and you want a histogram with 10 bins. You can do something like df.withColumn("bins", df.C1/100).groupBy("bins").count(). If your binning is more complex, you can make a UDF for it (and at worst, you might need to analyze the column first, e.g. by using describe or some other method).
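
A minimal sketch of this approach, assuming C1 is roughly in the 1-1000 range and casting the bin to int so the grouping yields 10 discrete buckets rather than one group per distinct value:

from pyspark.sql import functions as F

binned = (
    df.withColumn("bin", (F.col("C1") / 100).cast("int"))
      .groupBy("bin")
      .count()
      .orderBy("bin")
)
binned.show()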

1 Comment

Don't forget to cast it as int, otherwise you'll get as many groups as before.

2

If you want to plot the histogram, you could use the pyspark_dist_explore package:

import matplotlib.pyplot as plt
from pyspark_dist_explore import hist, pandas_histogram

fig, ax = plt.subplots()
hist(ax, df.groupBy("C1").count().select("count"))

If you would like the data in a pandas DataFrame you could use:

pandas_df = pandas_histogram(df.groupBy("C1").count().select("count"))


-1

One easy way could be

import pandas as pd
x = df.select('symboling').toPandas()  # symboling is the column for histogram
x.plot(kind='hist')

1 Comment

This approach can be used for small datasets, otherwise it is suboptimal
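
If the DataFrame is too large to collect comfortably, a common workaround is to sample it before calling toPandas. A minimal sketch, assuming a 10% sample (without replacement) is representative enough for a quick look at the 'symboling' column used above:

import matplotlib.pyplot as plt

sampled = df.select('symboling').sample(False, 0.1, seed=42).toPandas()
sampled.plot(kind='hist')
plt.show()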
