pandas groupby report empty bins

Question

I want to make a 2d histogram (or other statistics, but let's take a histogram for the example) of a given 2d data set. The problem is that empty bins seem to be discarded altogether. For instance,

import numpy
import pandas

numpy.random.seed(35)
values = numpy.random.random((2,10000))

xbins = numpy.linspace(0, 1.2, 7)
ybins = numpy.linspace(0, 1, 6)

I can easily get the desired output with

print numpy.histogram2d(values[0], values[1], (xbins,ybins))

giving

[[ 408.  373.  405.  411.  400.]
 [ 390.  413.  400.  414.  368.]
 [ 354.  414.  421.  400.  413.]
 [ 426.  393.  407.  416.  412.]
 [ 412.  397.  396.  356.  401.]
 [   0.    0.    0.    0.    0.]]

However, with pandas,

df = pandas.DataFrame({'x': values[0], 'y': values[1]})
binned = df.groupby([pandas.cut(df['x'], xbins),
                     pandas.cut(df['y'], ybins)])
print binned.size().unstack()

prints

y           (0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1]
x                                                                 
(0, 0.2]         408         373         405         411       400
(0.2, 0.4]       390         413         400         414       368
(0.4, 0.6]       354         414         421         400       413
(0.6, 0.8]       426         393         407         416       412
(0.8, 1]         412         397         396         356       401

i.e., the last row, with 1 < x <= 1.2, is missing entirely because it contains no values. However, I would like to see that row explicitly (as when using numpy.histogram2d). In this example I can use numpy just fine, but in more complicated settings (n-dimensional binning, or calculating statistics other than counts, etc.), pandas can be more efficient to code and to compute with than numpy.

In principle I can come up with ways to check if an index is present, using something like

allkeys = [('({0}, {1}]'.format(xbins[i-1], xbins[i]),
            '({0}, {1}]'.format(ybins[j-1], ybins[j]))
           for j in xrange(1, len(ybins))
           for i in xrange(1, len(xbins))]

However, the problem is that the index formatting is not consistent, in the sense that, as you see above, the first index of binned is ['(0, 0.2]', '(0, 0.2]'] but the first entry in allkeys is ['(0.0, 0.2]', '(0.0, 0.2]'], so I cannot match allkeys to binned.viewkeys().

Any help is much appreciated.

Looks like .size() ignores missing values. A workaround could be to use count(), which appears to keep the missing values when applied to the binned groupby object in this case: binned.count()['x'].unstack().fillna(0).
– Alex Riley, Commented May 7, 2016 at 21:04
It seems the behavior might have changed after pandas v0.16 (the version on my work computer). If I run binned.count() there, I get ValueError: Cannot convert NA to integer. However, on my laptop (with v0.17.1) count() works fine.
– Cristóbal Sifón, Commented May 7, 2016 at 22:55
It's a guess, but what happens if you do binned.agg(lambda x : 1.0*x.count()).unstack()? It should return floats, so hopefully NaNs wouldn't be converted.
– ptrj, Commented May 8, 2016 at 21:16
@ptrj it works great. You do have to add .fillna(0), but that's fine, thanks!
– Cristóbal Sifón, Commented May 19, 2016 at 12:49

Dan Frank · Accepted Answer · 2017-09-25 23:54:12Z

It appears that pd.cut keeps your binning information, which means we can use it in a reindex (pd here is pandas imported as pd, and df is the DataFrame from the question):

In [79]: xcut = pd.cut(df['x'], xbins)

In [80]: ycut = pd.cut(df['y'], ybins)

In [81]: binned = df.groupby([xcut, ycut])

In [82]: sizes = binned.size()

In [85]: (sizes.reindex(pd.MultiIndex.from_product([xcut.cat.categories, ycut.cat.categories]))
    ...:       .unstack()
    ...:       .fillna(0.0))
    ...:
Out[85]:
            (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
(0.0, 0.2]       408.0       373.0       405.0       411.0       400.0
(0.2, 0.4]       390.0       413.0       400.0       414.0       368.0
(0.4, 0.6]       354.0       414.0       421.0       400.0       413.0
(0.6, 0.8]       426.0       393.0       407.0       416.0       412.0
(0.8, 1.0]       412.0       397.0       396.0       356.0       401.0
(1.0, 1.2]         0.0         0.0         0.0         0.0         0.0
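
For readers on a more recent pandas, here is a minimal self-contained sketch of a different route to the same table. It does not use the reindex above. Instead it assumes a pandas version in which groupby accepts the observed keyword (0.23 or later, as far as I know) and relies on observed=False to keep categories that contain no rows. The variable name hist is only illustrative.

```python
import numpy as np
import pandas as pd

# Same data and bin edges as in the question.
np.random.seed(35)
values = np.random.random((2, 10000))
xbins = np.linspace(0, 1.2, 7)
ybins = np.linspace(0, 1, 6)

df = pd.DataFrame({'x': values[0], 'y': values[1]})

# pd.cut returns categorical Series whose categories cover every bin,
# including bins that end up with no data.
xcut = pd.cut(df['x'], xbins)
ycut = pd.cut(df['y'], ybins)

# Assumption: this pandas version supports observed=..., and
# observed=False reports unobserved category combinations with size 0.
hist = df.groupby([xcut, ycut], observed=False).size().unstack()

print(hist)
```

If this behaves as assumed, hist has one row per x bin (six here), and the (1.0, 1.2] row is present and filled with zeros, so neither reindex nor fillna is needed. On older versions the reindex shown above remains the way to go.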
