5

I want to make a 2d histogram (or other statistics, but let's take a histogram for the example) of a given 2d data set. The problem is that empty bins seem to be discarded altogether. For instance,

import numpy
import pandas

numpy.random.seed(35)
values = numpy.random.random((2,10000))

xbins = numpy.linspace(0, 1.2, 7)
ybins = numpy.linspace(0, 1, 6)

I can easily get the desired output with

print numpy.histogram2d(values[0], values[1], (xbins,ybins))

giving

[[ 408.  373.  405.  411.  400.]
 [ 390.  413.  400.  414.  368.]
 [ 354.  414.  421.  400.  413.]
 [ 426.  393.  407.  416.  412.]
 [ 412.  397.  396.  356.  401.]
 [   0.    0.    0.    0.    0.]]

However, with pandas,

df = pandas.DataFrame({'x': values[0], 'y': values[1]})
binned = df.groupby([pandas.cut(df['x'], xbins),
                     pandas.cut(df['y'], ybins)])
print binned.size().unstack()

prints

y           (0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1]
x                                                                 
(0, 0.2]         408         373         405         411       400
(0.2, 0.4]       390         413         400         414       368
(0.4, 0.6]       354         414         421         400       413
(0.6, 0.8]       426         393         407         416       412
(0.8, 1]         412         397         396         356       401

i.e., the last row, with 1 < x <= 1.2, is missing entirely, because there are no values in it. However I would like to see that explicitly (as when using numpy.histogram2d). In this example I can use numpy just fine but on more complicated settings (n-dimensional binning, or calculating statistics other than counts, etc), pandas can be more efficient to code and to calculate than numpy.

In principle I can come up with ways to check if an index is present, using something like

allkeys = [('({0}, {1}]'.format(xbins[i-1], xbins[i]),
            '({0}, {1}]'.format(ybins[j-1], ybins[j]))
           for j in xrange(1, len(ybins))
           for i in xrange(1, len(xbins))]

However, the problem is that the index formatting is not consistent, in the sense that, as you see above, the first index of binned is ['(0, 0.2]', '(0, 0.2]'] but the first entry in allkeys is ['(0.0, 0.2]', '(0.0, 0.2]'], so I cannot match allkeys to binned.viewkeys().

Any help is much appreciated.

4
  • 1
    Looks like .size() ignores missing values. A workaround could be to use count() which appears to keep the missing values when applied to the binned groupby object in this case: binned.count()['x'].unstack().fillna(0). Commented May 7, 2016 at 21:04
  • 1
    It seems the behavior might have changed after pandas v0.16 (available in my work computer). If I run binned.count() I get ValueError: Cannot convert NA to integer. However in my laptop (with v0.17.1) count() works fine. Commented May 7, 2016 at 22:55
  • 2
    It's a guess, but what happens if you do binned.agg(lambda x : 1.0*x.count()).unstack()? It should return floats, so hopefully, nan's wouldn't be converted. Commented May 8, 2016 at 21:16
  • @ptrj it works great. You do have to add .fillna(0) but that's fine, thanks! Commented May 19, 2016 at 12:49

1 Answer 1

1

It appears that pd.cut keeps your binning information which means we can use it in a reindex:

In [79]: xcut = pd.cut(df['x'], xbins)

In [80]: ycut = pd.cut(df['y'], ybins)

In [81]: binned = df.groupby([xcut, ycut])

In [82]: sizes = binned.size()

In [85]: (sizes.reindex(pd.MultiIndex.from_product([xcut.cat.categories, ycut.cat.categories]))
    ...:       .unstack()
    ...:       .fillna(0.0))
    ...:
Out[85]:
            (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
(0.0, 0.2]       408.0       373.0       405.0       411.0       400.0
(0.2, 0.4]       390.0       413.0       400.0       414.0       368.0
(0.4, 0.6]       354.0       414.0       421.0       400.0       413.0
(0.6, 0.8]       426.0       393.0       407.0       416.0       412.0
(0.8, 1.0]       412.0       397.0       396.0       356.0       401.0
(1.0, 1.2]         0.0         0.0         0.0         0.0         0.0
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.