I want to make a 2d histogram (or other statistics, but let's take a histogram for the example) of a given 2d data set. The problem is that empty bins seem to be discarded altogether. For instance,
import numpy
import pandas
numpy.random.seed(35)
values = numpy.random.random((2,10000))
xbins = numpy.linspace(0, 1.2, 7)
ybins = numpy.linspace(0, 1, 6)
I can easily get the desired output with
print numpy.histogram2d(values[0], values[1], (xbins,ybins))
giving
[[ 408. 373. 405. 411. 400.]
[ 390. 413. 400. 414. 368.]
[ 354. 414. 421. 400. 413.]
[ 426. 393. 407. 416. 412.]
[ 412. 397. 396. 356. 401.]
[ 0. 0. 0. 0. 0.]]
However, with pandas,
df = pandas.DataFrame({'x': values[0], 'y': values[1]})
binned = df.groupby([pandas.cut(df['x'], xbins),
pandas.cut(df['y'], ybins)])
print binned.size().unstack()
prints
y (0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1]
x
(0, 0.2] 408 373 405 411 400
(0.2, 0.4] 390 413 400 414 368
(0.4, 0.6] 354 414 421 400 413
(0.6, 0.8] 426 393 407 416 412
(0.8, 1] 412 397 396 356 401
i.e., the last row, with 1 < x <= 1.2, is missing entirely, because there are no values in it. However I would like to see that explicitly (as when using numpy.histogram2d). In this example I can use numpy just fine but on more complicated settings (n-dimensional binning, or calculating statistics other than counts, etc), pandas can be more efficient to code and to calculate than numpy.
In principle I can come up with ways to check if an index is present, using something like
allkeys = [('({0}, {1}]'.format(xbins[i-1], xbins[i]),
'({0}, {1}]'.format(ybins[j-1], ybins[j]))
for j in xrange(1, len(ybins))
for i in xrange(1, len(xbins))]
However, the problem is that the index formatting is not consistent, in the sense that, as you see above, the first index of binned is ['(0, 0.2]', '(0, 0.2]'] but the first entry in allkeys is ['(0.0, 0.2]', '(0.0, 0.2]'], so I cannot match allkeys to binned.viewkeys().
Any help is much appreciated.
.size()ignores missing values. A workaround could be to usecount()which appears to keep the missing values when applied to thebinnedgroupby object in this case:binned.count()['x'].unstack().fillna(0).pandasv0.16(available in my work computer). If I runbinned.count()I getValueError: Cannot convert NA to integer. However in my laptop (withv0.17.1)count()works fine.binned.agg(lambda x : 1.0*x.count()).unstack()? It should return floats, so hopefully, nan's wouldn't be converted..fillna(0)but that's fine, thanks!