When doing groupby counts over multiple columns I get an error. Here is my dataframe and also an example that simply labels the distinct 'b' and 'c' groups.

df = pd.DataFrame(np.random.randint(0,2,(4,4)),
                  columns=['a', 'b', 'c', 'd'])
df['gr'] = df.groupby(['b', 'c']).grouper.group_info[0]
print df
   a  b  c  d  gr
0  0  1  0  0   1
1  1  1  1  0   2
2  0  0  1  0   0
3  1  1  1  1   2

However, when the example is changed slightly so that count() is called instead of grouper.group_info[0], an error appears.

df = pd.DataFrame(np.random.randint(0,2,(4,4)),
                  columns=['a', 'b', 'c', 'd'])
df['gr'] = df.groupby(['b', 'c']).count()
print df

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-70-a46f632214e1> in <module>()
      1 df = pd.DataFrame(np.random.randint(0,2,(4,4)),
      2                   columns=['a', 'b', 'c', 'd'])
----> 3 df['gr'] = df.groupby(['b', 'c']).count()
      4 print df

C:\Python27\lib\site-packages\pandas\core\frame.pyc in __setitem__(self, key, value)
   2036         else:
   2037             # set column
-> 2038             self._set_item(key, value)
   2039 
   2040     def _setitem_slice(self, key, value):

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _set_item(self, key, value)
   2082         ensure homogeneity.
   2083         """
-> 2084         value = self._sanitize_column(key, value)
   2085         NDFrame._set_item(self, key, value)
   2086 

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _sanitize_column(self, key, value)
   2110                     value = value.values.copy()
   2111                 else:
-> 2112                     value = value.reindex(self.index).values
   2113 
   2114                 if is_frame:

C:\Python27\lib\site-packages\pandas\core\frame.pyc in reindex(self, index, columns, method, level, fill_value, limit, copy)
   2527         if index is not None:
   2528             frame = frame._reindex_index(index, method, copy, level,
-> 2529                                          fill_value, limit)
   2530 
   2531         return frame

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _reindex_index(self, new_index, method, copy, level, fill_value, limit)
   2606                        limit=None):
   2607         new_index, indexer = self.index.reindex(new_index, method, level,
-> 2608                                                 limit=limit)
   2609         return self._reindex_with_indexers(new_index, indexer, None, None,
   2610                                            copy, fill_value)

C:\Python27\lib\site-packages\pandas\core\index.pyc in reindex(self, target, method, level, limit)
   2181             else:
   2182                 # hopefully?
-> 2183                 target = MultiIndex.from_tuples(target)
   2184 
   2185         return target, indexer

C:\Python27\lib\site-packages\pandas\core\index.pyc in from_tuples(cls, tuples, sortorder, names)
   1803                 tuples = tuples.values
   1804 
-> 1805             arrays = list(lib.tuples_to_object_array(tuples).T)
   1806         elif isinstance(tuples, list):
   1807             arrays = list(lib.to_object_array_tuples(tuples).T)

C:\Python27\lib\site-packages\pandas\lib.pyd in pandas.lib.tuples_to_object_array (pandas\lib.c:42342)()

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'

1 Answer

Evaluate df.groupby(['b', 'c']).count() in an interactive session:

In [150]: df.groupby(['b', 'c']).count()
Out[150]: 
     a  b  c  d
b c            
0 0  1  1  1  1
  1  1  1  1  1
1 1  2  2  2  2

This is a whole DataFrame, indexed by the (b, c) group keys. It is probably not what you want to assign to a new column of df (in fact, you cannot assign a DataFrame to a single column, which is why an exception, albeit a cryptic one, is raised).
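The mismatch can be checked directly. The following is a minimal sketch, assuming a recent pandas and using the same values as the seeded example in this answer, written out explicitly so the result does not depend on the random generator. It only inspects the type and index of the grouped result; the exact columns returned by count() may differ between pandas versions.

```python
import pandas as pd

# Same values as the seeded example in this answer, written out explicitly.
df = pd.DataFrame({'a': [1, 1, 1, 0],
                   'b': [1, 1, 0, 1],
                   'c': [0, 1, 0, 1],
                   'd': [0, 1, 1, 0]})

counts = df.groupby(['b', 'c']).count()

# The result is a DataFrame with one row per (b, c) group ...
print(type(counts).__name__)
# ... indexed by a two-level MultiIndex built from the group keys,
print(counts.index.nlevels)
# whereas df itself has an ordinary single-level index.
print(df.index.nlevels)
```

Because the two indexes do not line up, pandas has no obvious way to align the grouped result with the rows of df.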


If you wish to create a new column which counts the number of rows in each group, you could use

df['gr'] = df.groupby(['b', 'c'])['a'].transform('count')

For example,

import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0, 2, (4, 4)),
                  columns=['a', 'b', 'c', 'd'])
print(df)
#    a  b  c  d
# 0  1  1  0  0
# 1  1  1  1  1
# 2  1  0  0  1
# 3  0  1  1  0

df['gr'] = df.groupby(['b', 'c'])['a'].transform('count')

df['comp_ids'] = df.groupby(['b', 'c']).grouper.group_info[0]
print(df)

yields

   a  b  c  d  gr  comp_ids
0  1  1  0  0   1         1
1  1  1  1  1   2         2
2  1  0  0  1   1         0
3  0  1  1  0   2         2

Notice that df.groupby(['b', 'c']).grouper.group_info[0] is returning something other than the counts of the number of rows in each group. Rather, it is returning a label for each group.
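As a hedged aside: grouper.group_info is an internal attribute and may not be available in every pandas version. A minimal sketch of the same two columns, assuming a pandas release that provides the public GroupBy.ngroup() method, might look like the following. It uses transform('size') rather than transform('count'); the two agree here because the data contains no missing values, but count skips NaN while size does not.

```python
import pandas as pd

# Same values as the seeded example above, written out explicitly.
df = pd.DataFrame({'a': [1, 1, 1, 0],
                   'b': [1, 1, 0, 1],
                   'c': [0, 1, 0, 1],
                   'd': [0, 1, 1, 0]})

grouped = df.groupby(['b', 'c'])

# Number of rows in each (b, c) group, broadcast back to every row.
sizes = grouped['a'].transform('size')

# Integer label of the group each row belongs to (groups sorted by key).
labels = grouped.ngroup()

df['gr'] = sizes
df['comp_ids'] = labels
print(df)
# gr is [1, 2, 1, 2] and comp_ids is [1, 2, 0, 2], matching the table above.
```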

3 Comments

If I turn it into a Series using df.groupby(['b', 'c'])['a'].count() it still doesn't work. Also notice that df['gr'] = df['a']+df['b'] works so I don't understand your comment about not being able to assign columns to a dataframe.
df['a']+df['b'] is a Series with a single-level index, so there is no problem assigning it to df['gr']. df.groupby(['b', 'c'])['a'].count() is a Series, but it has a multiindex, so it is still not clear how that could be assigned to df['gr'], which has a single-level index.
I like the comment # hopefully? (!) in the exception, probably that's the bit to be in a try/except for a kinder message.
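To illustrate the point in the second comment about the multiindex: one way to bring a per-group Series back onto the rows of df is to join it on the grouping columns. This is only a sketch under the assumption that DataFrame.join with on= is acceptable here; the name counts is introduced for illustration, and transform, as shown in the answer, is the more direct route.

```python
import pandas as pd

# Same values as the seeded example in the answer, written out explicitly.
df = pd.DataFrame({'a': [1, 1, 1, 0],
                   'b': [1, 1, 0, 1],
                   'c': [0, 1, 0, 1],
                   'd': [0, 1, 1, 0]})

# One count per (b, c) group; the Series is indexed by a MultiIndex.
counts = df.groupby(['b', 'c'])['a'].count().rename('gr')

# Match each row's (b, c) values against that MultiIndex.
result = df.join(counts, on=['b', 'c'])
print(result)
```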
