I have a Pandas dataframe:

import pandas as pd

test = pd.DataFrame(columns=['GroupID', 'Sample', 'SampleMeta', 'Value'])
test.loc[0, :] = '1', 'S1', 'S1_meta', 1
test.loc[1, :] = '1', 'S1', 'S1_meta', 1
test.loc[2, :] = '2', 'S2', 'S2_meta', 1

I want to (1) group by two columns ('GroupID' and 'Sample'), (2) sum 'Value' per group, and (3) retain only unique values in 'SampleMeta' per group. The desired result ('GroupID' and 'Sample' as index) is shown:

                SampleMeta  Value
GroupID Sample                       
1       S1      S1_meta      2
2       S2      S2_meta      1 

df.groupby() and the .sum() method get close, but .sum() concatenates the string values in the 'SampleMeta' column within each group. As a result, the 'S1_meta' value is duplicated.

g = test.groupby(['GroupID', 'Sample'])
print(g.sum())

                SampleMeta      Value
GroupID Sample                       
1       S1      S1_metaS1_meta  2
2       S2      S2_meta         1 

Is there a way to achieve the desired result using groupby() and associated methods? Merging the summed 'Value' per group with a separate 'SampleMeta' DataFrame works but there must be a more elegant solution.
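For reference, the merge-based workaround mentioned above can be sketched as follows (a minimal version of what the question describes, assuming 'SampleMeta' is constant within each ('GroupID', 'Sample') group):

```python
import pandas as pd

test = pd.DataFrame(columns=['GroupID', 'Sample', 'SampleMeta', 'Value'])
test.loc[0, :] = '1', 'S1', 'S1_meta', 1
test.loc[1, :] = '1', 'S1', 'S1_meta', 1
test.loc[2, :] = '2', 'S2', 'S2_meta', 1

# Sum 'Value' per ('GroupID', 'Sample') group.
sums = test.groupby(['GroupID', 'Sample'])['Value'].sum()

# Build a separate Series of the unique 'SampleMeta' per group.
meta = (test.drop_duplicates(['GroupID', 'Sample'])
            .set_index(['GroupID', 'Sample'])['SampleMeta'])

# Align the two on the shared MultiIndex.
result = pd.concat([meta, sums], axis=1)
print(result)
```

This produces the desired shape, but as the question notes, it requires an intermediate DataFrame.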

1 Answer

Well, you can include SampleMeta as part of the groupby:

print(test.groupby(['GroupID', 'Sample', 'SampleMeta']).sum())

                           Value
GroupID Sample SampleMeta       
1       S1     S1_meta         2
2       S2     S2_meta         1

If you don't want SampleMeta left in the index when you're done, you can move it back out with reset_index:

print(test.groupby(['GroupID', 'Sample', 'SampleMeta']).sum().reset_index(level=2))

               SampleMeta  Value
GroupID Sample                  
1       S1        S1_meta      2
2       S2        S2_meta      1

This only works correctly if SampleMeta has no variation within each ['GroupID','Sample'] group. If there were variation, you would probably want to exclude SampleMeta from the groupby/sum entirely:

print(test.groupby(['GroupID', 'Sample'])['Value'].sum())

GroupID  Sample
1        S1        2
2        S2        1
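As an editorial aside (not part of the original answer): a single .agg call can also produce the desired two-level index without putting SampleMeta into the groupby keys, assuming each group has exactly one unique SampleMeta so that taking 'first' is safe:

```python
import pandas as pd

test = pd.DataFrame(columns=['GroupID', 'Sample', 'SampleMeta', 'Value'])
test.loc[0, :] = '1', 'S1', 'S1_meta', 1
test.loc[1, :] = '1', 'S1', 'S1_meta', 1
test.loc[2, :] = '2', 'S2', 'S2_meta', 1

# Aggregate each column differently in one pass:
# keep the first SampleMeta per group, sum Value per group.
result = test.groupby(['GroupID', 'Sample']).agg({'SampleMeta': 'first',
                                                  'Value': 'sum'})
print(result)
```

This avoids the extra index level and the subsequent reset_index step.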

1 Comment

Thanks, this works well for the example I gave. For large DataFrames (>100k rows) with several columns that I want to preserve, including those columns in the groupby made the operation very slow. So this strategy may not scale well to a large DataFrame with many columns (like `SampleMeta`) to preserve.
