Python pandas groupby multiple columns, creating list of strings but summing numbers

Question

My dataframe currently looks something like this:

     ID  Year   Str1     Str2     Value
0    1   2014   high     black    120
1    1   2015   high     blue     20
2    2   2014   medium   red      10
3    2   2014   medium   blue     50
4    3   2015   low      blue     30
5    3   2015   high     blue     .5
6    3   2015   high     red      10

Desired:

     ID  Year   Str1        Str2          Value
0    1   2014   high        black         120
1    1   2015   high        blue          20
2    2   2014   medium      red, blue     60
3    3   2015   low, high   blue, red     40.5

I'm trying to group by the ID and Year columns, then get the sum of the numbers but a list of the strings. Removing duplicate strings, as in the example, would be helpful but is not necessary.

This operation will be applied to ~100 dataframes. ID and Year are the only column names found in every dataframe. The dataframes vary slightly: each has a value column, string columns, or both.

I have browsed Stack Overflow a lot and tried:

df.groupby(['ID', 'Year'], as_index=False).agg(lambda x: x.sum() if x.dtype=='int64' else ', '.join(x))

This raised the error "'DataFrame' object has no attribute 'dtype'" (which makes sense, since grouping by multiple columns returns more dataframes).

I also tried looping over the columns one by one: if a column holds numbers, take the sum; otherwise, make a list:

for col in df:
    if col in ['ID', 'Year']:
        continue 

    if df[col].dtype.kind == 'i' or df[col].dtype.kind == 'f':
         df = df.groupby(['ID', 'Year'])[col].apply(sum)
    else:
         df = df.groupby(['ID', 'Year'])[col].unique().reset_index()

However, after the first iteration, all the other columns were gone.

Thanks in advance.

jezrael · Accepted Answer · 2018-07-14 11:31:52Z

You need to check whether each column is numeric, e.g. with np.issubdtype:

import numpy as np

df = (df.groupby(['ID', 'Year'], as_index=False)
        .agg(lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ', '.join(x)))
print(df)
   ID  Year             Str1             Str2  Value
0   1  2014             high            black  120.0
1   1  2015             high             blue   20.0
2   2  2014   medium, medium        red, blue   60.0
3   3  2015  low, high, high  blue, blue, red   40.5

Alternatively, use is_numeric_dtype:

from pandas.api.types import is_numeric_dtype

df = (df.groupby(['ID', 'Year'], as_index=False)
        .agg(lambda x: x.sum() if is_numeric_dtype(x) else ', '.join(x)))
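
The question also asks for duplicate strings to be removed within each group. A minimal sketch of one way to do that, assuming the non-numeric columns contain only strings: the helper name sum_or_join_unique is hypothetical, and the sample frame is rebuilt from the question's data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3, 3],
    "Year": [2014, 2015, 2014, 2014, 2015, 2015, 2015],
    "Str1": ["high", "high", "medium", "medium", "low", "high", "high"],
    "Str2": ["black", "blue", "red", "blue", "blue", "blue", "red"],
    "Value": [120, 20, 10, 50, 30, 0.5, 10],
})


def sum_or_join_unique(col):
    """Sum a numeric column; join the distinct strings of any other column."""
    if is_numeric_dtype(col):
        return col.sum()
    # Series.unique() keeps values in order of first appearance
    return ", ".join(col.unique())


out = df.groupby(["ID", "Year"], as_index=False).agg(sum_or_join_unique)
print(out)
```

With the question's data, the group for ID 3 and Year 2015 should come out as "low, high" and "blue, red", matching the desired table. Because the function only inspects the dtype of whatever column it receives, it should also work for frames that have only string columns or only value columns.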

1 Comment

Nclsl Over a year ago

If anyone encounters strange behaviour and instead of getting proper lists/sums you get a list of column names for every row, you might have NaN values in the data. Replacing NaN values with df = df.fillna('') was required for this to work.
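
As an alternative to df.fillna(''), which applies to every column, missing values can be dropped inside the aggregation function itself. This is only a sketch, assuming that missing strings should simply be left out of the joined result; the helper name sum_or_join and the small sample frame are made up for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical sample with a missing string value
df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Year": [2014, 2014, 2015],
    "Str1": ["high", np.nan, "low"],
    "Value": [1.0, 2.0, 3.0],
})


def sum_or_join(col):
    """Sum a numeric column; join the non-missing strings of any other column."""
    if is_numeric_dtype(col):
        return col.sum()
    # dropna() removes NaN so that join only ever sees strings
    return ", ".join(col.dropna().astype(str))


out = df.groupby(["ID", "Year"], as_index=False).agg(sum_or_join)
print(out)
```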

K.O · Accepted Answer · 2018-08-30 12:46:46Z

I had a similar problem: I wanted to group by Email and apply different aggregation functions to different columns, so the standard groupby call wasn't good enough.

Here's a dummy dataset:

    Email            Phone          State
0   [email protected] 123-456-7890    NY
1   [email protected] 321-654-0987    LA
2   [email protected]    123-789-4567    WA
3   [email protected] 873-345-3456    MN
4   [email protected] 123-345-3456    NY
5   [email protected] 000-000-0000    KY

It would be useful to know which row is the first of each set of duplicates, so we can process that one and ignore the others. So first, I want to mark the first duplicate item.

This looks complicated, but it builds a boolean mask that is True for every row whose Email appears more than once, and ANDs it with a mask that is True for the first occurrence of each Email:

df["first_dupe"] = df.duplicated("Email", keep=False) & ~df.duplicated("Email", keep="first")
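
To make the two masks easier to follow, here is a small sketch that computes them separately. The email addresses are placeholders (the ones in the dataset above are not available), so the duplicate pattern is an assumption made for illustration.

```python
import pandas as pd

# Placeholder emails: rows 0-1 share one address, row 2 is unique,
# rows 3-5 share another address
df = pd.DataFrame({
    "Email": ["a@example.com", "a@example.com", "b@example.com",
              "c@example.com", "c@example.com", "c@example.com"],
    "State": ["NY", "LA", "WA", "MN", "NY", "KY"],
})

# True for every row whose Email occurs more than once
is_dupe = df.duplicated("Email", keep=False)
# True for the first occurrence of each Email (unique rows included)
is_first = ~df.duplicated("Email", keep="first")

df["first_dupe"] = is_dupe & is_first
print(df)
```

Only rows 0 and 3 end up marked: row 2 is a first occurrence but not a duplicate, so the AND excludes it.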

Then I applied this function to the dataframe:

def combine_rows(row, key="Email", cols_to_combine=("Phone", "State")):
    """Combine duplicate rows into the first duplicate row.

    If the row is the first dupe for its key, append the data in cols_to_combine
    from the other rows with the same key. Needs a module-level dataframe df with
    a bool column first_dupe that is True if the row is the first dupe.
    """
    if row["first_dupe"]:
        # all rows that share this row's key, including the row itself
        dupes = df[df[key] == row[key]]

        # skip the first row, since that's our first_dupe
        for _, dupe_row in dupes.iloc[1:].iterrows():
            for col in cols_to_combine:
                row[col] += ", " + dupe_row[col]
        # make sure first_dupe doesn't get processed again
        row["first_dupe"] = False
    return row

df = df.apply(combine_rows, axis=1, result_type=None)

You can modify the combine_rows function to do different things to different columns.
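
For comparison, per-column behaviour can also be expressed by passing a dict to groupby(...).agg. This is a different technique from the row-wise function above, not a rewrite of it, and the sketch uses placeholder email addresses. Unlike the apply-based approach, it collapses each group into a single row.

```python
import pandas as pd

# Placeholder emails; phone numbers and states follow the dummy dataset
df = pd.DataFrame({
    "Email": ["a@example.com", "a@example.com", "b@example.com",
              "c@example.com", "c@example.com", "c@example.com"],
    "Phone": ["123-456-7890", "321-654-0987", "123-789-4567",
              "873-345-3456", "123-345-3456", "000-000-0000"],
    "State": ["NY", "LA", "WA", "MN", "NY", "KY"],
})

# One aggregation per column: keep every phone number,
# but list each state only once
combined = df.groupby("Email", as_index=False, sort=False).agg({
    "Phone": ", ".join,
    "State": lambda s: ", ".join(s.unique()),
})
print(combined)
```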
