
Currently my dataframe looks something like this:

     ID  Year   Str1     Str2     Value
0    1   2014   high     black    120
1    1   2015   high     blue     20
2    2   2014   medium   red      10
3    2   2014   medium   blue     50
4    3   2015   low      blue     30
5    3   2015   high     blue     .5
6    3   2015   high     red      10

Desired:

     ID  Year   Str1        Str2          Value
0    1   2014   high        black         120
1    1   2015   high        blue          20
2    2   2014   medium      red, blue     60
3    3   2015   low, high   blue, red     40.5

I'm trying to group by the columns ID and Year, then get the sum of the numeric columns but a list of the strings. If removing duplicate strings is possible, as in the example, that would be helpful but is not necessary.

This operation will be applied to ~100 dataframes; ID and Year are the only column names found in every dataframe. The dataframes vary slightly: they have a Value column, Str columns, or both.

I have browsed stackoverflow a lot and tried:

df.groupby(['ID', 'Year'], as_index=False).agg(lambda x: x.sum() if x.dtype=='int64' else ', '.join(x))

This gave the error "DataFrame object has no attribute dtype" (which makes sense, since grouping by multiple columns returns more dataframes).

I also tried looping over the columns one by one: if a column holds numbers, compute the sum, else make a list:

for col in df:
    if col in ['ID', 'Year']:
        continue 

    if df[col].dtype.kind == 'i' or df[col].dtype.kind == 'f':
         df = df.groupby(['ID', 'Year'])[col].apply(sum)
    else:
         df = df.groupby(['ID', 'Year'])[col].unique().reset_index()

However, after the first pass through the loop, all the other columns were gone.

Thanks in advance.

2 Answers


You need to check whether each column is numeric, e.g. with np.issubdtype:

import numpy as np

df = (df.groupby(['ID', 'Year'], as_index=False)
        .agg(lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ', '.join(x)))
print(df)
   ID  Year             Str1             Str2  Value
0   1  2014             high            black  120.0
1   1  2015             high             blue   20.0
2   2  2014   medium, medium        red, blue   60.0
3   3  2015  low, high, high  blue, blue, red   40.5

Or use is_numeric_dtype:

from pandas.api.types import is_numeric_dtype

df = (df.groupby(['ID', 'Year'], as_index=False)
        .agg(lambda x: x.sum() if is_numeric_dtype(x) else ', '.join(x)))
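
The output above keeps repeated strings (e.g. "medium, medium"), while the desired output in the question lists each string once. Here is a minimal sketch of one way to get that, assuming the string columns hold plain Python strings with no missing values. The helper names join_unique and aggregate_by_keys are made up for illustration. It builds a per-column aggregation mapping, so it should also cope with frames that have only the Value column or only the Str columns:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def join_unique(values):
    # dict.fromkeys keeps first-seen order while dropping repeats
    return ', '.join(dict.fromkeys(values))


def aggregate_by_keys(df, keys=('ID', 'Year')):
    # hypothetical helper: sum numeric columns, join unique strings otherwise
    keys = list(keys)
    how = {
        col: 'sum' if is_numeric_dtype(df[col]) else join_unique
        for col in df.columns
        if col not in keys
    }
    return df.groupby(keys, as_index=False).agg(how)


# sample data copied from the question
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 3],
    'Year': [2014, 2015, 2014, 2014, 2015, 2015, 2015],
    'Str1': ['high', 'high', 'medium', 'medium', 'low', 'high', 'high'],
    'Str2': ['black', 'blue', 'red', 'blue', 'blue', 'blue', 'red'],
    'Value': [120, 20, 10, 50, 30, .5, 10],
})

result = aggregate_by_keys(df)
print(result)
```

On the sample data this gives "red, blue" and "low, high" for the grouped rows, with sums of 60 and 40.5.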

1 Comment

If anyone encounters strange behaviour and instead of getting proper lists/sums you get a list of column names for every row, you might have NaN values in the data. Replacing NaN values with df = df.fillna('') was required for this to work.
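
One way to sidestep the NaN problem described in this comment without overwriting the data might be to drop missing values inside the joining function. A minimal sketch, assuming missing strings should simply be skipped; join_non_null is a hypothetical helper name and the data is made up:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def join_non_null(values):
    # skip NaN/None so str.join only ever sees strings
    return ', '.join(values.dropna().astype(str))


# made-up frame with one missing string
df = pd.DataFrame({
    'ID': [1, 1, 2],
    'Year': [2014, 2014, 2015],
    'Str1': ['high', np.nan, 'low'],
    'Value': [1.0, 2.0, 3.0],
})

# sum numeric columns, join the non-missing strings otherwise
how = {
    col: 'sum' if is_numeric_dtype(df[col]) else join_non_null
    for col in df.columns
    if col not in ('ID', 'Year')
}
result = df.groupby(['ID', 'Year'], as_index=False).agg(how)
print(result)
```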

I had a similar question: I have data like the one below and want to group by Email and apply different aggregation functions to different columns, so the standard groupby function wasn't good enough.

Anyway, here's a dummy dataset:

    Email            Phone          State
0   [email protected] 123-456-7890    NY
1   [email protected] 321-654-0987    LA
2   [email protected]    123-789-4567    WA
3   [email protected] 873-345-3456    MN
4   [email protected] 123-345-3456    NY
5   [email protected] 000-000-0000    KY

It would be useful to know which one is the first dupe item, so we process that and ignore the others. So first up, I want to mark the first duplicate item.

This looks complicated, but all it does is take the boolean mask that is True for every dupe and AND it with the mask that is True for every row that is not a later repeat, which leaves only the first dupe of each group.

df["first_dupe"] = df.duplicated("Email", keep=False) & ~df.duplicated("Email", keep="first")
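
To see what the two masks do, here is a small sketch on made-up data (the addresses are placeholders, not the ones from the table above). duplicated(keep=False) flags every row whose Email occurs more than once, while duplicated(keep="first") flags all but the first occurrence, so negating it and combining the two with & leaves only the first row of each duplicated group:

```python
import pandas as pd

# placeholder data: one address appears three times, one appears once
df = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "a@example.com", "a@example.com"],
    "State": ["NY", "LA", "WA", "MN"],
})

all_dupes = df.duplicated("Email", keep=False)       # True, False, True, True
later_dupes = df.duplicated("Email", keep="first")   # False, False, True, True
df["first_dupe"] = all_dupes & ~later_dupes          # True, False, False, False
print(df)
```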

Then I applied this function to the dataframe:

def combine_rows(row, key="Email", cols_to_combine=("Phone", "State")):
    """Take a row and look at its key column.

    If the row is the first dupe, append the data in cols_to_combine from the
    other rows with the same key. Needs the global dataframe df to have a bool
    column first_dupe that is True where the row is the first dupe.
    """
    if row["first_dupe"]:
        # all rows sharing this row's key, the first dupe included
        dupes = df[df[key] == row[key]]

        # skip the first row, since that's our first_dupe
        for _, dupe_row in dupes.iloc[1:].iterrows():
            for col in cols_to_combine:
                row[col] += ", " + dupe_row[col]
        # make sure first_dupe doesn't get processed again
        row["first_dupe"] = False
    return row

df = df.apply(combine_rows, axis=1, result_type=None)

You can modify the combine_rows function to do different things to different columns.
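
If the goal is simply one row per Email, a groupby with a per-column mapping may be a shorter route than a row-wise apply. This is a different technique from the function above, sketched on placeholder data; the aggregation chosen for each column is only an example:

```python
import pandas as pd

# placeholder data, not the addresses from the table above
df = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "a@example.com"],
    "Phone": ["123-456-7890", "321-654-0987", "123-789-4567"],
    "State": ["NY", "LA", "NY"],
})

combined = df.groupby("Email", as_index=False, sort=False).agg({
    "Phone": ", ".join,                               # keep every phone number
    "State": lambda s: ", ".join(dict.fromkeys(s)),   # keep each state once
})
print(combined)
```

Unlike combine_rows, this returns a frame with one row per Email rather than leaving the later duplicate rows in place.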
