14

Sample Pandas Dataframe:

ID Name COMMENT1 COMMENT2 NUM
1  dan  hi       hello    1
1  dan  you      friend   2
3  jon  yeah     nope     3
2  jon  dog      cat      .5
3  jon  yes      no       .1

I am trying to create a dataframe that groups by ID and NAME that concatenates COMMENT1 and COMMENT2 that also sums NUM.

This is what I'm looking for:

ID Name COMMENT1     COMMENT2        NUM
1  dan  hi you       hello friend    3
3  jon  yeah yes     nope no         3.1
2  jon  dog          cat             .5

I tried using this:

input_df = input_df.groupby(['ID', 'NAME', 'COMMENT1', 'COMMENT2']).sum().reset_index()

But it doesn't work.

If I use this:

input_df = input_df.groupby(['ID']).sum().reset_index()

It sums the NUM column but leaves out all other columns.

1

3 Answers 3

20

Let us make it into one line

df.groupby(['ID','Name'],as_index=False).agg(lambda x : x.sum() if x.dtype=='float64' else ' '.join(x))
Out[1510]: 
   ID Name  COMMENT1      COMMENT2  NUM
0   1  dan    hi you  hello friend  3.0
1   2  jon       dog           cat  0.5
2   3  jon  yeah yes       nope no  3.1
Sign up to request clarification or add additional context in comments.

6 Comments

if there's a NaN in the group this doesn't work, correct?
@Yuca you mean the group key ?
if instead of 'cat' there was NaN, then it looks like the code wouldn't work, no?
@Yuca you can replace the NaN to'NaN' for future adjust
@WeNYoBen, thank you. Does this preserve the order of the strings in the pandas dataframe column that is being concatenated?
|
5

You can also just tell .agg() which aggregator functions to use for each column, and for the string columns, pass ' '.join (notice there're no parenthesis since you don't want to call .join but rather pass it as the argument itself):

df.groupby(['ID','Name'],as_index=False).agg({'COMMENT1': ' '.join, 'COMMENT2': ' '.join, 'NUM': 'sum'})

Comments

-1

Converting your data example into a csv file, we can do the following:

import pandas as pd

def grouping_Cols_by_Cols(DF, grouping_Columns, num_Columns):
    # numerical columns can mess us up ...
    column_Names = DF.columns.tolist()
    # so, convert all columns' values to strings
    for column_Name in column_Names:
        DF[column_Name] = DF[column_Name].map(str) + ' '
    DF = DF.groupby(by=grouping_Columns).sum()

    # NOW, convert the numerical string columns to an expression ...
    for num_Col in num_Columns:
        column_Names = DF.columns.tolist()
        num_Col_i = column_Names.index(num_Col)
        for i in range(len(DF)):
            String = DF[num_Col].iloc[i] 
            value = eval(String.rstrip(' ').replace(' ','+'))
            DF.iat[i,num_Col_i] = value

    return DF

###############################################################
### Operations Section
###############################################################

df = pd.read_csv("UnCombinedData.csv")

grouping_Columns = ['ID','Name']
num_Columns = ['NUM']
df = grouping_Cols_by_Cols(df,grouping_Columns, num_Columns)

print df

With a little more work, the defined function could auto detect, which columns have numbers in them and add them to a numerical columns list.

I think this is similar, but not exact, to problems and challenges encountered in this post.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.