Join strings by space using pandas aggregate function

Question

I have a CSV file that contains rows with some similar id's. I found a nice approach doing this with dataframe and I found the code doing this from [this] 1 post.

Sample CSV file:

    id               messages
0   11  I am not driving home
1   11      Please pick me up
2   11     I don't have money
3  103   The car already park
4  103     No need for ticket
5  104       I will buy a car
6  104       I will buy a car

Desired output is:

Sample CSV file:

id   messages        
011   I am not driving home Please pick me up I don't have money     
103   The car already park No need for ticket         
104   I will buy a car

Now the code that I have so far is:

aggregation_functions = {'message':'sum'}
df_new = df.groupby(df['id']).aggregate(aggregation_functions)

Now what I am getting with this code is:

id   messages        
011   I am not driving homePlease pick me upI don't have money      
103   The car already parkNo need for ticket         
104   I will buy a car

I just want to have the space between words (eg. "homePlease" > "home Please") and avoid redundancy such as having two times of I will buy a car.

I already checked the post 2 but I couldn't find my answer.

Also do i need to use .reindex(columns=df.columns) after the aggregate(aggregation_functions)

Like:

df_new = df.groupby(df['id']).aggregate(aggregation_functions).reindex(columns=df.columns)

@WeNYoBen Thank you. It worked. I also wondering if reindex is necessary to use? thanks — Bilgin
– Bilgin, Commented May 31, 2019 at 1:10
Reindex has nothing to do here, that only changes the index/columns and/or its ordering. You want to aggregate the contents, not modify the structure. — cs95
– cs95, Commented May 31, 2019 at 1:15

cs95 · Accepted Answer · 2019-05-31 03:07:52Z

3

To remove redundancy, I suggest GroupBy.unique followed by str.join:

df.groupby('id')['messages'].unique().str.join(' ')

Alternatively, using GroupBy.agg with set + ' '.join:

df.groupby('id')['messages'].agg(lambda x: ' '.join(set(x)))

Both print,

# id
# 11     I don't have money I am not driving home Pleas...
# 103              No need for ticket The car already park
# 104                                     I will buy a car
# Name: messages, dtype: object

To return a DataFrame, call reset_index at the end... for example,

df.groupby('id')['messages'].unique().str.join(' ').reset_index()

#     id                                           messages
# 0   11  I am not driving home Please pick me up I don'...
# 1  103            The car already park No need for ticket
# 2  104                                   I will buy a car

edited May 31, 2019 at 3:07

answered May 31, 2019 at 1:11

cs95

406k106 gold badges744 silver badges797 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

Bilgin Over a year ago

Thanks for your comment. I need to have the header id and message, because I will use them in my dataframe. But based on your code and output, I am not able to have the header. Can you help. Thanks

cs95 Over a year ago

@Bilgin It's as simple as result = df.groupby('id')['messages'].unique().str.join(' ').reset_index()

Bilgin Over a year ago

thanks. Sorry, i am new in python. how about if I use df.groupby('id')['messages'].unique().str.join(' ')? is it the same?

cs95 Over a year ago

@Bilgin That returns a Series with ID as the index. If you call .reset_index() on that result, it converts the index into a column, and the overall result is a DataFrame. Take a look at the code and let me know if there are other issues.

Bilgin Over a year ago

I understand now. Thanks. Now it is returning the header. One last question is that how can I get rid of that index number on the left 0,1,2,... I am doing sum results generated and saving them into CSV file and those numbers appearing in the CSV file with no header and it is confusing. Thanks for your time

|

BENY · Accepted Answer · 2019-05-31 01:14:19Z

3

So it will be drop_duplicates first , the agg join

df.drop_duplicates().groupby('id',as_index=False).messages.agg(' '.join)

answered May 31, 2019 at 1:14

BENY

324k22 gold badges176 silver badges250 bronze badges

Comments

Hackaholic · Accepted Answer · 2019-05-31 01:10:52Z

2

you can better use apply with join:

>>> df
    id               messages
0   11  I am not driving home
1   11      Please pick me up
2   11     I don't have money
3  103   The car already park
4  103     No need for ticket
5  104       I will buy a car
6  104       I will buy a car

>>> df.groupby('id')['messages'].apply(lambda x: ' '.join(x))
id
11     I am not driving home Please pick me up I don'...
103              The car already park No need for ticket
104                    I will buy a car I will buy a car
Name: messages, dtype: object

answered May 31, 2019 at 1:10

Hackaholic

19.8k6 gold badges59 silver badges77 bronze badges

Collectives™ on Stack Overflow

Join strings by space using pandas aggregate function

3 Answers 3

7 Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

7 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related