3

I have a CSV file in which several rows share the same id, and I want to merge the messages of those rows into a single row per id. I found an approach that does this with a pandas DataFrame, and took the code from this post [1].

Sample CSV file:

    id               messages
0   11  I am not driving home
1   11      Please pick me up
2   11     I don't have money
3  103   The car already park
4  103     No need for ticket
5  104       I will buy a car
6  104       I will buy a car

Desired output is:

id   messages        
011   I am not driving home Please pick me up I don't have money     
103   The car already park No need for ticket         
104   I will buy a car              

Now the code that I have so far is:

aggregation_functions = {'messages': 'sum'}
df_new = df.groupby(df['id']).aggregate(aggregation_functions)

Now what I am getting with this code is:

id   messages        
011   I am not driving homePlease pick me upI don't have money      
103   The car already parkNo need for ticket         
104   I will buy a car 

I just want a space between the joined messages (e.g. "homePlease" should become "home Please") and no duplicates, such as "I will buy a car" appearing twice.

I already checked this post [2], but I couldn't find my answer there.

Also, do I need to call .reindex(columns=df.columns) after aggregate(aggregation_functions)?

For example:

df_new = df.groupby(df['id']).aggregate(aggregation_functions).reindex(columns=df.columns)
  • Change to aggregation_functions = {'messages': ' '.join} Commented May 31, 2019 at 1:03
  • @WeNYoBen Thank you. It worked. I'm also wondering whether reindex is necessary here. Thanks. Commented May 31, 2019 at 1:10
  • Reindex has nothing to do here, that only changes the index/columns and/or its ordering. You want to aggregate the contents, not modify the structure. Commented May 31, 2019 at 1:15
  • @cs95 thank you for your comment. I understand now. Commented May 31, 2019 at 1:16
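
To illustrate the comments above, here is a minimal sketch, assuming the sample data from the question is loaded into a DataFrame named df with the columns id and messages. It joins with ' '.join as suggested, then shows what the extra reindex call would do to that result.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [11, 11, 11, 103, 103, 104, 104],
    "messages": [
        "I am not driving home",
        "Please pick me up",
        "I don't have money",
        "The car already park",
        "No need for ticket",
        "I will buy a car",
        "I will buy a car",
    ],
})

# Join the messages of each id with a space instead of summing them.
joined = df.groupby("id").agg({"messages": " ".join})

# reindex only selects or reorders labels. The grouped result has no "id"
# column (id is the index), so reindexing to df.columns adds an all-NaN one.
reindexed = joined.reindex(columns=df.columns)

print(joined)
print(reindexed)
```

In this sketch the reindex call does not help: the joined text is already in joined, and reset_index() is the usual way to turn the id index back into a column.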

3 Answers

3

To remove redundancy, I suggest GroupBy.unique followed by str.join:

df.groupby('id')['messages'].unique().str.join(' ')

Alternatively, using GroupBy.agg with set + ' '.join:

df.groupby('id')['messages'].agg(lambda x: ' '.join(set(x)))

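Both return one joined string per id. unique keeps the messages in order of first appearance, while set is unordered, so the second variant may print them in a different order, such as: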

# id
# 11     I don't have money I am not driving home Pleas...
# 103              No need for ticket The car already park
# 104                                     I will buy a car
# Name: messages, dtype: object

To return a DataFrame, call reset_index at the end. For example:

df.groupby('id')['messages'].unique().str.join(' ').reset_index()

#     id                                           messages
# 0   11  I am not driving home Please pick me up I don'...
# 1  103            The car already park No need for ticket
# 2  104                                   I will buy a car
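
For anyone who wants to run this end to end, here is a self-contained sketch. The DataFrame is built by hand from the sample in the question (an assumption; in practice it would come from the CSV file), and the variable name result is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [11, 11, 11, 103, 103, 104, 104],
    "messages": [
        "I am not driving home",
        "Please pick me up",
        "I don't have money",
        "The car already park",
        "No need for ticket",
        "I will buy a car",
        "I will buy a car",
    ],
})

result = (
    df.groupby("id")["messages"]
      .unique()        # distinct messages per id, in order of first appearance
      .str.join(" ")   # join each group's messages with a single space
      .reset_index()   # turn the id index back into a regular column
)

print(result)
```

The result has the two columns id and messages, with one row per id.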

7 Comments

Thanks for your comment. I need the id and messages headers, because I will use them in my dataframe, but with your code and output I don't get the headers. Can you help? Thanks.
@Bilgin It's as simple as result = df.groupby('id')['messages'].unique().str.join(' ').reset_index()
Thanks. Sorry, I am new to Python. What if I use df.groupby('id')['messages'].unique().str.join(' ')? Is it the same?
@Bilgin That returns a Series with ID as the index. If you call .reset_index() on that result, it converts the index into a column, and the overall result is a DataFrame. Take a look at the code and let me know if there are other issues.
I understand now. Thanks. Now it returns the header. One last question: how can I get rid of the index numbers on the left (0, 1, 2, ...)? I am saving the generated results to a CSV file, and those numbers appear in the CSV file with no header, which is confusing. Thanks for your time.
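
On the last question about the 0, 1, 2, ... numbers in the saved file, one likely fix is to pass index=False to to_csv. This is a minimal sketch with a shortened version of the sample data; the in-memory buffer stands in for a real file path such as "output.csv", which is a hypothetical name.

```python
import io

import pandas as pd

# Shortened sample data based on the question.
df = pd.DataFrame({
    "id": [11, 11, 104, 104],
    "messages": [
        "I am not driving home",
        "Please pick me up",
        "I will buy a car",
        "I will buy a car",
    ],
})

result = df.groupby("id")["messages"].unique().str.join(" ").reset_index()

# index=False leaves the row labels (0, 1, 2, ...) out of the CSV output.
buffer = io.StringIO()  # stand-in for a path like "output.csv" (hypothetical)
result.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

With index=False the header line only contains id and messages, and each data row starts with the id rather than a row number.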
3

So use drop_duplicates first, then agg with join:

df.drop_duplicates().groupby('id',as_index=False).messages.agg(' '.join)

2

You can also use apply with join (note that this keeps duplicate messages, as the last group in the output shows):

>>> df
    id               messages
0   11  I am not driving home
1   11      Please pick me up
2   11     I don't have money
3  103   The car already park
4  103     No need for ticket
5  104       I will buy a car
6  104       I will buy a car

>>> df.groupby('id')['messages'].apply(lambda x: ' '.join(x))
id
11     I am not driving home Please pick me up I don'...
103              The car already park No need for ticket
104                    I will buy a car I will buy a car
Name: messages, dtype: object
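
If the duplicates should also go away while staying with apply, one option is to deduplicate inside the lambda. The sketch below uses dict.fromkeys, which is a swapped-in technique and not part of the answer above; it drops repeated messages and keeps the original order. The sample data is copied from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [11, 11, 11, 103, 103, 104, 104],
    "messages": [
        "I am not driving home",
        "Please pick me up",
        "I don't have money",
        "The car already park",
        "No need for ticket",
        "I will buy a car",
        "I will buy a car",
    ],
})

# dict.fromkeys keeps the first occurrence of each message, in order,
# so joining its keys gives the messages without repeats.
deduplicated = df.groupby("id")["messages"].apply(
    lambda msgs: " ".join(dict.fromkeys(msgs))
)

print(deduplicated)
```

Unlike set, a dict preserves insertion order, so the messages for each id stay in the order they appear in the file.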
