Combine duplicate rows in Pandas

Question

I got a dataframe where some rows contains almost duplicate values. I'd like to combine these rows as much as possible to reduce the row numbers. Let's say I got following dataframe:

One	Two	Three
A	B	C
B	B	B
C	A	B

In this example I'd like the output to be:

One	Two	Three
ABC	AB	CB

The real dataframe got thousands of rows with eight columns.

The csv from a dataframe-sample:

Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
A,A,A,A,A,A,A,A
A,A,A,A,A,A,A,B
A,A,A,A,A,A,A,C
A,A,A,A,A,A,B,A
A,A,A,A,A,A,B,B
A,A,A,A,A,A,B,C
A,A,A,A,A,A,C,A
A,A,A,A,A,A,C,B
A,A,A,A,A,A,C,C
C,C,C,C,C,C,A,A
C,C,C,C,C,C,A,B
C,C,C,C,C,C,A,C
C,C,C,C,C,C,B,A
C,C,C,C,C,C,B,B
C,C,C,C,C,C,B,C
C,C,C,C,C,C,C,A
C,C,C,C,C,C,C,B
C,C,C,C,C,C,C,C

To easier show how desired outcome woud look like:

Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
AC,AC,AC,AC,AC,AC,ABC,ABC

I've tried some code but I end up in real long code snippets which I doubt could be the best and most natural solution. Any suggestions?

Does duplicated values appear only in one column or they may be in multiple ones? — mattiatantardini
– mattiatantardini, Commented Nov 4, 2022 at 19:58
They could be in all different columns. It's a lot of different combinations of events which could either be A, B or C. So I'd like to combine these combinations to get the same combinations as the initial dataframe but reduce the number of rows by combining them. — OldSport
– OldSport, Commented Nov 4, 2022 at 20:30
What makes the first two rows "joinable" but not the third? You could look into cluster analysis and define precisely your metrics and merging criteria. Or you could also look into locality sensitive hashing. Basically, you need to tell us more about your merging decision making. — Pierre D
– Pierre D, Commented Nov 4, 2022 at 20:54
Oh, but now you've adjusted the question such that all examples show all the rows collapsed into one. Is that desired? If not, let me ask again: in which cases do you merge rows and when do you not? A precise definition is essential to understand the task. In addition think about the case of "chaining" (elongated clusters) where rows 1 and 2 would be merged, as well as 2 and 3, etc., but not 1 and 3. Then do you not merge some of them? Make several clusters? — Pierre D
– Pierre D, Commented Nov 5, 2022 at 16:04

Sarah Bennett · Accepted Answer · 2022-11-07 16:29:56Z

2

You can use something like this:

df = df.groupby(['Two'])['One','Three'].apply(''.join).reset_index()

If you can provide a small bit of code that creates the first df it'd be easier to try out solutions.

Also this other post may help: pandas - Merge nearly duplicate rows based on column value

EDIT:

Does this get you the output you're looking for?

joined_df = df.apply(''.join, axis=0)

variation of this: Concatenate all columns in a pandas dataframe

edited Nov 7, 2022 at 16:29

answered Nov 4, 2022 at 19:58

Sarah Bennett

213 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Chris Over a year ago

why not copy the table and use pd.read_clipboard()

OldSport Over a year ago

I've edited the first post to contain more samples from the dataframe to illustrate better.

Sarah Bennett Over a year ago

I did not know that was a thing, @Chris, very helpful!

mattiatantardini · Accepted Answer · 2022-11-04 20:41:16Z

1

If your data are all characters you can end up with this solution and collapse everything to one single row:

import pandas as pd

data = pd.read_csv("path/to/data")
collapsed = data.astype(str).sum().applymap(lambda x: ''.join(set(x)))

Check this answer on how to get unique characters in a string.

answered Nov 4, 2022 at 20:41

mattiatantardini

8571 gold badge12 silver badges39 bronze badges

2 Comments

OldSport Over a year ago

It's a good try but doesn't seem to be working for me. Each cell can hold maximum 3 unique characters.

mattiatantardini Over a year ago

Why cells can hold maximum 3 unique characters? It is a requirement from your use case? If you have more than 3 unique characters in your whole data, how to choose then the merging strategy?

Collectives™ on Stack Overflow

Combine duplicate rows in Pandas

2 Answers 2

3 Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related