1

I got a dataframe where some rows contains almost duplicate values. I'd like to combine these rows as much as possible to reduce the row numbers. Let's say I got following dataframe:

One Two Three
A B C
B B B
C A B

In this example I'd like the output to be:

One Two Three
ABC AB CB

The real dataframe got thousands of rows with eight columns.

The csv from a dataframe-sample:

Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
A,A,A,A,A,A,A,A
A,A,A,A,A,A,A,B
A,A,A,A,A,A,A,C
A,A,A,A,A,A,B,A
A,A,A,A,A,A,B,B
A,A,A,A,A,A,B,C
A,A,A,A,A,A,C,A
A,A,A,A,A,A,C,B
A,A,A,A,A,A,C,C
C,C,C,C,C,C,A,A
C,C,C,C,C,C,A,B
C,C,C,C,C,C,A,C
C,C,C,C,C,C,B,A
C,C,C,C,C,C,B,B
C,C,C,C,C,C,B,C
C,C,C,C,C,C,C,A
C,C,C,C,C,C,C,B
C,C,C,C,C,C,C,C

To easier show how desired outcome woud look like:

Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8
AC,AC,AC,AC,AC,AC,ABC,ABC

I've tried some code but I end up in real long code snippets which I doubt could be the best and most natural solution. Any suggestions?

5
  • Does duplicated values appear only in one column or they may be in multiple ones? Commented Nov 4, 2022 at 19:58
  • 1
    They could be in all different columns. It's a lot of different combinations of events which could either be A, B or C. So I'd like to combine these combinations to get the same combinations as the initial dataframe but reduce the number of rows by combining them. Commented Nov 4, 2022 at 20:30
  • What makes the first two rows "joinable" but not the third? You could look into cluster analysis and define precisely your metrics and merging criteria. Or you could also look into locality sensitive hashing. Basically, you need to tell us more about your merging decision making. Commented Nov 4, 2022 at 20:54
  • My bad, adjusted. Commented Nov 4, 2022 at 23:17
  • Oh, but now you've adjusted the question such that all examples show all the rows collapsed into one. Is that desired? If not, let me ask again: in which cases do you merge rows and when do you not? A precise definition is essential to understand the task. In addition think about the case of "chaining" (elongated clusters) where rows 1 and 2 would be merged, as well as 2 and 3, etc., but not 1 and 3. Then do you not merge some of them? Make several clusters? Commented Nov 5, 2022 at 16:04

2 Answers 2

2

You can use something like this:

df = df.groupby(['Two'])['One','Three'].apply(''.join).reset_index()

If you can provide a small bit of code that creates the first df it'd be easier to try out solutions.

Also this other post may help: pandas - Merge nearly duplicate rows based on column value

EDIT:

Does this get you the output you're looking for?

joined_df = df.apply(''.join, axis=0)

variation of this: Concatenate all columns in a pandas dataframe

Sign up to request clarification or add additional context in comments.

3 Comments

why not copy the table and use pd.read_clipboard()
I've edited the first post to contain more samples from the dataframe to illustrate better.
I did not know that was a thing, @Chris, very helpful!
1

If your data are all characters you can end up with this solution and collapse everything to one single row:

import pandas as pd

data = pd.read_csv("path/to/data")
collapsed = data.astype(str).sum().applymap(lambda x: ''.join(set(x)))

Check this answer on how to get unique characters in a string.

2 Comments

It's a good try but doesn't seem to be working for me. Each cell can hold maximum 3 unique characters.
Why cells can hold maximum 3 unique characters? It is a requirement from your use case? If you have more than 3 unique characters in your whole data, how to choose then the merging strategy?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.