
I have a pandas DataFrame with 21 columns. I am focusing on a subset of rows that have exactly the same values in every column except for 6 that are unique to each row. I don't know a priori which column headings these 6 values correspond to.

I tried converting each row to an Index object and performing a set operation on two rows, for example:

row1 = pd.Index(sample_data[0])
row2 = pd.Index(sample_data[1])
row1 - row2  # set difference; `row1.difference(row2)` in newer pandas

which returns an Index object containing values unique to row1. Then I can manually deduce which columns have unique values.
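
For reference, here is a minimal sketch of that attempt on made-up data (the column names and values are hypothetical); the set difference gives me the values but not the headings:

import pandas as pd

# hypothetical frame: only columns 'c' and 'd' differ between the two rows
sample_data = pd.DataFrame({'a': [1, 1], 'b': [2, 2], 'c': [3, 4], 'd': [5, 6]})

row1 = pd.Index(sample_data.iloc[0])
row2 = pd.Index(sample_data.iloc[1])
print(row1.difference(row2))  # the values unique to row1 (3 and 5), but no column labels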

How can I programmatically grab the column headings that these values correspond to in the initial dataframe? Or is there a way to compare two or more dataframe rows and extract the 6 differing column values of each row, along with the corresponding headings? Ideally, it would be nice to generate a new dataframe containing just those columns.

In particular, is there a way to do this using set operations?

Thank you.

  • So there's a group of rows which are 15-in-common, 6-different, and also other rows which don't follow this pattern? [IOW, do we have to detect this "subset of rows", or is that already done?] Commented May 14, 2013 at 0:56
  • Can you post a couple of sample rows? Commented May 14, 2013 at 2:02

3 Answers


Here's a quick solution to return only the columns in which the first two rows differ.

In [13]: df = pd.DataFrame(list(zip(*[range(5), list('abcde'), list('aaaaa'),
...                                    list('bbbbb')])), columns=list('ABCD'))

In [14]: df
Out[14]: 
   A  B  C  D
0  0  a  a  b
1  1  b  a  b
2  2  c  a  b
3  3  d  a  b
4  4  e  a  b

In [15]: df[df.columns[df.iloc[0] != df.iloc[1]]]
Out[15]: 
   A  B
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e

And a solution to find all columns with more than one unique value throughout the entire frame.

In [33]: df[df.columns[df.apply(lambda s: len(s.unique()) > 1)]]
Out[33]: 
   A  B
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e
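
If you are on a newer pandas, the same column filter can be written with nunique, which counts distinct values per column (a sketch, not part of the original answer):

df.loc[:, df.nunique() > 1]  # same result as Out[33] above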

You don't really need the Index; you can just compare two rows and use the result to filter the columns with a list comprehension.

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": np.ones(10), "col2": np.ones(10), "col3": range(2, 12)})
row1 = df.iloc[0]  # `df.irow(0)` in older pandas
row2 = df.iloc[1]
unique_columns = row1 != row2
cols = [colname for colname, unique_column in zip(df.columns, unique_columns) if unique_column]
print(cols)  # ['col3']

If you know the standard value for each column, you can compare every row against it and convert each row to a list of booleans, e.g.:

standard_row = np.ones(3)  # the expected ("standard") value for each of the 3 columns
columns = df.columns
# boolean frame: True wherever a cell differs from the standard value
unique_columns = df.apply(lambda x: x != standard_row, axis=1)
# per row, list the column names whose values differ from the standard
unique_columns.apply(lambda x: [col for col, unique_column in zip(columns, x) if unique_column], axis=1)
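
As an aside, if the goal is the new dataframe the question asks for, the boolean mask from the first snippet can be used directly for column selection (a sketch, reusing row1 and row2 from above):

df.loc[:, row1 != row2]  # frame restricted to the columns where the two rows differ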


Further to @jeff-tratner's answer

  1. produce a boolean mask marking where the cells of two rows differ (the rows are selected in this case by their index positions):

    uq = di2.iloc[0] != di2.iloc[1]

  2. get the list of columns whose cells differ:

    uq[uq==True].index.to_list()

Or get the list of columns whose cells are identical:

uq[uq!=True].index.to_list()
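
Putting the pieces together (a sketch on made-up data; di2 here is a hypothetical frame, not from the original answer):

import pandas as pd

# hypothetical frame: only columns 'C' and 'D' differ between the first two rows
di2 = pd.DataFrame({'A': [1, 1], 'B': ['x', 'x'], 'C': [10, 20], 'D': ['p', 'q']})

uq = di2.iloc[0] != di2.iloc[1]             # True where the two rows differ
diff_cols = uq[uq == True].index.to_list()
print(diff_cols)       # ['C', 'D']
print(di2[diff_cols])  # new frame containing only the differing columns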
