
I have a pandas DataFrame with 21 columns. I am focusing on a subset of rows that have exactly the same values in every column except for 6 that are unique to each row. I don't know a priori which column headings these 6 values correspond to.

I tried converting each row to an Index object and performing a set operation on two rows, for example:

row1 = pd.Index(sample_data[0])
row2 = pd.Index(sample_data[1])
row1 - row2  # set difference; `row1.difference(row2)` in newer pandas

which returns an Index object containing values unique to row1. Then I can manually deduce which columns have unique values.
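
For reference, here is a minimal sketch of that attempt on made-up data (the column names and values are hypothetical); the set difference gives me the values but not the headings:

import pandas as pd

# hypothetical frame: only columns 'c' and 'd' differ between the two rows
sample_data = pd.DataFrame({'a': [1, 1], 'b': [2, 2], 'c': [3, 4], 'd': [5, 6]})

row1 = pd.Index(sample_data.iloc[0])
row2 = pd.Index(sample_data.iloc[1])
print(row1.difference(row2))  # the values unique to row1 (3 and 5), but no column labels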

How can I programmatically grab the column headings that these values correspond to in the initial dataframe? Or is there a way to compare two or more dataframe rows and extract the 6 differing column values of each row, along with the corresponding headings? Ideally, it would be nice to generate a new dataframe containing just those columns.

In particular, is there a way to do this using set operations?

Thank you.

  • So there's a group of rows which are 15-in-common, 6-different, and also other rows which don't follow this pattern? [IOW, do we have to detect this "subset of rows", or is that already done?] Commented May 14, 2013 at 0:56
  • Can you post a couple of sample rows? Commented May 14, 2013 at 2:02

3 Answers


Here's a quick solution to return only the columns in which the first two rows differ.

In [13]: df = pd.DataFrame(list(zip(*[range(5), list('abcde'), list('aaaaa'),
...                                    list('bbbbb')])), columns=list('ABCD'))

In [14]: df
Out[14]: 
   A  B  C  D
0  0  a  a  b
1  1  b  a  b
2  2  c  a  b
3  3  d  a  b
4  4  e  a  b

In [15]: df[df.columns[df.iloc[0] != df.iloc[1]]]
Out[15]: 
   A  B
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e

And a solution to find all columns with more than one unique value throughout the entire frame.

In [33]: df[df.columns[df.apply(lambda s: len(s.unique()) > 1)]]
Out[33]: 
   A  B
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e
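
If you are on a newer pandas, the same column filter can be written with nunique, which counts distinct values per column (a sketch, not part of the original answer):

df.loc[:, df.nunique() > 1]  # same result as Out[33] above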

You don't really need the Index; you can just compare two rows and use the result to filter the columns with a list comprehension.

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": np.ones(10), "col2": np.ones(10), "col3": range(2, 12)})
row1 = df.iloc[0]  # `df.irow(0)` in older pandas
row2 = df.iloc[1]
unique_columns = row1 != row2
cols = [colname for colname, unique_column in zip(df.columns, unique_columns) if unique_column]
print(cols)  # ['col3']

If you know the standard value for each column, you can compare every row against it and convert each row to a list of booleans, e.g.:

standard_row = np.ones(3)  # the expected ("standard") value for each of the 3 columns
columns = df.columns
# boolean frame: True wherever a cell differs from the standard value
unique_columns = df.apply(lambda x: x != standard_row, axis=1)
# per row, list the column names whose values differ from the standard
unique_columns.apply(lambda x: [col for col, unique_column in zip(columns, x) if unique_column], axis=1)
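
As an aside, if the goal is the new dataframe the question asks for, the boolean mask from the first snippet can be used directly for column selection (a sketch, reusing row1 and row2 from above):

df.loc[:, row1 != row2]  # frame restricted to the columns where the two rows differ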


Further to @jeff-tratner's answer

  1. produce a boolean mask marking where the cells of two rows differ (the rows are selected in this case by their index positions):

    uq = di2.iloc[0] != di2.iloc[1]

  2. get the list of columns whose cells differ:

    uq[uq==True].index.to_list()

Or get the list of columns whose cells are identical:

uq[uq!=True].index.to_list()
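
Putting the pieces together (a sketch on made-up data; di2 here is a hypothetical frame, not from the original answer):

import pandas as pd

# hypothetical frame: only columns 'C' and 'D' differ between the first two rows
di2 = pd.DataFrame({'A': [1, 1], 'B': ['x', 'x'], 'C': [10, 20], 'D': ['p', 'q']})

uq = di2.iloc[0] != di2.iloc[1]             # True where the two rows differ
diff_cols = uq[uq == True].index.to_list()
print(diff_cols)       # ['C', 'D']
print(di2[diff_cols])  # new frame containing only the differing columns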
