Pandas dataframe compare multiple rows with specific condition

Question

I hope I can explain my problem correctly. I have a dataframe (306x40) in which multiple rows hold data for the same group. I need to group them by index, which is no problem. Next, within each group, I need to compare the rows with another row that has a specific condition. Take this data as an example:

Id  Condition   var1   var2   var3
1      1         0      1      0
1      3         1      1      0
2      2         0      0      1
2      3         0      0      1
2      1         0      0      1

So I need to compare var1, var2, var3 per Id with the row that has condition 3. The outcome should tell me whether any of the variables differ from the variables in the condition 3 row. Possible output:

Id  Condition   var1   var2   var3  count_false
1      1        false  true   true      1
1      3        NaN    NaN    NaN       NaN
2      2        true   true   true      0
2      3        NaN    NaN    NaN       NaN
2      1        true   true   true      0

Or simply a message saying that condition x is not the same as condition 3 for Id y.

I hope you guys understand what I need, but I'm happy to elaborate.

Thank you so much in advance!

edit to make it more clear:

The data comes from 3 different methods to detect bacteria in patients (clinical data). Note that it differs per patient which method was used and how many times. So I have methods 1, 2 and 3; these are the different conditions. The variables are the different kinds of bacteria found. Methods 1 and 2 are the gold standard and method 3 is the one that needs to be validated. So I want to see if method 3 gives the same result as method 1 or 2. The 30+ variables are the bacteria; a value of 1 indicates the bacterium is present and 0 that it is not.

I've read this about three times now and I still don't understand what you want to do. Why are some rows replaced with NaN?
– Pierre D, Commented Dec 28, 2020 at 17:41
Thank you for taking the time. I edited my post to hopefully make it clearer. I don't know why I put NaN there; it was just to show that the variables from methods 1 and 2 need to be compared with those from method 3.
– Ward Nieboer, Commented Dec 28, 2020 at 18:45
So you mean that each row's var3 can be computed just from var1 and var2 of that row, i.e. regardless of other rows? Why the groupby then?
– Pierre D, Commented Dec 28, 2020 at 18:58
Not exactly, sorry for the lack of clarity. Within one patient, multiple methods are used to find bacteria. In the example df I named that column Condition, with value 1 (method 1), 2 (method 2) or 3 (method 3). With these different methods we found many different bacteria; in the example df I called those columns var1, var2, var3, etc., where 0 means not present and 1 means present. I need to check whether method 3 resulted in the same findings as methods 1 and 2, so whether var1, var2, var3 are the same for method 3 as for method 1 or 2. In the example df: are the values for var1, var2, var3 in the second row the same as in the first row?
– Ward Nieboer, Commented Dec 28, 2020 at 22:18
And for patient 2, are var1, var2, var3 different in rows 3 and 5 compared to row 4 (method 3 used)?
– Ward Nieboer, Commented Dec 28, 2020 at 22:19

Pierre D · Accepted Answer · 2020-12-28 19:42:01Z

If I understand correctly, this is just about counting, row by row, how many of var1 .. var2 differ from var3. There is no grouping involved.

If that's the case (happy to modify if not), then:

cols = 'var1 var2 var3'.split()  # can be extended to more variables
df['count_diff'] = (df[cols[:-1]].values != df[cols[-1]].values[:, None]).sum(axis=1)

On a modified example:

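import pandas as pd

df = pd.DataFrame({
    'Id': [1, 1, 2, 2, 2],
    'Condition': [1, 3, 2, 3, 1],
    'var1': [0, 1, 0, 1, 0],
    'var2': [1, 1, 0, 1, 0],
    'var3': [0, 0, 1, 1, 1]})

cols = 'var1 var2 var3'.split()
# compare every column except the last with the last one, row by row;
# [:, None] reshapes the last column to (n, 1) so it broadcasts across the others
df['count_diff'] = (df[cols[:-1]].values != df[cols[-1]].values[:, None]).sum(axis=1)
print(df)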

# output:
   Id  Condition  var1  var2  var3  count_diff
0   1          1     0     1     0           1
1   1          3     1     1     0           2
2   2          2     0     0     1           2
3   2          3     1     1     1           0
4   2          1     0     0     1           2
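
If the comparison is instead meant to be against the Condition 3 row of the same Id, as the comments on the question suggest, a merge-based variant is one option. This is only a minimal sketch using the question's sample data, and it assumes exactly one Condition 3 row per Id; the `_ref` suffix and the names `ref`, `same` and `result` are illustrative choices, not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 1, 2, 2, 2],
    'Condition': [1, 3, 2, 3, 1],
    'var1': [0, 1, 0, 0, 0],
    'var2': [1, 1, 0, 0, 0],
    'var3': [0, 0, 1, 1, 1]})
cols = ['var1', 'var2', 'var3']  # can be extended to more variables

# Reference rows: the Condition 3 row of each Id (assumed unique per Id).
ref = df.loc[df['Condition'] == 3, ['Id'] + cols]

# Attach the reference values to every row of the same Id.
merged = df.merge(ref, on='Id', how='left', suffixes=('', '_ref'))

# Element-wise comparison of each row with its reference values.
same = pd.DataFrame(
    merged[cols].to_numpy() == merged[[c + '_ref' for c in cols]].to_numpy(),
    columns=cols, index=df.index)

result = pd.concat([df[['Id', 'Condition']], same], axis=1)
result['count_false'] = (~same).sum(axis=1)

# Drop the reference rows themselves, since they trivially match.
result = result[result['Condition'] != 3]
print(result)
```

On the sample data, only the Condition 1 row of Id 1 differs from its reference row, and only in var1.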

Chandan · Accepted Answer · 2020-12-29 08:20:38Z

x != x.loc[i] gives you the truth value of each row when compared with your desired (condition) row. The rest of the code is below:

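import pandas as pd

df = pd.DataFrame(
    [[1, 1, 0, 1, 0], [1, 3, 1, 1, 0], [2, 2, 0, 0, 1], [2, 3, 0, 0, 1], [2, 1, 0, 0, 1]],
    columns=['ID', 'Condition', 'var1', 'var2', 'var3'])
cols = ['var1', 'var2', 'var3']  # extend with the remaining variable columns

def comp(x):
    # label of the reference row (Condition == 3) within this ID group
    i = x.index[x['Condition'] == 3][0]
    # True where a row matches the reference row, column by column
    truth = x[cols] == x.loc[i, cols]
    return pd.concat([x[['ID', 'Condition']], truth], axis=1)

grp = df.groupby('ID').apply(comp)
# count mismatches per row without hard-coding the number of variables
grp['count_false'] = (~grp[cols]).sum(axis=1)
print(grp)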
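
For the simpler output mentioned in the question (just reporting which conditions disagree with condition 3 for which ID), something along these lines may be enough. It is a sketch under the assumption that every ID has exactly one Condition 3 row; `ref`, `diff_vs_ref` and `mismatches` are hypothetical names introduced here.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 0, 1, 0], [1, 3, 1, 1, 0], [2, 2, 0, 0, 1], [2, 3, 0, 0, 1], [2, 1, 0, 0, 1]],
    columns=['ID', 'Condition', 'var1', 'var2', 'var3'])
cols = ['var1', 'var2', 'var3']

# Condition 3 values per ID, indexed by ID (assumes one such row per ID).
ref = df[df['Condition'] == 3].set_index('ID')[cols]

# Look up each row's reference values by its ID and compare position by position.
diff_vs_ref = df[cols].to_numpy() != ref.loc[df['ID']].to_numpy()

mismatches = []
for (_, row), diff in zip(df.iterrows(), diff_vs_ref):
    # Skip the reference rows and rows that agree with the reference.
    if row['Condition'] != 3 and diff.any():
        differing = [c for c, d in zip(cols, diff) if d]
        mismatches.append(
            f"ID {row['ID']}: condition {row['Condition']} "
            f"differs from condition 3 in {differing}")

print('\n'.join(mismatches))
```

On the sample data this reports a single mismatch: ID 1 under condition 1, in var1.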

answered Dec 29, 2020 at 7:35 by user3774410, edited Dec 29, 2020 at 8:20 by Chandan

1 Comment

Ward Nieboer Over a year ago

This is exactly what I need, thank you so much! The only issue is the false count, it gives huge numbers, I'm trying to fix it now.
