Ignore duplicated values in pandas

Question

I'm trying to implement a simple voting score in a csv file using pandas. Basically, if dataframe['C'] == 'Active' and dataframe['Count'] == 0, then dataframe['Combo'] should be 0. If dataframe['C'] == 'Active' and dataframe['Count'] == 1, then dataframe['Combo'] should be 1. If dataframe['C'] == 'Active' and dataframe['Count'] == 2, then dataframe['Combo'] should be 2, and so on.

This is my dataframe:

A        B          C           Count Combo
Ptn1    Lig1        Inactive    0      
Ptn1    Lig1        Inactive    1      
Ptn1    Lig1        Active      2      2
Ptn2    Lig2        Active      0      0
Ptn2    Lig2        Inactive    1       
Ptn3    Lig3        Active      0      0
Ptn3    Lig3        Inactive    1       
Ptn3    Lig3        Inactive    2       
Ptn3    Lig3        Inactive    3      
Ptn3    Lig3        Active      4      3

This is my code so far for clarity:

import pandas as pd
df = pd.read_csv('affinity.csv')
VOTE = 0
df['Combo'] = ''
df.loc[(df['Classification'] == 'Active') & (df['Count'] == 0), 'Combo'] = VOTE
df.loc[(df['Classification'] == 'Active') & (df['Count'] == 1), 'Combo'] = VOTE + 1
df.loc[(df['Classification'] == 'Active') & (df['Count'] == 2), 'Combo'] = VOTE + 2
df.loc[(df['Classification'] == 'Active') & (df['Count'] > 3), 'Combo'] = VOTE + 3

My code was able to do this correctly. However, there are two 'Active' values for the pair Ptn3-Lig3: one at dataframe['Count'] = 0 and another at dataframe['Count'] = 4. Is there a way to ignore the second value (i.e. consider only the smallest dataframe['Count'] value) and add the corresponding number to dataframe['Combo']? I know pandas.DataFrame.drop_duplicates() might be a way to select only unique values, but it would be really good to avoid deleting any rows.

cs95 · Accepted Answer · 2017-10-20 23:59:55Z


You could do a groupby + apply:

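import numpy as np
import pandas as pd

def foo(x):
    # True for the rows of this (A, B) group that are 'Active'
    m = x['C'].eq('Active')
    if m.any():
        # Active rows get the Count of the group's first Active row; the rest get NaN
        return pd.Series(np.where(m, x.loc[m, 'Count'].head(1), np.nan))
    else:
        # No Active row in this group: every row gets NaN
        return pd.Series([np.nan] * len(x))

df['Combo'] = df.groupby(['A', 'B'], group_keys=False).apply(foo).values
print(df)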

      A     B         C  Count Combo
0  Ptn1  Lig1  Inactive      0      
1  Ptn1  Lig1  Inactive      1      
2  Ptn1  Lig1    Active      2     2
3  Ptn2  Lig2    Active      0     0
4  Ptn2  Lig2  Inactive      1      
5  Ptn3  Lig3    Active      0     0
6  Ptn3  Lig3  Inactive      1      
7  Ptn3  Lig3  Inactive      2      
8  Ptn3  Lig3  Inactive      3      
9  Ptn3  Lig3    Active      4     0
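
To see what foo does inside one group, here is a minimal sketch on the Ptn3/Lig3 rows from the question. The variable names (group, first, combo, empty) are only illustrative. It also reproduces the broadcasting error quoted in the comments, which is what the m.any() check guards against.

```python
import numpy as np
import pandas as pd

# One (A, B) group from the sample data: Ptn3 / Lig3
group = pd.DataFrame({
    'C': ['Active', 'Inactive', 'Inactive', 'Inactive', 'Active'],
    'Count': [0, 1, 2, 3, 4],
})

m = group['C'].eq('Active')             # boolean mask, True on Active rows
first = group.loc[m, 'Count'].head(1)   # length-1 Series: Count of the first Active row

# np.where broadcasts that single value over the whole group:
# Active rows receive it, every other row receives NaN.
combo = np.where(m, first, np.nan)
print(combo)

# With no Active rows the selection is empty (shape (0,)), so it cannot be
# broadcast against a mask of shape (5,); this is why foo checks m.any() first.
none_active = pd.Series([False] * 5)
empty = group.loc[none_active, 'Count'].head(1)
try:
    np.where(none_active, empty, np.nan)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)
```

Because every Active row receives the same value, both Active rows of Ptn3/Lig3 end up with 0, as in the output above.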

Another alternative with groupby + merge:

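# Select columns with a list ([['C', 'Count']]); tuple-style selection on a groupby
# is not supported in recent pandas versions.
df = df.groupby(['A', 'B', 'C'])[['C', 'Count']]\
       .apply(lambda x: x['Count'].values[0] if x['C'].eq('Active').any() else np.nan)\
       .reset_index(name='Combo').fillna('').merge(df)
print(df)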

      A     B         C Combo  Count
0  Ptn1  Lig1    Active     2      2
1  Ptn1  Lig1  Inactive            0
2  Ptn1  Lig1  Inactive            1
3  Ptn2  Lig2    Active     0      0
4  Ptn2  Lig2  Inactive            1
5  Ptn3  Lig3    Active     0      0
6  Ptn3  Lig3    Active     0      4
7  Ptn3  Lig3  Inactive            1
8  Ptn3  Lig3  Inactive            2
9  Ptn3  Lig3  Inactive            3

Note that this ends up sorting your groups.
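
If the goal is to fill Combo only on the Active row with the smallest Count in each pair, leaving the other Active rows empty and keeping the original row order, one possible sketch (not part of the original answer) is to look up those rows with groupby + idxmin. It assumes the default unique integer index, and the names active and first_active are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'A': ['Ptn1'] * 3 + ['Ptn2'] * 2 + ['Ptn3'] * 5,
    'B': ['Lig1'] * 3 + ['Lig2'] * 2 + ['Lig3'] * 5,
    'C': ['Inactive', 'Inactive', 'Active', 'Active', 'Inactive',
          'Active', 'Inactive', 'Inactive', 'Inactive', 'Active'],
    'Count': [0, 1, 2, 0, 1, 0, 1, 2, 3, 4],
})

active = df['C'].eq('Active')

# Row label of the Active row with the smallest Count in each (A, B) pair
first_active = df[active].groupby(['A', 'B'])['Count'].idxmin().to_numpy()

# Start with an empty column, then fill only the selected rows
df['Combo'] = np.nan
df.loc[first_active, 'Combo'] = df.loc[first_active, 'Count']
print(df)
```

Pairs without any Active row never appear in the filtered frame, so their Combo stays NaN. If the score should be capped at 3 as in the question's code, the assigned values could be passed through .clip(upper=3).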

edited Oct 20, 2017 at 23:59

answered Oct 20, 2017 at 23:28

cs95


6 Comments

Marcos Santana Over a year ago

Thank you. That worked for this sample dataframe, but when I tried to apply it to the real thing it raised an error: return pd.Series(np.where(m, x.loc[m, 'Count'].head(1), '')) ValueError: operands could not be broadcast together with shapes (5,) (0,) (). Could you explain what the function is doing? I'm really new to python and pandas.

cs95 Over a year ago

@MarcosSantana See edit? I think I might've understood the problem.

Marcos Santana Over a year ago

Oh. Just saw it. Now the function is running. But I still get two values for Ptn3-Lig3 pairs. If not by that function, is there a way to change that second value to NaN or something else? Thank you again for that function!

cs95 Over a year ago

@MarcosSantana Made a small change, see if this works?

cs95 Over a year ago

@MarcosSantana Added a new method.
