python - Removing duplicate rows under conditions in Pandas

Question

I have a DataFrame like this:

  NoDemande   NoUsager  Sens  IdVehiculeUtilise  Fait  HeurePrevue  HeureDebutTrajet
0 42191000823  001208    +         246Véh         1    08:20:04     08:22:26 
1 42191000822  001208    +         246Véh         1    08:20:04     08:18:56 
2 42191000822  001208    -         246Véh        -99   09:05:03     08:56:26 
3 42191000823  001208    -         246Véh         1    09:05:03     08:56:26 
4 42191000834  001208    +         246Véh         1    16:50:04     16:39:26 
5 42191000834  001208    -         246Véh         1    17:45:03     17:25:10 
6 42192000761  001208    +         246Véh        -1    08:20:04     08:15:07 
7 42192000762  001208    +         246Véh         1    08:20:04     08:18:27 
8 42192000762  001208    -         246Véh        -99   09:05:03     08:58:29 
9 42192000761  001208    -         246Véh        -11   09:05:03     08:58:29

I get this data frame from df[df.duplicated(['NoUsager','NoDemande'],keep=False)], which ensures that my rows come in pairs. I want to drop a pair of rows when the NoDemande values are consecutive numbers (like 42191000822 and 42191000823, or 42192000761 and 42192000762) and the HeurePrevue values are the same, which means the records were recorded twice. I have to delete one of the two pairs, and I'd like to preserve the one with more positive numbers in column Fait (at least one greater than 0).
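For reference, here is a minimal sketch of what that filter does, on a small made-up frame (the values are illustrative, not my real data). With keep=False, every row whose (NoUsager, NoDemande) combination occurs more than once is flagged, so only paired rows survive:

```python
import pandas as pd

# Made-up sample: rows 0 and 1 share NoUsager and NoDemande, rows 2 and 3 do not.
df = pd.DataFrame({
    "NoUsager": ["001208", "001208", "001208", "001209"],
    "NoDemande": [42191000822, 42191000822, 42191000834, 42191000822],
})

# keep=False marks all members of a duplicated group, not just the later ones.
paired = df[df.duplicated(["NoUsager", "NoDemande"], keep=False)]
print(paired)
```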

So my result should look like:

  NoDemande   NoUsager  Sens  IdVehiculeUtilise  Fait  HeurePrevue  HeureDebutTrajet
0 42191000823  001208    +         246Véh         1    08:20:04     08:22:26 
3 42191000823  001208    -         246Véh         1    09:05:03     08:56:26 
4 42191000834  001208    +         246Véh         1    16:50:04     16:39:26 
5 42191000834  001208    -         246Véh         1    17:45:03     17:25:10 
7 42192000762  001208    +         246Véh         1    08:20:04     08:18:27 
8 42192000762  001208    -         246Véh        -99   09:05:03     08:58:29

I know it has something to do with OR logic, but I have no idea how to implement it.

Any help will be appreciated~

because -99 is line 8, and -11 is line 9. I deleted lines 6 and 9 as a pair.
– ch36r5s, Commented Sep 5, 2016 at 5:15
6 and 9 are a pair because they have the same NoDemande; the same goes for 7 and 8. I deleted 6 and 9 because both of their Fait values are negative, while only one is negative between 7 and 8. 6 and 7, and 8 and 9, have the same HeurePrevue.
– ch36r5s, Commented Sep 5, 2016 at 5:32
You had better add that last comment to your question, because your wording is confusing.
– Zeugma, Commented Sep 5, 2016 at 5:37

PdevG · Accepted Answer · 2016-09-05 13:14:11Z

My approach to this problem was to make two columns which contain the conditions to check (same HeurePrevue and NoDemande increasing by exactly 1). Then iterate over the dataframe, dropping the pairs you do not want based on the Fait column.

The code is a bit hacky, but it seems to do the trick:

import pandas as pd

# Recreate DataFrame
df = pd.DataFrame({
    'NoDemande': [23, 22, 22, 23, 34, 34, 61, 62, 62, 61],
    'HeurePrevue': [84, 84, 93, 93, 64, 73, 84, 84, 93, 93],
    'Fait': [1, 1, -99, 1, 1, 1, -1, 1, -99, -11]
    }, columns=['NoDemande', 'Fait', 'HeurePrevue'])

# Make columns which contain conditions for inspection
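# True where a row has the same HeurePrevue as the row directly above it.
# shift() keeps the index aligned; comparing two differently labelled slices raises in current pandas.
df['sameHeure'] = df.HeurePrevue == df.HeurePrevue.shift()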
df['cont'] = df.NoDemande.diff()

# Cycle over rows
for prev_row, row in zip(df.iloc[:-1].itertuples(), df.iloc[1:].itertuples()):
    if row.sameHeure and (row.cont == 1):  # If rows are continuous and have the same Heure delete a pair
        pair_1 = df.loc[df.NoDemande == row.NoDemande]
        pair_2 = df.loc[df.NoDemande == prev_row.NoDemande]
        if sum(pair_1.Fait > 0) < sum(pair_2.Fait > 0):  # Find which pair to delete
            df.drop(pair_1.index, inplace=True)
        else:
            df.drop(pair_2.index, inplace=True)

df.drop(['cont', 'sameHeure'], axis=1, inplace=True)  # Throw away the added columns

result:

print(df)

   NoDemande  Fait  HeurePrevue
0         23     1           84
3         23     1           93
4         34     1           64
5         34     1           73
7         62     1           84
8         62   -99           93
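For larger frames, the same rule can also be written without dropping rows inside a loop over the frame. The following is only a sketch of an alternative, not the code above. It assumes that a NoDemande is recorded twice when NoDemande + 1 exists with exactly the same set of HeurePrevue values, and that each NoDemande has at most one such neighbour. The names drop_recorded_twice, positives and slots are made up for illustration. On a tie it drops the lower NoDemande, as the loop above does.

```python
import pandas as pd

df = pd.DataFrame({
    "NoDemande":   [23, 22, 22, 23, 34, 34, 61, 62, 62, 61],
    "Fait":        [1, 1, -99, 1, 1, 1, -1, 1, -99, -11],
    "HeurePrevue": [84, 84, 93, 93, 64, 73, 84, 84, 93, 93],
})

def drop_recorded_twice(frame):
    # Number of positive Fait values per NoDemande.
    positives = frame["Fait"].gt(0).groupby(frame["NoDemande"]).sum()
    # Sorted scheduled times per NoDemande, used to spot identical schedules.
    slots = {key: tuple(sorted(group))
             for key, group in frame.groupby("NoDemande")["HeurePrevue"]}

    to_drop = set()
    for demande in sorted(slots):
        twin = demande + 1
        # Only consecutive NoDemande values with the same schedule count as recorded twice.
        if twin not in slots or slots[twin] != slots[demande]:
            continue
        # Keep the NoDemande with more positive Fait values.
        if positives[twin] < positives[demande]:
            to_drop.add(twin)
        else:
            to_drop.add(demande)
    return frame[~frame["NoDemande"].isin(to_drop)]

result = drop_recorded_twice(df)
print(result)
```

On the sample data this keeps the same rows as the loop (index 0, 3, 4, 5, 7, 8). Unlike the loop, it does not depend on the duplicated rows sitting next to each other.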

vlad.rad · Accepted Answer · 2016-09-05 10:39:11Z

I see two solutions here. The first is based on the assumption that your dataset always contains consecutive pairs of entries, i.e. if an entry has a partner, the partner comes directly after it. Then you can loop over your dataframe with a step size of 2:

for i in range(0, len(df) - 1, 2):
    first, second = df.iloc[i], df.iloc[i + 1]
    # your action: compare the two entries here

And in this loop you can compare your two entries and remove the one that has a negative value.
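A minimal sketch of this first idea could look as follows. It assumes a hypothetical frame in which every entry is directly followed by its partner, which is not the case for the unsorted data in the question, and it treats "the one that has a negative value" as the entry with the smaller Fait.

```python
import pandas as pd

# Hypothetical frame: rows (0, 1) and (2, 3) are the two candidate pairs.
df = pd.DataFrame({
    "NoDemande": [22, 23, 61, 62],
    "Fait":      [-99, 1, -1, 1],
})

to_drop = []
for i in range(0, len(df) - 1, 2):
    first, second = df.iloc[i], df.iloc[i + 1]
    # Mark the entry with the smaller (negative) Fait for removal.
    loser = i if first["Fait"] < second["Fait"] else i + 1
    to_drop.append(df.index[loser])

# Drop after the loop so positions do not shift while iterating.
kept = df.drop(index=to_drop)
print(kept)
```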

My second proposal is a bit more complex.

First, copy and lag (shift by a given number of rows) all columns. This can be done with the following call (shown only for NoDemande; to do it for every column, use a loop):

df['NoDemande_lagged'] = df.NoDemande.shift(-1)

It will look like:

  NoDemande      NoDemande_lagged
0 42191000823    42191000822
1 42191000822    42191000822
2 42191000822    42191000823
3 42191000823    42191000834

Then compare the two values in the same row of the NoDemande and NoDemande_lagged columns. If the value in NoDemande_lagged is greater or smaller by 1 than the value in NoDemande, compare Fait and Fait_lagged and choose the more positive value, which you write to a new column Fait_selected. Do the same for the other columns, so that every column has a lagged copy and a selected copy. Afterwards, remove the next row, because you have already compared it with the previous one. At the end, delete the original and lagged columns and keep only the "_selected" ones.
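As a rough sketch of the lag-and-compare step only (not the full procedure), using the first four rows of the question's data. The column name adjacent and the use of add_suffix to lag all columns at once are my own illustrative choices:

```python
import pandas as pd

df = pd.DataFrame({
    "NoDemande": [42191000823, 42191000822, 42191000822, 42191000823],
    "Fait":      [1, 1, -99, 1],
})

# Lagged copy of every column: each row sees the values of the row below it.
lagged = df.shift(-1).add_suffix("_lagged")
wide = pd.concat([df, lagged], axis=1)

# True where NoDemande and NoDemande_lagged differ by exactly 1.
wide["adjacent"] = (wide["NoDemande_lagged"] - wide["NoDemande"]).abs() == 1

# For those rows, take the more positive of Fait and Fait_lagged; NaN elsewhere.
wide["Fait_selected"] = wide[["Fait", "Fait_lagged"]].max(axis=1).where(wide["adjacent"])
print(wide)
```

Note that shift introduces a missing value in the last row, so the lagged columns become floating point.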

Sorry for the complex explanation; I hope it helps you anyway. If you are familiar with RapidMiner, I can explain how to do this there, where it will be easier. In any case, I have given you some ideas for various concepts that can help you solve your problem.

Khris · Accepted Answer · 2016-09-05 12:54:46Z

This is a long-winded solution; there might be shorter ones. frame0 is the exact frame you posted above.

First take the data, sort it by NoDemande, split it and recombine it so that both rows of a pair end up in the same row. This makes things a lot easier:

import pandas as pd

frame0.HeurePrevue = pd.to_datetime(frame0.HeurePrevue)
frame0 = frame0.sort_values('NoDemande').reset_index(drop=True)
frameA = frame0.iloc[::2].reset_index(drop=True)
frameB = frame0.iloc[1::2].reset_index(drop=True)
frame1 = pd.concat([frameA,frameB],axis=1,join='inner')
frame1.columns = [u'NoDemande1', u'NoUsager1', u'Sens1', u'IdVehiculeUtilise1', u'Fait1',\
                  u'HeurePrevue1', u'HeureDebutTrajet1', u'NoDemande2', u'NoUsager2', u'Sens2',\
                  u'IdVehiculeUtilise2', u'Fait2', u'HeurePrevue2', u'HeureDebutTrajet2']
frame1 = frame1[[u'NoDemande1', u'Fait1',u'HeurePrevue1', u'NoDemande2',u'Fait2',\
                 u'HeurePrevue2']]

Next, compare each row with the row ABOVE it to see whether that row is its duplicate:

frame2 = frame1[['NoDemande1','NoDemande2','HeurePrevue1','HeurePrevue2']].diff()
frame2['lastColumnsPartner'] = (frame2.NoDemande1 == 1) & (frame2.NoDemande2 == 1) &\
                               (frame2.HeurePrevue1 == pd.Timedelta(0)) &\
                               (frame2.HeurePrevue2 == pd.Timedelta(0))
frame2 = frame2['lastColumnsPartner'].to_frame()
frame1 = pd.merge(frame1,frame2,left_index=True,right_index=True)

Now check the values of Fait:

frame1['Fait1Pos'] = 0
frame1['Fait2Pos'] = 0
# .loc replaces .ix, which was removed from pandas
frame1.loc[frame1.Fait1>0,'Fait1Pos'] = 1
frame1.loc[frame1.Fait2>0,'Fait2Pos'] = 1
frame1['FaitPos'] = frame1.Fait1Pos+frame1.Fait2Pos
frame1['FaitBool'] = (frame1.Fait1 > 0)|(frame1.Fait2 > 0)

Iterate over all rows and use the boolean lastColumnsPartner to create a new index which identifies duplicate rows:

frame1['newIndex'] = 0
j = -1
for i,row in frame1.iterrows():
  # Start a new group whenever the row above is not this row's partner
  if not frame1.loc[i,'lastColumnsPartner']:
    j+=1
  frame1.loc[i,'newIndex'] = j
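The group counter built by this loop can also be computed with a cumulative sum over the negated flag. Here is a small sketch on a made-up lastColumnsPartner column (the flag values are illustrative) that checks both versions agree:

```python
import pandas as pd

# Made-up flags: True means "the row above is this row's partner".
frame1 = pd.DataFrame({"lastColumnsPartner": [False, True, False, False, True]})

# Loop version, as in the answer.
loop_index = []
j = -1
for flag in frame1["lastColumnsPartner"]:
    if not flag:
        j += 1
    loop_index.append(j)

# Vectorized version: every False starts a new group, so count the Falses seen so far.
frame1["newIndex"] = (~frame1["lastColumnsPartner"]).cumsum() - 1
print(frame1)
```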

Take only rows with at least one positive value in Fait (FaitBool), sort by number of positive values of Fait (FaitPos), drop duplicates (newIndex) to keep only the highest value of Fait, then return NoDemande.

tokeep = frame1[frame1.FaitBool][['NoDemande1','newIndex','FaitPos']]\
 .sort_values('FaitPos',ascending=False).drop_duplicates('newIndex')['NoDemande1']

Finally use boolean indexing on the initial frame to filter everything.

frame0 = frame0[frame0.NoDemande.isin(tokeep)]

I can't say for sure that it works for all cases, but it works for your example. There is probably room for improvement as well.
