
I have a problem with removing duplicates. My program is based around a loop which generates tuples (x, y) that are then used as nodes in a graph. The final array/matrix of nodes is:

[[ 1.          1.        ]
[ 1.12273268  1.15322175]
[..........etc..........]
[ 0.94120695  0.77802849]
**[ 0.84301344  0.91660517]**
[ 0.93096269  1.21383287]
**[ 0.84301344  0.91660517]**
[ 0.75506418  1.0798641 ]]

The length of the array is 22. Now, I need to remove the duplicate entries (see **). So I used:

import pandas

def urows(array):
    df = pandas.DataFrame(array)
    # take_last=True keeps the last of each set of duplicate rows
    # (drop_duplicates returns a new frame; it does not work in place)
    return df.drop_duplicates(take_last=True).values

Fantastic, but I still get:

           0         1
0   1.000000  1.000000
....... etc...........
17  1.039400  1.030320
18  0.941207  0.778028
**19  0.843013  0.916605**
20  0.930963  1.213833
**21  0.843013  0.916605**

So drop_duplicates is not removing anything. I tested to see if the nodes were actually the same, and I get:

print urows(total_nodes)[19,:]
---> [ 0.84301344  0.91660517]
print urows(total_nodes)[21,:]
---> [ 0.84301344  0.91660517]
print urows(total_nodes)[12,:] - urows(total_nodes)[13,:]
---> [ 0.  0.]

Why is it not working? How can I remove those duplicate values?

One more question:

Say two values x1 and x2 are "nearly" equal. Is there any way to replace one with the other so that they become exactly equal? What I want is to replace x2 with x1 if they are "nearly" equal.

  • drop_duplicates does preserve order, I don't understand what you're asking... is it possible to simplify this question down? Commented May 2, 2013 at 10:59
  • Thank you. I completely edited and reformulated the question. I realised I was asking the wrong thing in the wrong way. Commented May 2, 2013 at 15:41
  • I don't know pandas, but is it possible that a) the entries are different at a later decimal place, or b) they are two different lists (that happen to have the same entries) that are compared for object identity? If neither of these is the case, just ignore my comment... Commented May 2, 2013 at 15:50

2 Answers


If I copy-paste in your data, I get:

>>> df
          0         1
0  1.000000  1.000000
1  1.122733  1.153222
2  0.941207  0.778028
3  0.843013  0.916605
4  0.930963  1.213833
5  0.843013  0.916605
6  0.755064  1.079864

>>> df.drop_duplicates() 
          0         1
0  1.000000  1.000000
1  1.122733  1.153222
2  0.941207  0.778028
3  0.843013  0.916605
4  0.930963  1.213833
6  0.755064  1.079864

so it is actually removed, and your problem is that the arrays aren't exactly equal (though their difference rounds to 0 for display).
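
To see how that can happen, here is a minimal sketch (the values are hypothetical; the point is that numpy's default display precision hides small differences):

>>> import numpy as np
>>> a = np.array([0.84301344, 0.91660517])
>>> b = a + 1e-12    # differs only past the 8 digits numpy displays
>>> a
array([ 0.84301344,  0.91660517])
>>> b
array([ 0.84301344,  0.91660517])
>>> (a == b).all()   # not exactly equal, so drop_duplicates keeps both
False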

One workaround would be to round the data to however many decimal places are applicable with something like df.apply(np.round, args=[4]), then drop the duplicates. If you want to keep the original data but remove rows that are duplicate up to rounding, you can use something like

df = df.ix[~df.apply(np.round, args=[4]).duplicated()]
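
(A note for newer pandas, where .ix has been removed: the same idea can be written with .loc and DataFrame.round — a sketch, not part of the original answer:)

df = df.loc[~df.round(4).duplicated()]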

Here's one really clumsy way to do what you're asking for with setting nearly-equal values to be actually equal:

grouped = df.groupby([df[i].round(4) for i in df.columns])
subbed = grouped.apply(lambda g: g.apply(lambda row: g.irow(0), axis=1))
subbed.reset_index(level=list(df.columns), drop=True, inplace=True)

This reorders the dataframe, but you can then call .sort() to get them back in the original order if you need that.

Explanation: the first line uses groupby to group the data frame by the rounded values. Unfortunately, if you give a function to groupby it applies it to the labels rather than the rows (so you could maybe do df.groupby(lambda k: np.round(df.ix[k], 4)), but that sucks too).

The second line uses the apply method on groupby to replace the dataframe of near-duplicate rows, g, with a new dataframe g.apply(lambda row: g.irow(0), axis=1). That uses the apply method on dataframes to replace each row with the first row of the group.

The result then looks like

                        0         1
0      1                           
0.7551 1.0799 6  0.755064  1.079864
0.8430 0.9166 3  0.843013  0.916605
              5  0.843013  0.916605
0.9310 1.2138 4  0.930963  1.213833
0.9412 0.7780 2  0.941207  0.778028
1.0000 1.0000 0  1.000000  1.000000
1.1227 1.1532 1  1.122733  1.153222

where groupby has inserted the rounded values as an index. The reset_index line then drops those columns.

Hopefully someone who knows pandas better than I do will drop by and show how to do this better.
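
One tidier possibility, at least on newer pandas (a sketch, not part of the original answer; it assumes groupby/transform behavior that postdates this thread):

rounded = df.round(4)
# snap each row to the first row of its rounded group, keeping the
# original order and index; the rows then become exact duplicates
snapped = df.groupby([rounded[c] for c in df.columns], sort=False).transform('first')
deduped = snapped.drop_duplicates()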


4 Comments

Thank you for your answer! I have another question which came to my mind while trying your answer. Is it possible, say if x1 and x2 are not exactly equal, to then change x2 to x1?
Do you mean you want to take df and change it so that the almost-duplicated things aren't removed but are changed so they are actually duplicated? I'm not sure how to do that offhand other than something gross with groupby.
Yes, yes! I have hideous rounding problems. I am using this to generate nodes in a graph: if x1 and x2 are not exactly equal, networkx recognizes them as different nodes; if x1 = x2, I get a recombinant tree, which is what I want. I can implement this with a simple if, but the running time is O(N^2), which ruins everything. Maybe I should post it as a new question...
Further explanation: my final goal is to set x1 = x2 so that they are exactly equal; this generates 1 node (instead of the 2 caused by the rounding mistake). Next step: remove the duplicates, and run the code again to generate the next step of the graph.

Similar to @Dougal's answer, but in a slightly different way:

In [20]: df.ix[~(df*1e6).astype('int64').duplicated(cols=[0])]
Out[20]: 
          0         1
0  1.000000  1.000000
1  1.122733  1.153222
2  0.941207  0.778028
3  0.843013  0.916605
4  0.930963  1.213833
6  0.755064  1.079864
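
(On current pandas, where .ix is gone and duplicated's cols= keyword became subset=, an equivalent sketch would be:)

df.loc[~(df * 1e6).astype('int64').duplicated(subset=[0])]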

4 Comments

Thank you for the answer! There is no rounding involved, right? You are just changing the data type, right?
@MiguelHerschberg Multiplying by a million and then casting to an int amounts to (almost) the same thing as rounding to 6 decimal places; the difference is that this always rounds down.
Ohhh, brilliant. Can I ask you another question? I have a matrix in which some values are not exactly equal due to rounding errors. I want to turn these not-exactly-equal values into duplicates. Is it possible to multiply each entry by a million, then cast to int, and in this way get the values duplicated instead of nearly equal? Thanks!
Ints can be compared as exactly equal, floats only sometimes, so your best bet is to work in ints if you need this behavior.
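
A tiny sketch of the rounds-down caveat mentioned above (the value is made up):

>>> import numpy as np
>>> x = np.array([1.2999999])
>>> (x * 1e6).astype('int64')          # cast truncates toward zero
array([1299999])
>>> np.round(x * 1e6).astype('int64')  # rounding goes to the nearest int
array([1300000])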
