
I have a problem with removing duplicates. My program is based around a loop which generates tuples (x, y) that are then used as nodes in a graph. The final array/matrix of nodes is:

[[ 1.          1.        ]
[ 1.12273268  1.15322175]
[..........etc..........]
[ 0.94120695  0.77802849]
**[ 0.84301344  0.91660517]**
[ 0.93096269  1.21383287]
**[ 0.84301344  0.91660517]**
[ 0.75506418  1.0798641 ]]

The length of the array is 22. Now, I need to remove the duplicate entries (see **). So I used:

import pandas

def urows(array):
    df = pandas.DataFrame(array)
    # take_last=True keeps the last of each set of duplicate rows
    # (drop_duplicates returns a new frame; it does not work in place)
    return df.drop_duplicates(take_last=True).values

Fantastic, but I still get:

           0         1
0   1.000000  1.000000
....... etc...........
17  1.039400  1.030320
18  0.941207  0.778028
**19  0.843013  0.916605**
20  0.930963  1.213833
**21  0.843013  0.916605**

So drop_duplicates is not removing anything. I tested to see if the nodes were actually the same, and I get:

print urows(total_nodes)[19,:]
---> [ 0.84301344  0.91660517]
print urows(total_nodes)[21,:]
---> [ 0.84301344  0.91660517]
print urows(total_nodes)[12,:] - urows(total_nodes)[13,:]
---> [ 0.  0.]

Why is it not working? How can I remove those duplicate values?

One more question:

Say two values x1 and x2 are "nearly" equal. Is there any way to replace one with the other so that they become exactly equal? What I want is to replace x2 with x1 if they are "nearly" equal.

  • drop_duplicates does preserve order, I don't understand what you're asking... is it possible to simplify this question down? Commented May 2, 2013 at 10:59
  • Thank you. I completely edited and reformulated the question. I realised I was asking the wrong thing in the wrong way. Commented May 2, 2013 at 15:41
  • I don't know pandas, but is it possible that a) the entries are different at a later decimal place, or b) they are two different lists (that happen to have the same entries) that are compared for object identity? If neither of these is the case, just ignore my comment... Commented May 2, 2013 at 15:50

2 Answers


If I copy-paste in your data, I get:

>>> df
          0         1
0  1.000000  1.000000
1  1.122733  1.153222
2  0.941207  0.778028
3  0.843013  0.916605
4  0.930963  1.213833
5  0.843013  0.916605
6  0.755064  1.079864

>>> df.drop_duplicates() 
          0         1
0  1.000000  1.000000
1  1.122733  1.153222
2  0.941207  0.778028
3  0.843013  0.916605
4  0.930963  1.213833
6  0.755064  1.079864

so it is actually removed, and your problem is that the arrays aren't exactly equal (though their difference rounds to 0 for display).
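
To see how that can happen, here is a minimal sketch (the values are hypothetical; the point is that numpy's default display precision hides small differences):

>>> import numpy as np
>>> a = np.array([0.84301344, 0.91660517])
>>> b = a + 1e-12    # differs only past the 8 digits numpy displays
>>> a
array([ 0.84301344,  0.91660517])
>>> b
array([ 0.84301344,  0.91660517])
>>> (a == b).all()   # not exactly equal, so drop_duplicates keeps both
False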

One workaround would be to round the data to however many decimal places are applicable with something like df.apply(np.round, args=[4]), then drop the duplicates. If you want to keep the original data but remove rows that are duplicate up to rounding, you can use something like

df = df.ix[~df.apply(np.round, args=[4]).duplicated()]
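
(A note for newer pandas, where .ix has been removed: the same idea can be written with .loc and DataFrame.round — a sketch, not part of the original answer:)

df = df.loc[~df.round(4).duplicated()]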

Here's one really clumsy way to do what you're asking for with setting nearly-equal values to be actually equal:

grouped = df.groupby([df[i].round(4) for i in df.columns])
subbed = grouped.apply(lambda g: g.apply(lambda row: g.irow(0), axis=1))
subbed.reset_index(level=list(df.columns), drop=True, inplace=True)

This reorders the dataframe, but you can then call .sort() to get them back in the original order if you need that.

Explanation: the first line uses groupby to group the data frame by the rounded values. Unfortunately, if you give a function to groupby it applies it to the labels rather than the rows (so you could maybe do df.groupby(lambda k: np.round(df.ix[k], 4)), but that sucks too).

The second line uses the apply method on groupby to replace the dataframe of near-duplicate rows, g, with a new dataframe g.apply(lambda row: g.irow(0), axis=1). That uses the apply method on dataframes to replace each row with the first row of the group.

The result then looks like

                        0         1
0      1                           
0.7551 1.0799 6  0.755064  1.079864
0.8430 0.9166 3  0.843013  0.916605
              5  0.843013  0.916605
0.9310 1.2138 4  0.930963  1.213833
0.9412 0.7780 2  0.941207  0.778028
1.0000 1.0000 0  1.000000  1.000000
1.1227 1.1532 1  1.122733  1.153222

where groupby has inserted the rounded values as an index. The reset_index line then drops those columns.

Hopefully someone who knows pandas better than I do will drop by and show how to do this better.
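
One tidier possibility, at least on newer pandas (a sketch, not part of the original answer; it assumes groupby/transform behavior that postdates this thread):

rounded = df.round(4)
# snap each row to the first row of its rounded group, keeping the
# original order and index; the rows then become exact duplicates
snapped = df.groupby([rounded[c] for c in df.columns], sort=False).transform('first')
deduped = snapped.drop_duplicates()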


4 Comments

Thank you for your answer! I have another question which came to my mind while trying your answer. Is it possible, say if x1 and x2 are not exactly equal, to then change x2 to x1?
Do you mean you want to take df and change it so that the almost-duplicated things aren't removed but are changed so they are actually duplicated? I'm not sure how to do that offhand other than something gross with groupby.
Yes, yes! I have hideous rounding problems. I am using this to generate nodes in a graph: if x1 and x2 are not exactly equal, networkx recognizes them as different nodes; if x1 = x2, I get a recombinant tree, which is what I want. I can implement this with a simple if, but the running time is O(N^2), which ruins everything. Maybe I should post it as a new question...
Further explanation: my final goal is to set x1 = x2 so that they are exactly equal; this generates 1 node (instead of the 2 caused by the rounding mistake). Next step: remove the duplicates, and run the code again to generate the next step of the graph.

Similar to @Dougal's answer, but in a slightly different way:

In [20]: df.ix[~(df*1e6).astype('int64').duplicated(cols=[0])]
Out[20]: 
          0         1
0  1.000000  1.000000
1  1.122733  1.153222
2  0.941207  0.778028
3  0.843013  0.916605
4  0.930963  1.213833
6  0.755064  1.079864
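
(On current pandas, where .ix is gone and duplicated's cols= keyword became subset=, an equivalent sketch would be:)

df.loc[~(df * 1e6).astype('int64').duplicated(subset=[0])]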

4 Comments

Thank you for the answer! There is no rounding involved, right? You are just changing the data type, right?
@MiguelHerschberg Multiplying by a million and then casting to an int amounts to (almost) the same thing as rounding to 6 decimal places; the difference is that this always rounds down.
Ohhh, brilliant. Can I ask you another question? I have a matrix in which some values are not exactly equal due to rounding errors. I want to turn these not-exactly-equal values into duplicates. Is it possible to multiply each entry by a million, then cast to int, and in this way get the values duplicated instead of nearly equal? Thanks!
Ints can be compared as exactly equal, floats only sometimes, so your best bet is to work in ints if you need this behavior.
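
A tiny sketch of the rounds-down caveat mentioned above (the value is made up):

>>> import numpy as np
>>> x = np.array([1.2999999])
>>> (x * 1e6).astype('int64')          # cast truncates toward zero
array([1299999])
>>> np.round(x * 1e6).astype('int64')  # rounding goes to the nearest int
array([1300000])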
