
I'm trying to choose only unique rows in a numpy.ndarray (variable named cluster). When I define this variable explicitly, like here:

cluster=np.array([[0.157,-0.4778],[0.157,-0.4778],[0.157,-0.4778],[-0.06156924,-0.21786049],[-0.06156924,-0.21786049],[0.02,-0.35]])

it works as it should:

[[ 0.157      -0.4778    ]
 [-0.06156924 -0.21786049]
 [ 0.02       -0.35      ]]

But unfortunately this variable cluster is part of a bigger array (xtrans), so it can only be defined through array slicing:

splitted_clusters=[0,1,4,5,10]

cluster=xtrans[splitted_clusters]

The functions are the same, the data types are the same.

BUT!!! in the latter case it behaves quite weirdly: it may add identical rows, or it may not add them. As a result I get something like this:

    [[ 0.157      -0.4778    ]
     [ 0.157      -0.4778    ]
     [-0.06156924 -0.21786049]
     [ 0.02       -0.35      ]]

In my real example with a 44×2 array it adds 22 identical rows and misses 23 of them (the pattern is quite strange too: it adds rows with indices 0, 1, 2, 4, 9, 11, 12, 18, etc.), and the number of added identical rows varies between runs. It is supposed to add only ONE (the first) of these 44 rows.

As for the method of choosing unique rows, I first used the one from this thread: Find unique rows in numpy.array

# view each row as one opaque 'void' item so np.unique compares whole rows
b = np.ascontiguousarray(cluster).view(np.dtype((np.void, cluster.dtype.itemsize * cluster.shape[1])))
_, idx = np.unique(b, return_index=True)   # idx holds first-occurrence indices
unique_cl = cluster[idx]

Then I tried my own code to check:

unique_cl = np.array([0, 0])  # placeholder; replaced on the first iteration
for i in range(cluster.shape[0]):
    if i == 0:
        unique_cl = np.vstack([cluster[i, :]])             # start with the first row
    elif cluster[i, :].tolist() not in unique_cl.tolist():
        unique_cl = np.vstack([unique_cl, cluster[i, :]])  # append rows not seen yet

The results are the same, and I really have no idea why. I would be very grateful for any help/advice/suggestion/idea.

The problem was in the floats. When I rounded the values of the array to 7 decimal places, everything worked as it should. Thanks to Eelco Hoogendoorn for this idea.
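A minimal sketch of the underlying issue (the actual values in xtrans are not shown in the question, so the tiny perturbation below is purely illustrative): two floats can print identically at limited precision yet not be bitwise-equal, which makes exact row-equality checks fail, while rounding normalizes them.

```python
import numpy as np

# Two values that both print as 0.1570000 but are not bitwise-equal
x = 0.157
y = 0.157 + 1e-16          # illustrative perturbation of a few ULPs

print(x == y)              # False: the values differ in the last bits
print(f"{x:.7f}", f"{y:.7f}")

# Rounding to 7 decimal places makes row-wise uniqueness behave as expected
cluster = np.array([[x, -0.4778], [y, -0.4778]])
print(len(np.unique(cluster, axis=0)))           # 2 rows: exact comparison
print(len(np.unique(cluster.round(7), axis=0)))  # 1 row after rounding
```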

6 Comments
  • Is b the same? It looks like b is the same data, but each row is viewed as a 16-byte 'void' element. That allows unique to do its flattened sort and selection. Commented Apr 2, 2016 at 20:37
  • @hpaulj I suppose yes, as there is no other b in this code. It's of type numpy.ndarray as well, but when I try to print it I see strange symbols (the unprintable raw bytes of the void view) and I don't know how to encode/decode them. Commented Apr 2, 2016 at 20:44
  • What is the shape and dtype of the b generated from xtrans[splitted_clusters]? We can't debug your problem without a sample of xtrans, or an idea of how it is transformed to produce the new b. Commented Apr 2, 2016 at 20:49
  • Could this be a floating-point precision issue? I.e., the floats look the same when printed, but are actually not bitwise-identical? Try using np.round and see if that makes a difference. Commented Apr 2, 2016 at 21:07
  • Attempting equality tests on general floating-point values is tricky. Try xtrans[i,:]==xtrans[j,:] for any two rows that you think are identical, or look at xtrans[i,:]-xtrans[j,:]. The rows might not be as identical as you think. Commented Apr 2, 2016 at 22:39

3 Answers


You can do it by converting the list to a set.

aList = [[0.157, -0.4778], [0.157, -0.4778], [-0.06156924, -0.21786049], [0.02, -0.35]]
  1. Make a list of tuples from the list of lists; otherwise you will not be able to create a set or dictionary from it.
  2. The set constructor will do the rest for you:

    set([tuple(a) for a in aList])

Output:

set([(-0.06156924, -0.21786049), (0.02, -0.35), (0.157, -0.4778)])

2 Comments

Then, of course, you can convert it back to two dimensional list
Thank you for this idea, but I need to save the original indices of the array. For example, in the first code block in my question the indices are in the variable idx.
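One way to keep the original indices, sketched here under the assumption of NumPy ≥ 1.13 (which added the axis argument to np.unique); the sample array mirrors the one in this answer:

```python
import numpy as np

cluster = np.array([[0.157, -0.4778], [0.157, -0.4778],
                    [-0.06156924, -0.21786049], [0.02, -0.35]])

# return_index yields the first-occurrence index of each unique row,
# ordered to match the (lexicographically sorted) unique rows
uniq, idx = np.unique(cluster, axis=0, return_index=True)
print(idx)

# sorting idx recovers the unique rows in their original order
print(cluster[np.sort(idx)])
```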

The numpy_indexed package (disclaimer: I am its author) implements functionality of this kind, in a manner similar to the solution you posted. Hopefully its unit tests will prove useful and things will work as expected. Could you give it a try on your dataset and see if it has the same problem?

import numpy_indexed as npi
npi.unique(cluster)
# try this as well, to see if fp representation has something to do with it
npi.unique(cluster.round(4))   

Comments


A solution to finding unique rows in your numpy array would be

In [13]: uniq_vals, counts = np.unique(cluster, axis=0, return_counts=True)

In [14]: uniq_vals
Out[14]:
array([[-0.06156924, -0.21786049],
       [ 0.02      , -0.35      ],
       [ 0.157     , -0.4778    ]])

In [15]: counts
Out[15]: array([2, 1, 3], dtype=int64)

The option return_counts allows you to obtain the counts of unique rows.

This solution is explained in Find unique rows in numpy.array
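Note that the axis argument to np.unique was added in NumPy 1.13, so this requires a reasonably recent NumPy. Given the floating-point issue that caused the question, a sketch combining this answer with rounding (the near-duplicate row below is illustrative, not from the question's data):

```python
import numpy as np

cluster = np.array([[0.157, -0.4778],
                    [0.157 + 1e-16, -0.4778],   # illustrative near-duplicate
                    [0.02, -0.35]])

# round first so rows differing only below the 7th decimal collapse together
uniq_vals, counts = np.unique(cluster.round(7), axis=0, return_counts=True)
print(uniq_vals)   # 2 unique rows
print(counts)      # the near-duplicates are counted as one row
```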

Comments
