
I'm trying to choose only unique rows in a numpy.ndarray (variable named cluster). When I define this variable explicitly, like here:

cluster=np.array([[0.157,-0.4778],[0.157,-0.4778],[0.157,-0.4778],[-0.06156924,-0.21786049],[-0.06156924,-0.21786049],[0.02,-0.35]])

it works as it should:

[[ 0.157      -0.4778    ]
 [-0.06156924 -0.21786049]
 [ 0.02       -0.35      ]]

But unfortunately this variable cluster is part of a bigger array (xtrans), so it can only be defined through array slicing:

splitted_clusters=[0,1,4,5,10]

cluster=xtrans[splitted_clusters]

The functions are the same, the data types are the same.

BUT!!! in the latter case it behaves quite weirdly: it may add identical rows, or it may not add them. As a result I get something like this:

    [[ 0.157      -0.4778    ]
     [ 0.157      -0.4778    ]
     [-0.06156924 -0.21786049]
     [ 0.02       -0.35      ]]

In my real example with a 44×2 array it adds 22 identical rows and misses 23 of them (the pattern is quite strange too: it adds rows with indices 0, 1, 2, 4, 9, 11, 12, 18, etc.), and the number of added identical rows varies between runs. It is supposed to add only ONE (the first) of these 44 rows.

As for the method of choosing unique rows, I first used the one from this thread: Find unique rows in numpy.array

# view each row as one opaque 'void' item so np.unique compares whole rows
b = np.ascontiguousarray(cluster).view(np.dtype((np.void, cluster.dtype.itemsize * cluster.shape[1])))
_, idx = np.unique(b, return_index=True)   # idx holds first-occurrence indices
unique_cl = cluster[idx]

Then I tried my own code to check:

unique_cl = np.array([0, 0])  # placeholder; replaced on the first iteration
for i in range(cluster.shape[0]):
    if i == 0:
        unique_cl = np.vstack([cluster[i, :]])             # start with the first row
    elif cluster[i, :].tolist() not in unique_cl.tolist():
        unique_cl = np.vstack([unique_cl, cluster[i, :]])  # append rows not seen yet

The results are the same, and I really have no idea why. I would be very grateful for any help/advice/suggestion/idea.

The problem was in the floats. When I rounded the values of the array to 7 decimal places, everything worked as it should. Thanks to Eelco Hoogendoorn for this idea.
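A minimal sketch of the underlying issue (the actual values in xtrans are not shown in the question, so the tiny perturbation below is purely illustrative): two floats can print identically at limited precision yet not be bitwise-equal, which makes exact row-equality checks fail, while rounding normalizes them.

```python
import numpy as np

# Two values that both print as 0.1570000 but are not bitwise-equal
x = 0.157
y = 0.157 + 1e-16          # illustrative perturbation of a few ULPs

print(x == y)              # False: the values differ in the last bits
print(f"{x:.7f}", f"{y:.7f}")

# Rounding to 7 decimal places makes row-wise uniqueness behave as expected
cluster = np.array([[x, -0.4778], [y, -0.4778]])
print(len(np.unique(cluster, axis=0)))           # 2 rows: exact comparison
print(len(np.unique(cluster.round(7), axis=0)))  # 1 row after rounding
```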

6 Comments
  • Is b the same? It looks like b is the same data, but each row is viewed as a 16-byte 'void' element. That allows unique to do its flattened sort and selection. Commented Apr 2, 2016 at 20:37
  • @hpaulj I suppose yes, as there is no other b in this code. It's of type numpy.ndarray as well, but when I try to print it I see strange symbols (the unprintable raw bytes of the void view) and I don't know how to encode/decode them. Commented Apr 2, 2016 at 20:44
  • What is the shape and dtype of the b generated from xtrans[splitted_clusters]? We can't debug your problem without a sample of xtrans, or an idea of how it is transformed to produce the new b. Commented Apr 2, 2016 at 20:49
  • Could this be a floating-point precision issue? I.e., the floats look the same when printed, but are actually not bitwise-identical? Try using np.round and see if that makes a difference. Commented Apr 2, 2016 at 21:07
  • Attempting equality tests on general floating-point values is tricky. Try xtrans[i,:]==xtrans[j,:] for any two rows that you think are identical, or look at xtrans[i,:]-xtrans[j,:]. The rows might not be as identical as you think. Commented Apr 2, 2016 at 22:39

3 Answers


You can do it by converting the list to a set.

aList = [[0.157, -0.4778], [0.157, -0.4778], [-0.06156924, -0.21786049], [0.02, -0.35]]
  1. Make a list of tuples from the list of lists; otherwise you will not be able to create a set or dictionary from it.
  2. The set constructor will do the rest for you:

    set([tuple(a) for a in aList])

Output:

set([(-0.06156924, -0.21786049), (0.02, -0.35), (0.157, -0.4778)])

2 Comments

Then, of course, you can convert it back to two dimensional list
Thank you for this idea, but I need to save the original indices of the array. For example, in the first code block in my question the indices are in the variable idx.
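One way to keep the original indices, sketched here under the assumption of NumPy ≥ 1.13 (which added the axis argument to np.unique); the sample array mirrors the one in this answer:

```python
import numpy as np

cluster = np.array([[0.157, -0.4778], [0.157, -0.4778],
                    [-0.06156924, -0.21786049], [0.02, -0.35]])

# return_index yields the first-occurrence index of each unique row,
# ordered to match the (lexicographically sorted) unique rows
uniq, idx = np.unique(cluster, axis=0, return_index=True)
print(idx)

# sorting idx recovers the unique rows in their original order
print(cluster[np.sort(idx)])
```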

The numpy_indexed package (disclaimer: I am its author) implements functionality of this kind, in a manner similar to the solution you posted. Hopefully its unit tests will prove useful and things will work as expected. Could you give it a try on your dataset and see if it has the same problem?

import numpy_indexed as npi
npi.unique(cluster)
# try this as well, to see if fp representation has something to do with it
npi.unique(cluster.round(4))   

Comments


A solution to finding unique rows in your numpy array would be

In [13]: uniq_vals, counts = np.unique(cluster, axis=0, return_counts=True)

In [14]: uniq_vals
Out[14]:
array([[-0.06156924, -0.21786049],
       [ 0.02      , -0.35      ],
       [ 0.157     , -0.4778    ]])

In [15]: counts
Out[15]: array([2, 1, 3], dtype=int64)

The option return_counts allows you to obtain the counts of unique rows.

This solution is explained in Find unique rows in numpy.array
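Note that the axis argument to np.unique was added in NumPy 1.13, so this requires a reasonably recent NumPy. Given the floating-point issue that caused the question, a sketch combining this answer with rounding (the near-duplicate row below is illustrative, not from the question's data):

```python
import numpy as np

cluster = np.array([[0.157, -0.4778],
                    [0.157 + 1e-16, -0.4778],   # illustrative near-duplicate
                    [0.02, -0.35]])

# round first so rows differing only below the 7th decimal collapse together
uniq_vals, counts = np.unique(cluster.round(7), axis=0, return_counts=True)
print(uniq_vals)   # 2 unique rows
print(counts)      # the near-duplicates are counted as one row
```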

Comments
