2

I have an array of values and a second array of keys, and for each row of the first I would like to find its index in the second. For example:

value_list = np.array([[2,2,3],[255,243,198],[2,2,3],[50,35,3]])
key_list = np.array([[2,2,3],[255,243,198],[50,35,3]])
MagicFunction(value_list,key_list)
#result = [[0,1,0,2]] which has the same length as value_list

The solutions I have found online are not quite what I am asking for, I believe, so any help would be appreciated! I have this brute-force code, which produces the result, but I don't even want to test it on my actual data size:

T = np.zeros((len(value_list)), dtype = np.uint32)
for i in range(len(value_list)):
    for j in range(len(key_list)):
        if sum(value_list[i] == key_list[j]) == 3:
            T[i] = j
1
  • Instead of doing sum(value_list[i] == key_list[j]) == 3, it would be better to do (value_list[i] == key_list[j]).all(). This both generalizes to any size, not just 3, and it makes it clearer what the code's function is. You could also add break after T[i] = j to save yourself some time. Commented Mar 3, 2019 at 2:57
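A minimal sketch of the loop from the question with both suggestions from this comment applied, reusing the question's example arrays:

```python
import numpy as np

value_list = np.array([[2, 2, 3], [255, 243, 198], [2, 2, 3], [50, 35, 3]])
key_list = np.array([[2, 2, 3], [255, 243, 198], [50, 35, 3]])

T = np.zeros(len(value_list), dtype=np.uint32)
for i in range(len(value_list)):
    for j in range(len(key_list)):
        # .all() works for rows of any length, not just 3
        if (value_list[i] == key_list[j]).all():
            T[i] = j
            break  # stop scanning keys once the first match is found

print(T.tolist())  # [0, 1, 0, 2]
```

This is still a nested Python loop, so in the worst case it compares every value against every key.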

2 Answers

3

The issue is how to do this without it being terribly inefficient. I see two approaches:

  1. Use a dictionary so that the lookups will be fast. NumPy arrays are mutable, and thus not hashable, so you'll have to convert them into, e.g., tuples to use with the dictionary.

  2. Use broadcasting to check value_list against every "key" in key_list in a vectorized fashion. This will at least bring the for loops out of Python, but you will still have to compare every value to every key.

I'm going to assume here too that key_list only has unique "keys".
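If you are unsure whether that assumption holds for your data, one way to check it is np.unique with axis=0, which collapses duplicate rows. This is only a sketch; keys_are_unique is a name introduced here for illustration.

```python
import numpy as np

key_list = np.array([[2, 2, 3], [255, 243, 198], [50, 35, 3]])

# np.unique(..., axis=0) returns the distinct rows; if none were dropped,
# every key in key_list is unique
unique_keys = np.unique(key_list, axis=0)
keys_are_unique = len(unique_keys) == len(key_list)
print(keys_are_unique)  # True
```

With the dictionary approach, a repeated key would map to the index of its last occurrence in key_list.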

Here's how you could do the first approach:

value_list = np.array([[2,2,3],[255,243,198],[2,2,3],[50,35,3]])
key_list = np.array([[2,2,3],[255,243,198],[50,35,3]])

key_map = {tuple(key): i for i, key in enumerate(key_list)}
result = np.array([key_map[tuple(value)] for value in value_list])
result # array([0, 1, 0, 2])

And here's the second:

result = np.where((key_list[None] == value_list[:, None]).all(axis=-1))[1]
result # array([0, 1, 0, 2])

Which way is faster might depend on the sizes of key_list and value_list. I would time both on arrays of the sizes you typically work with.
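A rough way to run that comparison is sketched below with timeit. The random test arrays, their sizes, and the function names dict_lookup and broadcast_lookup are assumptions made for illustration; substitute your own data and sizes.

```python
import timeit
import numpy as np

# Made-up test data: unique random keys, and values drawn from those keys
rng = np.random.default_rng(0)
key_list = np.unique(rng.integers(0, 256, size=(200, 3), dtype=np.uint8), axis=0)
value_list = key_list[rng.integers(0, len(key_list), size=2000)]

def dict_lookup(value_list, key_list):
    # First approach: hash each key as a tuple, then look up each value
    key_map = {tuple(key): i for i, key in enumerate(key_list)}
    return np.array([key_map[tuple(value)] for value in value_list])

def broadcast_lookup(value_list, key_list):
    # Second approach: compare every value against every key at once
    return np.where((key_list[None] == value_list[:, None]).all(axis=-1))[1]

# Both approaches should agree before their timings are compared
assert np.array_equal(dict_lookup(value_list, key_list),
                      broadcast_lookup(value_list, key_list))

for fn in (dict_lookup, broadcast_lookup):
    seconds = timeit.timeit(lambda: fn(value_list, key_list), number=5)
    print(f"{fn.__name__}: {seconds / 5:.4f} s per call")
```

Note that the broadcast version builds an intermediate boolean array of shape (len(value_list), len(key_list), 3), so its memory use grows with the product of the two lengths.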

EDIT - as noted in the comments, the second solution doesn't appear to be entirely correct, but I'm not sure what makes it fail. Consider using the first solution instead.
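One way the second solution can return something unexpected, which may or may not be what happened in the comments: np.where reports one entry per matching (value, key) pair, so a value with no matching key, or a key that appears twice, changes the length of the result. The sketch below uses a made-up row [9, 9, 9] and adds a variant that marks misses with -1; the -1 sentinel and the variable names are assumptions introduced here for illustration.

```python
import numpy as np

key_list = np.array([[2, 2, 3], [255, 243, 198], [50, 35, 3]])
# [9, 9, 9] is a made-up row that is deliberately absent from key_list
value_list = np.array([[2, 2, 3], [9, 9, 9], [50, 35, 3]])

# matches[i, j] is True when value_list[i] equals key_list[j]
matches = (value_list[:, None] == key_list[None]).all(axis=-1)

# Original expression: one entry per match, so the missing row is silently dropped
shorter = np.where(matches)[1]
print(shorter.tolist())  # [0, 2]

# Variant: one entry per value, with -1 where no key matched
safe = np.where(matches.any(axis=1), matches.argmax(axis=1), -1)
print(safe.tolist())  # [0, -1, 2]
```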


4 Comments

Thank you for this, your 1st solution worked well and it was only about 4 seconds which is completely reasonable for the array sizes I am currently using: value_list.shape (1783296, 3), key_list.shape (59273, 3). Your second solution ran an error of 'bool' object has no attribute 'all'.
@RobinWhite if this answers your question, you can accept it by clicking on the check mark on the left side of the answer to let others know that your problem has been resolved. If you can edit your question with a small example array where my second solution fails, I could try to see what's happening there, but the other solution will be more efficient when you need to make a lot of comparisons, so if that works, we can just leave it. I'll make an edit that the second solution doesn't always work.
thanks very much for your comment, I appreciate the clarification. I am obtaining the array from a stack of images. I just ran your second code again on a single image and it worked, so I am a little confused as to why it gave me that error on the stack. Regardless, the time was about 4s for the single image compared to about 0.05s for your first solution - so I will be sticking with that one.
I tried this again and your second solution is working, there must have been something I cleaned up in my code. The time is 4s for solution 1 compared with 43s for solution 2. Thank you again, I really appreciate the help
0

Assumptions:

  1. Every element of value_list will be present in key_list (at some position or the other)
  2. We are interested in the index within key_list, of only the first match

Solution:

From the two arrays, we create views of 3-tuples. We then broadcast the two views in two orthogonal directions and then check for element-wise equality on the broadcasted arrays.

import numpy as np

value_list = np.array([[2,2,3],[255,243,198],[2,2,3],[50,35,3]], dtype='uint8')
key_list   = np.array([[2,2,3],[255,243,198],[50,35,3]], dtype='uint8')

# Define a new dtype, describing a "structure" of 3 uint8's (since
# your original dtype is uint8). To the fields of this structure,
# give some arbitrary names 'first', 'sec', and 'third'
dt = np.dtype([('first', np.uint8), ('sec', np.uint8), ('third', np.uint8)])

# Now view the arrays as 1-d arrays of 3-tuples, using the dt
v_value_list = value_list.view(dtype=dt).reshape(value_list.shape[0])
v_key_list   = key_list.view(dtype=dt).reshape(key_list.shape[0])

result = np.argmax(v_key_list[:,None] == v_value_list[None,:], axis=0)
print(result)

Output:

[0 1 0 2]

Notes:

  1. Though this is a pure numpy solution without any visible loops, it could have hidden inefficiencies, because it compares every element of value_list with every element of key_list, whereas a loop-based search can stop at the first successful match. Any advantage gained will depend on the actual size of key_list and on where in key_list the successful matches occur. As key_list grows, the numpy advantage may erode, especially if the successful matches fall mostly in the earlier part of key_list.

  2. The views that we are creating are in fact numpy structured arrays, where each element of the view is a structure of three uint8s. One interesting question, which I haven't yet explored, is whether numpy, when comparing one structure with another, compares every field or short-circuits at the first field that differs. Any such short-circuiting could give a small additional advantage to this structured array solution.
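One way around the all-pairs comparison described in note 1 is to sort the keys once and binary-search each value. This is a different technique from the structured-array comparison above, sketched under this answer's assumptions (uint8 data, every value present in key_list) plus the assumption that the keys are unique. pack_rows is a hypothetical helper, introduced here, that packs each 3-byte row into one integer.

```python
import numpy as np

value_list = np.array([[2, 2, 3], [255, 243, 198], [2, 2, 3], [50, 35, 3]], dtype=np.uint8)
key_list = np.array([[2, 2, 3], [255, 243, 198], [50, 35, 3]], dtype=np.uint8)

def pack_rows(rows):
    # Hypothetical helper: combine three uint8 columns into one uint32 per row
    rows = rows.astype(np.uint32)
    return (rows[:, 0] << 16) | (rows[:, 1] << 8) | rows[:, 2]

packed_keys = pack_rows(key_list)
packed_values = pack_rows(value_list)

# Sort the keys once, remembering where each sorted key came from
order = np.argsort(packed_keys)
sorted_keys = packed_keys[order]

# Binary search for each value; assumes every value occurs in key_list
pos = np.searchsorted(sorted_keys, packed_values)
result = order[pos]
print(result.tolist())  # [0, 1, 0, 2]
```

If a value might be missing from key_list, the positions returned by np.searchsorted would need to be validated before indexing, since they only say where the value would be inserted.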

4 Comments

Thank you for the suggestion, when I tried this on my actual data arrays I received an error: 'When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.' I'm using numpy version 1.15.4. I'm not quite sure what this is telling me. my arrays are value_list.shape (1783296, 3), key_list.shape (59273, 3)
Can you let me know your value_list.dtype and key_list.dtype?
they are both uint8
@RobinWhite: I have updated my answer accordingly to avoid the error you mentioned (just the part under the "Solution" heading). I simulated the error first and then tested the fix, so it should work. Please let me know how it goes.
