Create array of index values from list with another list python

Question

I have an array of values and a second array of unique keys. For each row of the value array, I would like the index of the matching row in the key array. For example:

value_list = np.array([[2,2,3],[255,243,198],[2,2,3],[50,35,3]])
key_list = np.array([[2,2,3],[255,243,198],[50,35,3]])
MagicFunction(value_list,key_list)
#result = [[0,1,0,2]] which has the same length as value_list

The solutions I have found online are not quite what I am asking for, I believe. Any help would be appreciated! I have this brute-force code, which produces the result, but I don't even want to test it on my actual data size:

T = np.zeros((len(value_list)), dtype = np.uint32)
for i in range(len(value_list)):
    for j in range(len(key_list)):
        if sum(value_list[i] == key_list[j]) == 3:
            T[i] = j

Instead of doing sum(value_list[i] == key_list[j]) == 3, it would be better to do (value_list[i] == key_list[j]).all(). This generalizes to rows of any length, not just 3, and makes the code's intent clearer. You could also add break after T[i] = j to save yourself some time.
– Nathan, Commented Mar 3, 2019 at 2:57
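
For reference, here is a sketch of the brute-force loop with both suggestions from the comment applied. It still compares each value against keys one by one, so it only tidies up the original; it does not make it fast.

```python
import numpy as np

value_list = np.array([[2, 2, 3], [255, 243, 198], [2, 2, 3], [50, 35, 3]])
key_list = np.array([[2, 2, 3], [255, 243, 198], [50, 35, 3]])

T = np.zeros(len(value_list), dtype=np.uint32)
for i, value in enumerate(value_list):
    for j, key in enumerate(key_list):
        # .all() is True only when every component matches, whatever the row length
        if (value == key).all():
            T[i] = j
            break  # stop at the first matching key

print(T.tolist())  # [0, 1, 0, 2]
```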

Nathan · Accepted Answer · 2019-03-03 18:41:12Z

3

The issue is how to avoid this being terribly inefficient. I see two approaches:

1. Use a dictionary so that the lookups will be fast. numpy arrays are mutable, and thus not hashable, so you'll have to convert them into, e.g., tuples to use with the dictionary.
2. Use broadcasting to check value_list against every "key" in key_list in a vectorized fashion. This will at least bring the for loops out of Python, but you will still have to compare every value to every key.

I'm going to assume here too that key_list only has unique "keys".

Here's how you could do the first approach:

value_list = np.array([[2,2,3],[255,243,198],[2,2,3],[50,35,3]])
key_list = np.array([[2,2,3],[255,243,198],[50,35,3]])

key_map = {tuple(key): i for i, key in enumerate(key_list)}
result = np.array([key_map[tuple(value)] for value in value_list])
result # array([0, 1, 0, 2])
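
If you cannot guarantee that every value has a key, or that keys are unique, a slightly more defensive sketch of the same idea might look like this. The helper name index_rows and the missing=-1 sentinel are my own choices for illustration; they are not part of the question.

```python
import numpy as np

def index_rows(value_list, key_list, missing=-1):
    """Hypothetical helper: index of each value row in key_list, or `missing`."""
    key_map = {}
    for i, key in enumerate(key_list.tolist()):
        # setdefault keeps the first index if the same key appears twice
        key_map.setdefault(tuple(key), i)
    # dict.get avoids a KeyError for rows that have no matching key
    return np.array([key_map.get(tuple(v), missing) for v in value_list.tolist()])

value_list = np.array([[2, 2, 3], [255, 243, 198], [9, 9, 9], [50, 35, 3]])
key_list = np.array([[2, 2, 3], [255, 243, 198], [50, 35, 3]])
result = index_rows(value_list, key_list)
print(result.tolist())  # [0, 1, -1, 2]
```

Converting with .tolist() first means the tuples hold plain Python ints rather than numpy scalars.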

And here's the second:

result = np.where((key_list[None] == value_list[:, None]).all(axis=-1))[1]
result # array([0, 1, 0, 2])

Which way is faster might depend on the size of key_list and value_list. I would time both for arrays of typical sizes for you.
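
A minimal sketch of how you might time the two, using random uint8 data as a stand-in for real arrays. The sizes below are arbitrary assumptions, kept small so the broadcast intermediate (n_values x n_keys x 3 booleans) fits comfortably in memory.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
# Made-up test data: unique keys, and values drawn from those keys
key_list = np.unique(rng.integers(0, 256, size=(200, 3), dtype=np.uint8), axis=0)
value_list = key_list[rng.integers(0, len(key_list), size=5000)]

def with_dict():
    key_map = {tuple(key): i for i, key in enumerate(key_list)}
    return np.array([key_map[tuple(value)] for value in value_list])

def with_broadcasting():
    return np.where((key_list[None] == value_list[:, None]).all(axis=-1))[1]

# Both approaches should agree before their timings mean anything
assert np.array_equal(with_dict(), with_broadcasting())

print("dict:        ", timeit.timeit(with_dict, number=3))
print("broadcasting:", timeit.timeit(with_broadcasting, number=3))
```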

EDIT - as noted in the comments, the second solution doesn't appear to be entirely correct, but I'm not sure what makes it fail. Consider using the first solution instead.
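
The thread never pins down the cause of that error, so this is only a guess at one weak spot: np.where(...)[1] returns an entry only for matches it finds, so a value with no key silently drops out and the result no longer lines up with value_list. A sketch of a variant that flags such rows instead (the -1 sentinel is my own choice):

```python
import numpy as np

key_list = np.array([[2, 2, 3], [255, 243, 198], [50, 35, 3]])
value_list = np.array([[2, 2, 3], [9, 9, 9], [50, 35, 3]])  # middle row has no key

# matches[i, j] is True when value i equals key j; shape (n_values, n_keys)
matches = (key_list[None] == value_list[:, None]).all(axis=-1)

# np.where only reports True cells, so the unmatched row disappears
print(np.where(matches)[1].tolist())  # [0, 2]

# argmax picks the first matching key per row; rows with no match get -1
result = np.where(matches.any(axis=1), matches.argmax(axis=1), -1)
print(result.tolist())  # [0, -1, 2]
```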

edited Mar 3, 2019 at 18:41

answered Mar 3, 2019 at 2:52

Nathan


4 Comments

Robin White Over a year ago

Thank you for this, your 1st solution worked well and it was only about 4 seconds which is completely reasonable for the array sizes I am currently using: value_list.shape (1783296, 3), key_list.shape (59273, 3). Your second solution ran an error of 'bool' object has no attribute 'all'.

Nathan Over a year ago

@RobinWhite if this answers your question, you can accept it by clicking on the check mark on the left side of the answer to let others know that your problem has been resolved. If you can edit your question with a small example array where my second solution fails, I could try to see what's happening there, but the other solution will be more efficient when you need to make a lot of comparisons, so if that works, we can just leave it. I'll make an edit that the second solution doesn't always work.

Robin White Over a year ago

thanks very much for your comment, I appreciate the clarification. I am obtaining the array from a stack of images. I just ran your second code again on a single image and it worked, so I am a little confused as to why it gave me that error on the stack. Regardless, the time was about 4s for the single image compared to about 0.05s for your first solution - so I will be sticking with that one.

Robin White Over a year ago

I tried this again and your second solution is working, there must have been something I cleaned up in my code. The time is 4s for solution 1 compared with 43s for solution 2. Thank you again, I really appreciate the help

fountainhead · Accepted Answer · 2019-03-04 04:31:36Z

0

Assumptions:

Every element of value_list will be present in key_list (at some position or the other)
We are interested in the index within key_list, of only the first match

Solution:

From the two arrays, we create views of 3-tuples. We then broadcast the two views in two orthogonal directions and then check for element-wise equality on the broadcasted arrays.

import numpy as np

value_list = np.array([[2,2,3],[255,243,198],[2,2,3],[50,35,3]], dtype='uint8')
key_list   = np.array([[2,2,3],[255,243,198],[50,35,3]], dtype='uint8')

# Define a new dtype, describing a "structure" of 3 uint8's (since
# your original dtype is uint8). To the fields of this structure,
# give some arbitrary names 'first', 'sec', and 'third'
dt = np.dtype([('first', np.uint8), ('sec', np.uint8), ('third', np.uint8)])

# Now view the arrays as 1-d arrays of 3-tuples, using the dt
v_value_list = value_list.view(dtype=dt).reshape(value_list.shape[0])
v_key_list   = key_list.view(dtype=dt).reshape(key_list.shape[0])

result = np.argmax(v_key_list[:,None] == v_value_list[None,:], axis=0)
print(result)

Output:

[0 1 0 2]

Notes:

Although this is a pure numpy solution with no visible loops, it can hide inefficiencies, because it compares every element of value_list with every element of key_list, whereas a loop-based search can stop at the first successful match. Any advantage therefore depends on the size of key_list and on where in key_list the matches occur. As key_list grows, some of the numpy advantage may erode, especially if most matches fall in the early part of key_list.
The views we create are in fact numpy structured arrays, where each element of the view is a structure of three uint8s. One interesting question, which I haven't yet explored, is whether numpy compares every field when it compares one structure with another, or short-circuits at the first field that differs. Any such short-circuiting could give this structured-array solution a small additional advantage.
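
If the all-pairs comparison becomes the bottleneck, one alternative (a different technique from the structured view above, and only a sketch) is to pack each row into a single integer and look it up with a binary search. This assumes rows of exactly three uint8 values; pack_rgb is a hypothetical helper name.

```python
import numpy as np

def pack_rgb(rows):
    """Pack each row of three uint8 values into one uint32 code (hypothetical helper)."""
    rows = rows.astype(np.uint32)
    return (rows[:, 0] << 16) | (rows[:, 1] << 8) | rows[:, 2]

value_list = np.array([[2, 2, 3], [255, 243, 198], [2, 2, 3], [50, 35, 3]], dtype=np.uint8)
key_list = np.array([[2, 2, 3], [255, 243, 198], [50, 35, 3]], dtype=np.uint8)

key_codes = pack_rgb(key_list)
value_codes = pack_rgb(value_list)

order = np.argsort(key_codes, kind="stable")          # original key positions, in sorted order
pos = np.searchsorted(key_codes[order], value_codes)  # binary search for each value
pos = np.clip(pos, 0, len(order) - 1)                 # guard values larger than every key
result = order[pos]

# found is False wherever a value has no matching key
found = key_codes[result] == value_codes
print(result.tolist())  # [0, 1, 0, 2]
```

The work is roughly one sort of the keys plus one binary search per value, instead of one comparison per value-key pair.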

edited Mar 4, 2019 at 4:31

answered Mar 3, 2019 at 4:31

fountainhead


4 Comments

Robin White Over a year ago

Thank you for the suggestion. When I tried this on my actual data arrays I received an error: 'When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.' I'm using numpy version 1.15.4. I'm not quite sure what this is telling me. My arrays are value_list.shape (1783296, 3), key_list.shape (59273, 3)

fountainhead Over a year ago

Can you let me know your value_list.dtype and key_list.dtype?

Robin White Over a year ago

they are both uint8

fountainhead Over a year ago

@RobinWhite: I have updated my answer to avoid the error you mentioned (just the part under the "Solution" heading). I reproduced the error first and then tested the fix, so it should work. Please let me know how it goes.
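
As far as I understand it, that error appears when the size of the record dtype does not evenly divide the number of bytes in one row. A sketch that sidesteps this by deriving the record dtype from the array itself, so the sizes always agree. The helper name as_records and the field names f0, f1, ... are made up for illustration.

```python
import numpy as np

def as_records(a):
    """View an (n, k) array as n structured records, one per row (hypothetical helper)."""
    a = np.ascontiguousarray(a)  # copies only if the rows are not already contiguous
    # One field per column, each with the array's own dtype, so the record
    # size is exactly the number of bytes in one row
    dt = np.dtype([(f"f{i}", a.dtype) for i in range(a.shape[1])])
    return a.view(dt).reshape(a.shape[0])

# A non-contiguous slice, e.g. three channels cut out of a wider array
pixels = np.arange(24, dtype=np.uint8).reshape(4, 6)[:, :3]
records = as_records(pixels)
print(records.shape)      # (4,)
print(records[1].item())  # (6, 7, 8)
```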
