
I have a performance problem: I need to replace the values in a list of NumPy arrays using a dictionary lookup.

Let's say this is my dictionary:

# Create a sample dictionary

keys = [1, 2, 3, 4]
values = [5, 6, 7, 8]
dictionary = dict(zip(keys, values))

And this is my list of arrays:

import numpy as np

# List of arrays
listvalues = []

arr1 = np.array([1, 3, 2])
arr2 = np.array([1, 1, 2, 4])
arr3 = np.array([4, 3, 2])

listvalues.append(arr1)
listvalues.append(arr2)
listvalues.append(arr3)

>>> listvalues
[array([1, 3, 2]), array([1, 1, 2, 4]), array([4, 3, 2])]

I then use the following function to replace all values in a nD numpy array using a dictionary:

# Replace function

def replace(arr, rep_dict):
    # Split the dict into aligned arrays of keys and values, sorted by key
    rep_keys, rep_vals = np.array(list(zip(*sorted(rep_dict.items()))))
    # Find the index of each element's key (assumes every element is a key)
    idces = np.digitize(arr, rep_keys, right=True)
    return rep_vals[idces]
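For instance, here is a self-contained sketch (using the sample dictionary from above) showing what replace does to the first array:

```python
import numpy as np

dictionary = {1: 5, 2: 6, 3: 7, 4: 8}

def replace(arr, rep_dict):
    # Split the dict into aligned arrays of keys and values, sorted by key
    rep_keys, rep_vals = np.array(list(zip(*sorted(rep_dict.items()))))
    # Find the index of each element's key (assumes every element is a key)
    idces = np.digitize(arr, rep_keys, right=True)
    return rep_vals[idces]

print(replace(np.array([1, 3, 2]), dictionary))  # [5 7 6]
```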

This function is really fast, however I need to iterate over my list of arrays to apply this function to each array:

replaced = []
for arr in listvalues:
    replaced.append(replace(arr, dictionary))

This is the bottleneck of the process, as it needs to iterate over thousands of arrays.

How could I achieve the same result without the for-loop? It is important that the result is in the same format as the input: a list of arrays with replaced values.

Many thanks guys!!

  • To be clear: listvalues is a very long sequence of many very short arrays of variable length? Is there a natural upper bound to the length of these short arrays? Commented Jul 21, 2016 at 15:20
  • Basically yes, the arrays are quite short but there is no natural upper bound to the length of these arrays or the list. Although most arrays are no longer than length 20. Hope this helps! Commented Jul 21, 2016 at 15:23
  • Your bottleneck is the replace function and not the loop. Nothing significant is happening in your loop so you either need to improve the performance of replace or parallelize the loop. Commented Jul 21, 2016 at 15:33
  • @sirfz : nope, read the comments Commented Jul 21, 2016 at 15:47
  • @sirfz the replace function is not the problem, it's really fast! I can do 10 loops in under a second. The problem is I have to do thousands of them. Parallelizing is not possible as I already run the whole script multi-threaded. Commented Jul 21, 2016 at 15:56

1 Answer


This will do the trick efficiently, using the numpy_indexed package. It can be further simplified if all values in listvalues are guaranteed to be present in keys; but I'll leave that as an exercise to the reader.

import numpy as np
import numpy_indexed as npi

# Concatenate all arrays, remap in one vectorized pass, then split back
arr = np.concatenate(listvalues)
idx = npi.indices(keys, arr, missing='mask')
remap = np.logical_not(idx.mask)
arr[remap] = np.array(values)[idx[remap]]
replaced = np.array_split(arr, np.cumsum([len(a) for a in listvalues][:-1]))
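If numpy_indexed isn't available, the same concatenate/remap/split idea can be sketched with plain NumPy using np.searchsorted instead of npi.indices. This is a sketch of the technique under the assumption of integer keys, not the answer's exact code:

```python
import numpy as np

keys = np.array([1, 2, 3, 4])
values = np.array([5, 6, 7, 8])
listvalues = [np.array([1, 3, 2]), np.array([1, 1, 2, 4]), np.array([4, 3, 2])]

# Concatenate all arrays, remap in one vectorized pass, then split back
arr = np.concatenate(listvalues)
order = np.argsort(keys)
idx = np.searchsorted(keys, arr, sorter=order)
# searchsorted can return len(keys) for values above the largest key
idx = np.clip(idx, 0, len(keys) - 1)
# Only remap positions whose value actually matches a key
found = keys[order[idx]] == arr
arr[found] = values[order[idx[found]]]
replaced = np.array_split(arr, np.cumsum([len(a) for a in listvalues])[:-1])
```

The masking step plays the same role as missing='mask' in the numpy_indexed version: values absent from keys are left untouched rather than raising an error.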

5 Comments

Awesome, thanks! This replaces all the values in just a few seconds :). However, I do get a warning when doing arr[remap] = ... on my whole dataset: DeprecationWarning: assignment will raise an error in the future, most likely because your index result shape does not match the value array shape. Thanks again!
Sorry, I didn't get that error, nor am I familiar with it; I don't right away have a clue what causes it. What versions are you using?
Python 2.7.x. Thanks for letting me know. I will do some more testing tomorrow.
Turned out to be an error in my code; it's working perfectly now. Thanks again! :)
Glad to hear; is there an explicit reason not to accept this as an answer?
