
I have a 2 * N integer array ids representing intervals, where N is about a million. It looks like this

 0 2 1 ...
 3 4 3 ...

The ints in the array can be 0, 1, ..., M-1, where M <= 2N. (Detail: if M = 2N, the ints span all 2N values exactly once; if M < 2N, some values are repeated.)

I need to calculate a kind of inverse map from ids. By "inverse map" I mean: view each column of ids as an interval and map every point covered by the intervals to the indices of the intervals that contain it.

Intuition: the array

 0 2 1 
 3 4 3 

can be seen as

0 -> 0, 1, 2
1 -> 2, 3
2 -> 1, 2 

where the right-hand-side endpoints are excluded for my problem. The "inverse" map would be

0 -> 0
1 -> 0, 2
2 -> 0, 1, 2
3 -> 1

Code: I have a piece of Python code that attempts to calculate the inverse map into a dictionary inv:

for i in range(ids.shape[1]):
    for j in range(ids[0][i], ids[1][i]):
        inv[j].append(i)

where each inv[j] is an array-like container initialized as empty before the nested loop. Currently I initialize them with Python's built-in array module:

for i in range(M):
    inv[i] = array.array('I')

Question: The nested loop above performs terribly. In my problem setting (image processing), the outer loop has about a million iterations and the inner loop about 3000. Not only does it take a lot of memory (because inv is huge), it is also slow. I would like to focus on speed in this question. How can I accelerate this nested loop, e.g. with vectorization?
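
For reference, here is a self-contained version of this baseline on a small synthetic ids. The sizes and the random data are only for illustration; in the real problem N is about a million and the interval lengths are around 3000:

import array
import numpy as np

# Illustrative sizes only; the real problem has N ~ 1e6 and interval lengths ~ 3000.
N = 1000
starts = np.random.randint(0, 5000, size=N)
lengths = np.random.randint(1, 50, size=N)
ids = np.vstack([starts, starts + lengths])   # shape (2, N); row 0 = start, row 1 = end (exclusive)

M = int(ids.max())                            # covered points lie in 0 .. M-1 (right endpoints excluded)
inv = {}
for i in range(M):
    inv[i] = array.array('I')

# The nested loop in question: append the interval index i to the bucket of
# every point j that interval i covers.
for i in range(ids.shape[1]):
    for j in range(ids[0][i], ids[1][i]):
        inv[j].append(i)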

  • If you have a list where each interval is (0, 2*N-1) and there are N such intervals, the complexity will be O(N^2). How can you reduce the complexity further? Commented Oct 21, 2020 at 9:50
  • Seems to be the worst-case scenario. No idea how to work around this O(N^2) issue. Given that N is about a million, O(N^2) would indeed be a big issue. Commented Oct 21, 2020 at 9:53
  • Why do you need this inverse mapping, though? That should determine whether we can hack around something in this case. Commented Oct 21, 2020 at 9:55
  • @AbhinavMathur I need this inverse mapping as a kind of lookup table. Calculating the mapping is a preprocessing step, but it should not be too slow. Does this answer your question? Commented Oct 21, 2020 at 10:32
  • @AbhinavMathur The inverse lookup would allow my later applications to quickly know which intervals contain a specific point. With the example in the "intuition" part, I need to know which of the intervals cover a given point, say 2. It could be called an interval-point problem: given a set of intervals and a point, find which intervals contain that point. I could have used a range tree or segment tree for that, but I hope to build a lookup table so that later queries are very fast. Thank you for asking. Commented Oct 21, 2020 at 12:17

3 Answers


You could try the option below, in which your outer loop is hidden away inside numpy's apply_along_axis(). I'm not sure about the performance benefit; only a test at a decent scale can tell (especially as there's some initial overhead involved in converting lists to numpy arrays):

import numpy as np
import array

ids = [[0,2,1],[3,4,3]]
ids_arr = np.array(ids)   # Convert to numpy array. Expensive operation?

range_index = 0   # Initialize. To be bumped up by each invocation of my_func()

inv = {}
for i in range(np.max(ids_arr)):
    inv[i] = array.array('I')

def my_func(my_slice):
    global range_index
    for i in range(my_slice[0], my_slice[1]):
        inv[i].append(range_index)
    range_index += 1

np.apply_along_axis(my_func, 0, ids_arr)
print(inv)

Output:

{0: array('I', [0]), 1: array('I', [0, 2]), 2: array('I', [0, 1, 2]), 3: array('I', [1])}

Edit:

I feel that using a dictionary might not be a good idea here. I suspect that in this particular context, dictionary-indexing might actually be slower than numpy array indexing. Use the below lines to create and initialize inv as a numpy array of Python arrays. The rest of the code can remain as-is:

inv_len = np.max(ids_arr)
inv = np.empty(shape=(inv_len,), dtype=object)   # object dtype; each cell holds a Python array.array('I')
for i in range(inv_len):
    inv[i] = array.array('I')

(Note: this assumes that your application isn't doing dict-specific things with inv, such as inv.items() or inv.keys(). If it is, you might need an extra step to convert the numpy array into a dict.)
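
If the dict interface does turn out to be needed later, the conversion back is cheap; a one-line sketch using the inv array built above:

inv_dict = dict(enumerate(inv))   # keys 0 .. inv_len-1, values are the array('I') buckets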


8 Comments

Thank you! The initial overhead of converting lists to numpy arrays is no problem, as ids is already provided as a numpy array.
@zell - You're welcome. I just added an edit to my answer to suggest having inv as a numpy array of Python arrays, instead of having it as a dictionary of Python arrays.
Thanks! Earlier, I was using a numpy array of lists, but some people say numpy arrays of lists should be avoided. See stackoverflow.com/questions/64440765/…. Do you think a numpy array of Python arrays makes sense even though a numpy array of lists does not, according to that link?
@zell - Glad that there's some improvement. Did you stick with the dictionary, or replace it with a numpy array? Also, a word of caution: when you compare solutions to pick the better one, I hope you're using the exact same data across the alternatives (not just the same size of ids, but also the same content). That's because the size of the final constructed inv depends on both the content and the length of ids.
For the same reason, you may also want to repeat the tests while varying the content of ids, before you can be really sure about your final pick.

Avoid the for loop; here is a pandas sample:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": np.random.randint(0, 100, 100000),
    "B": np.random.randint(0, 100, 100000)
})

df.groupby("B")["A"].agg(list)

4 Comments

How is this what OP requires?
Thanks. How does this replace the nested loop in my question?
Firstly, transform the raw data into a pair list like data = [ [0, 0], [0, 1], [0, 2], ... ]; then build df = pd.DataFrame(data, columns=["A", "B"]) and run df.groupby("B")["A"].agg(list). (A sketch of this pipeline is shown after these comments.)
Thanks. This approach is elegant, but it seems to have a memory issue. I have about a million entries; each entry has a range [a, b] where b - a can be about 3000. With your approach I get about 3*10^9 pairs. In your example above, I tried 10**9 where you used 100000, and the code got killed after running for some minutes due to running out of memory (on a 16 GB laptop). Could we do the same as you proposed with h5py?
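
For reference, here is a sketch of the expand-then-group pipeline described in the comment above. It is a reconstruction, not the answerer's code: the column names "A" and "B" follow the answer's sample, and ids is the small 2 * N array from the question. Note that it materializes every (interval, point) pair, which is exactly the memory blow-up mentioned in the last comment:

import numpy as np
import pandas as pd

ids = np.array([[0, 2, 1],
                [3, 4, 3]])

lengths = ids[1] - ids[0]                                      # number of points each interval covers
interval_idx = np.repeat(np.arange(ids.shape[1]), lengths)     # column "A": interval index, repeated
points = np.concatenate([np.arange(a, b) for a, b in ids.T])   # column "B": the covered points

df = pd.DataFrame({"A": interval_idx, "B": points})
inv = df.groupby("B")["A"].agg(list)   # point -> list of interval indices
print(inv)                             # point 0 -> [0], 1 -> [0, 2], 2 -> [0, 1, 2], 3 -> [1]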

Since the order of N is large, I've come up with what seems like a practical approach; let me know if there are any flaws.

  1. Store the ith interval [x, y] as [x, y, i]. Sort these triples by their start (and end) values. This should take O(N log N) time.
  2. Create a frequency array freq[2*N+1]. For each interval, apply a range update (difference array) in O(1) per interval; a single prefix-sum pass then produces all point frequencies in O(N).
  3. Determine a threshold based on your data; according to that value, each point is classified as either sparse or frequent. For sparse points, do nothing. For frequent points only, store the intervals in which they occur.
  4. During lookup, if the point is frequent, you can directly access its pre-computed list. If the point is sparse, you can search the intervals in O(log N) time, since the intervals are sorted and their indexes were appended in step 1.

This seems like a practical approach to me; the rest depends on your usage, such as the amortized time you need per query. A minimal sketch of the idea is given below.
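
Here is a minimal sketch of this idea on the toy ids from the question. It is one possible reading of the answer, not the answerer's code: the threshold value is arbitrary, and the sparse-point lookup simply scans the candidates selected by a binary search on the sorted starts, which is a simplification of the O(log N) lookup mentioned in step 4.

import numpy as np
from bisect import bisect_right

ids = np.array([[0, 2, 1],
                [3, 4, 3]])
N = ids.shape[1]

# Step 1: intervals as (start, end, original index), sorted by start.
intervals = sorted((int(ids[0, i]), int(ids[1, i]), i) for i in range(N))
starts = [s for s, e, i in intervals]

# Step 2: per-point coverage counts via a difference array:
# an O(1) range update per interval, then one prefix-sum pass.
size = int(ids.max()) + 1
diff = np.zeros(size + 1, dtype=np.int64)
np.add.at(diff, ids[0], 1)       # +1 at every start
np.add.at(diff, ids[1], -1)      # -1 at every (exclusive) end
freq = np.cumsum(diff)[:-1]      # freq[p] = number of intervals covering point p

# Step 3: precompute interval lists only for "frequent" points.
THRESHOLD = 2                    # illustrative; pick based on your data
frequent = {}
for p in np.flatnonzero(freq >= THRESHOLD):
    frequent[int(p)] = [i for s, e, i in intervals if s <= p < e]

# Step 4: lookup. Frequent points hit the precomputed table; sparse points
# scan only the intervals whose start is <= p (binary search on starts).
def covering(p):
    if p in frequent:
        return frequent[p]
    hi = bisect_right(starts, p)
    return [i for s, e, i in intervals[:hi] if e > p]

print(covering(2))   # the intervals covering point 2
print(covering(3))   # the intervals covering point 3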

