
I have a 1-dimensional sorted array and would like to find all pairs of elements whose difference is less than 5.

A naive approach would be to make N^2 comparisons, doing something like

import numpy as np

diffs = np.tile(x, (x.size, 1)) - x[:, np.newaxis]      # diffs[i, j] = x[j] - x[i]
D = np.logical_and(diffs > 0, diffs < 5)                # pairs with 0 < difference < 5
indices = np.argwhere(D)                                # (i, j) index pairs into x

Note that the output of my example is an array of indices into x. If I wanted the values of x which satisfy the criteria, I could do x[indices]. This works for smaller arrays, but not for arrays of the size I work with.

An idea I had was to find where there are gaps larger than 5 between consecutive elements, split the array into pieces at those gaps, and then compare all the elements within each piece.

Is this a more efficient way of finding elements which satisfy my criteria? How could I go about writing this?

Here is a small example:

x = np.array([ 9, 12, 
           21, 
           36, 39, 44, 46, 47, 
           58, 
           64, 65,])

The result should look like:

array([[ 0,  1],
       [ 3,  4],
       [ 5,  6],
       [ 5,  7],
       [ 6,  7],
       [ 9, 10]], dtype=int64)
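
For reference, the naive code above reproduces exactly that array on this example, and x[indices] gives the matching value pairs:

diffs = np.tile(x, (x.size, 1)) - x[:, np.newaxis]
indices = np.argwhere(np.logical_and(diffs > 0, diffs < 5))
x[indices]
# array([[ 9, 12],
#        [36, 39],
#        [44, 46],
#        [44, 47],
#        [46, 47],
#        [64, 65]])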
  • You could have two indices, left and right: left points at the smaller element, right at the larger. Once you find all the valid rights for the current left, you increment left and continue from the same right. It's an O(n) approach, and you can do it because the array is sorted. Commented Jun 12, 2018 at 19:30
  • @user1055395 It can't be O(n) and correct at the same time, because if all elements are close, just enumerating all pairs is already O(n^2). Commented Jun 12, 2018 at 20:25
  • @PaulPanzer At the time I added the comment I thought duplicates were not allowed. After learning they're allowed, I agree that O(N^2) is the worst case here. Commented Jun 12, 2018 at 20:30
  • Due to the nature of the problem, I know there are lots of gaps larger than 5, so while O(N^2) is the worst case, I am hopeful it will run much faster than that. Commented Jun 12, 2018 at 20:32
  • Can you please provide an example of the data? Commented Jun 12, 2018 at 20:50

2 Answers


Here is a solution that, for each offset k, compares every candidate element with the one k positions to its right, shrinking the candidate set until it is empty:

import numpy as np

def f_pp(A, maxgap):
    # d0[i] is the gap between consecutive elements; d[i] accumulates the
    # distance from A[i] to A[i + k] as the offset k grows
    d0 = np.diff(A)
    d = d0.copy()
    IDX = []
    k = 1
    # candidates: left indices i whose current partner A[i + k] is in range
    idx, = np.where(d <= maxgap)
    vidx = idx[d[idx] > 0]       # require a strictly positive difference, as in the question
    while vidx.size:
        IDX.append(vidx[:, None] + (0, k))   # emit pairs (i, i + k)
        if idx[-1] + k + 1 == A.size:        # last candidate would run off the end
            idx = idx[:-1]
        d[idx] = d[idx] + d0[idx + k]        # extend each candidate by one element
        k += 1
        idx = idx[d[idx] <= maxgap]          # shrink the candidate set
        vidx = idx[d[idx] > 0]
    if not IDX:                              # no close pairs at all
        return np.empty((0, 2), dtype=int)
    return np.concatenate(IDX, axis=0)
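
As a quick sanity check on the example from the question: f_pp tests d <= maxgap, while the expected output corresponds to a strict < 5 on these integers, so f_pp(x, 4) reproduces the question's pairs, ordered by offset k rather than lexicographically:

x = np.array([9, 12, 21, 36, 39, 44, 46, 47, 58, 64, 65])
print(f_pp(x, 4))
# [[ 0  1]
#  [ 3  4]
#  [ 5  6]
#  [ 6  7]
#  [ 9 10]
#  [ 5  7]]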

# random sorted data with many duplicates (duplicates are allowed, per the comments)
data = np.cumsum(np.random.exponential(size=10000)).repeat(np.random.randint(1, 20, (10000,)))

pairs = f_pp(data, 1)
#pairs = set(map(tuple, pairs))

from timeit import timeit
kwds = dict(globals=globals(), number=100)
print(data.size, 'points', pairs.shape[0], 'close pairs')
print('pp', timeit("f_pp(data, 1)", **kwds)*10, 'ms')  # seconds for 100 runs * 10 = ms per run

Sample run:

99963 points 1020651 close pairs
pp 43.00256529124454 ms

Your idea of slicing the array is a very efficient approach. Since your data are sorted, you can just compute the consecutive differences and split at the large gaps:

d = np.diff(x)
ind = np.where(d > 5)[0]        # positions where the gap to the next element exceeds 5
pieces = np.split(x, ind + 1)   # note the +1: np.split cuts *before* the given indices

Here pieces is a list of subarrays; you can then loop over it and run your own code on each piece, for example the naive pairwise comparison from your question, as sketched below.
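
A minimal sketch of that loop, assuming x as in the question; it reuses the naive comparison per piece and shifts the piece-local indices back into x:

import numpy as np

d = np.diff(x)
ind = np.where(d > 5)[0]
pieces = np.split(x, ind + 1)
offsets = np.concatenate(([0], ind + 1))    # global index of each piece's first element

result = []
for piece, off in zip(pieces, offsets):
    # the question's naive O(m^2) comparison, but only within one piece
    diffs = piece[np.newaxis, :] - piece[:, np.newaxis]
    local = np.argwhere(np.logical_and(diffs > 0, diffs < 5))
    if local.size:
        result.append(local + off)          # shift piece-local indices back into x

pairs = np.concatenate(result) if result else np.empty((0, 2), dtype=int)
# on the question's example this yields exactly the expected array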

The best algorithm depends heavily on the nature of your data, which I don't know. For example, another possibility is to write a nested loop:

pairs = []
for i in range(x.size):
    j = i + 1
    # check the bound before indexing, otherwise x[j] fails at the end of the array
    while j < x.size and x[j] - x[i] <= 5:
        pairs.append([i, j])
        j += 1

If you want it to be more clever, you can edit the loops so that j never moves backwards and the scan is not restarted after a gap; a sketch of that two-pointer variant (mentioned in the comments) follows.
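
A minimal sketch, assuming x is sorted as above; this is the two-pointer idea from the comments, not part of the original answer:

pairs = []
j = 1
for i in range(x.size):
    j = max(j, i + 1)                   # the right pointer never moves backwards
    while j < x.size and x[j] - x[i] <= 5:
        j += 1                          # grow the window while elements are in range
    pairs.extend([i, k] for k in range(i + 1, j))
# O(n + number of emitted pairs): both pointers only move forward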
