
I have a large numpy vector X and a comparison function f(x,y). I need to find all pairs of elements of X for which f(X[i],X[j])<T for some threshold T. This works well:

good_inds = {}
for i in range(len(X)):
    for j in range(i+1, len(X)):
        score = f(X[i], X[j])
        if score < T:
            good_inds[i, j] = score

This builds a dictionary that represents a sparse matrix. The problem is that it is rather slow, and I would like to parallelise the process. Please advise.
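For concreteness, here is a tiny runnable version of the loop above; the f and T used are illustrative stand-ins for my real comparison function and threshold:

```python
import numpy as np

X = np.array([0.0, 0.1, 1.0, 1.05])
T = 0.2

def f(x, y):
    # stand-in comparison: absolute difference
    return abs(x - y)

good_inds = {}
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        score = f(X[i], X[j])
        if score < T:
            good_inds[i, j] = score

print(good_inds)  # keeps the two close pairs: (0, 1) and (2, 3)
```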

  • x and y are constants within the scope of this snippet, so why use a dictionary? Did you mean x --> X[i] and y --> X[j]? Commented Jan 1, 2019 at 14:53
  • 1
    Answers to this sort of question will be strongly dependent on what f does, e.g. what sort of constraints can be exploited. Roland's answer is great if nothing more is known about the problem, but you'd get much more relevant answers if you said that X and Y are both numpy arrays and f is a simple algebraic expression. Commented Feb 4, 2019 at 14:53

2 Answers


This is a good fit for multiprocessing.Pool.

Create your numpy array, then make an iterator of 2-tuples of all possible i and j values, for example with itertools.combinations:

In [1]: import itertools

In [7]: list(itertools.combinations(range(4), 2))                                                        
Out[7]: [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

(You should use the length of your vector as the argument to range, of course.)

Write the following function:

def worker(pair):
    i, j = pair
    # X and T are globals here; worker processes inherit them on fork
    rv = f(X[i], X[j]) < T
    return (i, j, rv)

Create a Pool, and run imap_unordered:

p = multiprocessing.Pool()
for i, j, result in p.imap_unordered(worker, itertools.combinations(range(len(X)), 2)):
    if result:
        print('Good pair:', i, j)
        # do something with the results...

This will run as many workers as your CPU has cores.


1 Comment

Thanks, this is very cool. In the end I found SciPy's distance matrices to be already optimized.

So. Apparently SciPy is already good enough.

from scipy import spatial

full_dist_mat = spatial.distance.squareform(spatial.distance.pdist(vects2, metric='cosine'))

is already optimised. Computing all pairwise distances for 2000 vectors takes 1.3 seconds in JupyterLab on a MacBook Pro.
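To recover the (i, j) pairs below the threshold, note that pdist returns a condensed vector with one entry per i < j pair, in the same row-major order as np.triu_indices. A sketch of how the two answers connect (vects2 and T here are illustrative stand-ins):

```python
import numpy as np
from scipy import spatial

rng = np.random.default_rng(0)
vects2 = rng.normal(size=(200, 16))  # stand-in for the real vectors
T = 0.5                              # stand-in threshold

# condensed distance vector: one entry per (i, j) pair with i < j
d = spatial.distance.pdist(vects2, metric='cosine')

# map condensed positions back to (i, j) index pairs
i_idx, j_idx = np.triu_indices(len(vects2), k=1)
mask = d < T
good_inds = dict(zip(zip(i_idx[mask], j_idx[mask]), d[mask]))

print(len(good_inds), "pairs below threshold")
```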
