
I have a large numpy vector X and a comparison function f(x,y). I need to find all pairs of elements of X for which f(X[i],X[j])<T for some threshold T. This works well:

good_inds = {}
for i in range(len(X)):
    for j in range(i+1, len(X)):
        score = f(X[i], X[j])
        if score < T:
            good_inds[i, j] = score

This builds a dictionary that represents a sparse matrix. The problem is that it is rather slow, and I would like to parallelise the process. Please advise.
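For concreteness, here is a tiny runnable version of the loop above; the f and T used are illustrative stand-ins for my real comparison function and threshold:

```python
import numpy as np

X = np.array([0.0, 0.1, 1.0, 1.05])
T = 0.2

def f(x, y):
    # stand-in comparison: absolute difference
    return abs(x - y)

good_inds = {}
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        score = f(X[i], X[j])
        if score < T:
            good_inds[i, j] = score

print(good_inds)  # keeps the two close pairs: (0, 1) and (2, 3)
```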

  • x and y are constants within the scope of this snippet, so why use a dictionary? Did you mean x --> X[i] and y --> X[j]? Commented Jan 1, 2019 at 14:53
  • 1
    Answers to this sort of question will be strongly dependent on what f does, e.g. what sort of constraints can be exploited. Roland's answer is great if nothing more is known about the problem, but you'd get much more relevant answers if you said that X and Y are both numpy arrays and f is a simple algebraic expression. Commented Feb 4, 2019 at 14:53

2 Answers


This is a good fit for multiprocessing.Pool.

Create your numpy array, then make an iterator of 2-tuples of all possible i and j values, for example with itertools.combinations:

In [1]: import itertools

In [7]: list(itertools.combinations(range(4), 2))                                                        
Out[7]: [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

(You should use the length of your vector as the argument to range, of course.)

Write the following function:

def worker(pair):
    i, j = pair
    # X and T are globals here; worker processes inherit them on fork
    rv = f(X[i], X[j]) < T
    return (i, j, rv)

Create a Pool, and run imap_unordered:

p = multiprocessing.Pool()
for i, j, result in p.imap_unordered(worker, itertools.combinations(range(len(X)), 2)):
    if result:
        print('Good pair:', i, j)
        # do something with the results...

This will run as many workers as your CPU has cores.


1 Comment

Thanks, this is very cool. In the end I found SciPy's distance matrices to be already optimized.

So. Apparently SciPy is already good enough.

from scipy import spatial

full_dist_mat = spatial.distance.squareform(spatial.distance.pdist(vects2, metric='cosine'))

is already optimised. Computing all pairwise distances for 2000 vectors takes 1.3 seconds in JupyterLab on a MacBook Pro.
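To recover the (i, j) pairs below the threshold, note that pdist returns a condensed vector with one entry per i < j pair, in the same row-major order as np.triu_indices. A sketch of how the two answers connect (vects2 and T here are illustrative stand-ins):

```python
import numpy as np
from scipy import spatial

rng = np.random.default_rng(0)
vects2 = rng.normal(size=(200, 16))  # stand-in for the real vectors
T = 0.5                              # stand-in threshold

# condensed distance vector: one entry per (i, j) pair with i < j
d = spatial.distance.pdist(vects2, metric='cosine')

# map condensed positions back to (i, j) index pairs
i_idx, j_idx = np.triu_indices(len(vects2), k=1)
mask = d < T
good_inds = dict(zip(zip(i_idx[mask], j_idx[mask]), d[mask]))

print(len(good_inds), "pairs below threshold")
```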
