
I have to calculate the pairwise weighted distance of a potentially large number of vectors. Let A be the m×n matrix holding m vectors with n attributes each, and let W be a 1×n weight vector.

My first function calculates the pairwise distance the naive way:

import numpy as np

def weightedDistance(A, W):
    # Pairwise weighted squared Euclidean distance:
    # aff[i, j] = sum_k W[k] * (A[i, k] - A[j, k])**2
    aff = np.zeros((len(A), len(A)))
    for i in range(len(A)):
        for j in range(len(A)):
            aff[i, j] = np.sum(((A[i] - A[j])**2) * W)
    return aff

This takes over twenty seconds for a 1000x2 matrix.

Then I tried a faster version, using matrix multiplication, based on this blog post:

def weightedDistanceFast(A, W):
    # Expands (a - b)**2 = a**2 - 2ab + b**2 and vectorizes the
    # cross term as a matrix product
    AA = np.sum((A**2) * W, 1)
    A2 = -2 * ((A * W).dot(A.T))
    return (AA + A2).T + AA

This is "instant" for the same 1000x2 matrix. However, it introduces a bunch of errors on the order of 1e-17.

>>> dist = desc.weightedDistanceFast(attr,(1,1))
>>> dist2 = desc.weightedDistance(attr,(1,1))
>>> np.sum(dist-dist2)
6.9155863612166586e-11
>>> len(np.where((dist-dist2 > 0))[0])
380824
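The discrepancy comes from catastrophic cancellation: the fast version expands (a - b)**2 into a**2 - 2ab + b**2, and subtracting the large, nearly equal intermediate terms discards the low-order bits that carry the answer. A minimal standalone illustration (the numbers here are chosen purely for demonstration, they are not from the question):

```python
# Both formulas are algebraically identical, but the expanded form
# builds huge intermediates (~1e16) and then cancels them, so the
# small true result (1.0) falls below the float64 resolution there.
x, y = 1e8 + 1.0, 1e8

direct = (x - y) ** 2                   # subtract first, then square
expanded = x * x - 2 * x * y + y * y    # large terms first, cancel later

print(direct, expanded)  # the expanded form no longer equals 1.0
```

The per-entry error of ~1e-17 in the fast function is exactly this effect on a smaller scale.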

Checking other answers around here, I also tried the wminkowski metric from scipy.spatial.distance, but it yielded results that differ from the two functions above, depending on the value of W.

import scipy.spatial.distance as sp

def weightedDistanceMinkowski(A, W):
    # Note: scipy's wminkowski applies w *before* raising to the power p,
    # i.e. (sum_k |w_k * (u_k - v_k)|**p)**(1/p)
    aff = sp.squareform(sp.pdist(A, metric='wminkowski', p=2, w=W))
    return aff**2
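Because wminkowski multiplies the weights in before squaring, its squared p=2 distance is sum_k W[k]**2 * (u_k - v_k)**2, which only agrees with the other two functions when every weight is 0 or 1. Note also that wminkowski was removed in SciPy 1.8; in current SciPy the plain minkowski metric accepts w and applies it as (sum_k w_k * |u_k - v_k|**p)**(1/p), so its squared p=2 distance matches the weighted squared Euclidean definition directly. A sketch checking this against the naive loop from the question:

```python
import numpy as np
import scipy.spatial.distance as sp

def weightedDistance(A, W):
    # Naive reference: pairwise weighted squared Euclidean distance
    aff = np.zeros((len(A), len(A)))
    for i in range(len(A)):
        for j in range(len(A)):
            aff[i, j] = np.sum(((A[i] - A[j]) ** 2) * W)
    return aff

rng = np.random.default_rng(0)
A = rng.random((50, 2))
W = np.array([0.3, 1.7])

# Modern SciPy: 'minkowski' with w weights |u - v|**p itself, so
# squaring the p=2 distance yields sum_k W[k] * (u_k - v_k)**2
dist = sp.squareform(sp.pdist(A, metric='minkowski', p=2, w=W)) ** 2
print(np.allclose(dist, weightedDistance(A, W)))
```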

Anyway, I know that a certain degree of error is inevitable, but since I'm going to do further calculations on this distance matrix, I wonder if there is a way to reduce this error (or, alternatively, a less error-prone way to make function 1 faster).
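On the "less error-prone way to make function 1 faster" front, one option (a sketch, not from the original post) is broadcasting: it performs the same subtract-then-square-then-sum operations as the double loop, so it avoids the cancellation of the expanded formula entirely, at the cost of materializing an m×m×n temporary array.

```python
import numpy as np

def weightedDistance(A, W):
    # Naive reference implementation from the question
    aff = np.zeros((len(A), len(A)))
    for i in range(len(A)):
        for j in range(len(A)):
            aff[i, j] = np.sum(((A[i] - A[j]) ** 2) * W)
    return aff

def weightedDistanceBroadcast(A, W):
    # Same order of operations as the loop, vectorized: build all
    # pairwise differences at once, then square, weight, and sum.
    diff = A[:, None, :] - A[None, :, :]   # shape (m, m, n)
    return np.sum((diff ** 2) * W, axis=-1)

A = np.random.rand(100, 2)
W = np.array([1.0, 1.0])
print(np.allclose(weightedDistanceBroadcast(A, W), weightedDistance(A, W)))
```

This trades memory for speed (O(m²n) temporaries), so it suits moderate m; for very large m the matrix-product version remains the cheaper option.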

Thank you!

  • The real question is do you really need that level of precision? If not (likely!) just go for the speed! And if you did need it, then you'll probably have to worry more about the noise in your data than the numerical errors... Commented Dec 22, 2015 at 2:24
  • If necessary, you can increase the precision by changing an array's data type, e.g., A = A.astype(np.float64) or similar. Commented Dec 22, 2015 at 2:27
  • What makes you so confident that the first answer is the "correct" one? Both will have rounding errors, but if I had to guess I would expect the second answer to be closer to the true distance value. np.dot usually calls BLAS routines, which can make use of things like fused multiply-add operations to reduce the total number of floating point operations required and hence the rounding error. Commented Dec 22, 2015 at 3:30
  • Thank you all for the comments -- Julien is quite correct in mentioning that if I care about this level of error I might have bigger problems. So I guess my zeroth question should be whether these errors really matter, to which I say "I'm not sure yet". Time to think about it. ali_m: as for the correctness, the errors occur even on very small matrices, which I did check by hand. This leads me to believe that it is numpy introducing the errors, but thanks for the comment! I'm still interested in the answer for curiosity's sake. I wonder if what Reti says is all there is to it. Commented Jan 4, 2016 at 2:03
