2

I have a list of vectors (each vector contains only 0s and 1s):

In [3]: allLabelPredict   
Out[3]: array([[ 0.,  0.,  0., ...,  0.,  0., 1.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.]])

In [4]: allLabelPredict.shape  
Out[4]: (5000, 190)

As you can see, I have 190 different vectors, each of which is the output of one classifier. Now I want to select some of these outputs based on how close each vector is to my original label:

In [7]: myLabel
Out[7]: array([ 0.,  0.,  0., ...,  1.,  1.,  1.])

In [8]: myLabel.shape
Out[8]: (5000,)

For this purpose I've defined two different criteria for each vector: Zero Hamming Distance and One Hamming Distance.

"One Hamming Distance": the Hamming distance between the sub-array of myLabel whose entries equal "1" and the corresponding sub-array of each vector (I create the sub-array of each vector by selecting its values at the indices where myLabel is "1").

"Zero Hamming Distance": the Hamming distance between the sub-array of myLabel whose entries equal "0" and the corresponding sub-array of each vector (I create the sub-array of each vector by selecting its values at the indices where myLabel is "0").

To make it clearer, here is a small example:

MyLabel [1,1,1,1,0,0,0,0]   
V1 [1,1,0,1,0,0,1,1]   
sub-array1 [1,1,0,1]
sub-array0 [0,0,1,1]

"zero Hamming Distance": hamming(sub-array0, MyLabel[4:])

"one Hamming Distance": hamming(sub-array1, MyLabel[:4])
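The small example above can be worked through with NumPy as a minimal sketch. It assumes the Hamming distance is normalized (the fraction of mismatching positions, as scipy.spatial.distance.hamming reports it); the variable names are illustrative only.

```python
import numpy as np

my_label = np.array([1, 1, 1, 1, 0, 0, 0, 0])
v1 = np.array([1, 1, 0, 1, 0, 0, 1, 1])

# Entries of v1 at the positions where the label is 1 and 0 respectively
sub_array1 = v1[my_label == 1]
sub_array0 = v1[my_label == 0]

# Fraction of mismatches within each group (assumed normalized Hamming distance)
one_hamming = np.mean(sub_array1 != 1)
zero_hamming = np.mean(sub_array0 != 0)

print(one_hamming, zero_hamming)
```

For V1 this gives a one Hamming distance of 0.25 (one mismatch out of four) and a zero Hamming distance of 0.5 (two mismatches out of four).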

Now I want to select some vectors from 'allLabelPredict' based on "One Hamming Distance" and "Zero Hamming Distance".

I want to select the vectors that have the minimum "One Hamming Distance" and "Zero Hamming Distance" (by minimum I mean that both criteria for the vector are the lowest among all vectors).

If that is not possible, how can I sort the vectors so that they are ordered first by "One Hamming Distance" and then, within that, by the smallest "Zero Hamming Distance"?

0

4 Answers

1

OK, so first I'd split up the entire allLabelPredict into two subarrays based on the values in myLabel:

import numpy as np

allLabelPredict = np.random.randint(0, 2, (5000, 190))
myLabel = np.random.randint(0, 2, 5000)

sub0 = allLabelPredict[myLabel==0]
sub1 = allLabelPredict[myLabel==1]

ham0 = np.abs(sub0 - 0).mean(0)
ham1 = np.abs(sub1 - 1).mean(0)
hamtot = np.abs(allLabelPredict - myLabel[:, None]).mean(0)  # if they're not split

This is the same as scipy.spatial.distance.hamming, but that can only be applied to one vector at a time:

>>> np.allclose(scipy.spatial.distance.hamming(allLabelPredict[:,0], myLabel),
...             np.abs(allLabelPredict[:,0] - myLabel).mean(0))
True

Now, the indices in either ham array will be the indices in the second axis of the allLabelPredict array. If you want to sort your vectors by hamming distance:

sortby0 = allLabelPredict[:, ham0.argsort()]
sortby1 = allLabelPredict[:, ham1.argsort()]

Or if you want the lowest zero (or one) hamming, you would look at

best0 = allLabelPredict[:, ham0.argmin()]
best1 = allLabelPredict[:, ham1.argmin()]

Or if you want the lowest one hamming with zero hamming near 0.1, you could say something like

hamscore = (ham0 - 0.1)**2 + ham1**2
best = allLabelPredict[:, hamscore.argmin()]
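If a hard constraint is preferred over the squared-error score, one possible variation is to keep only the columns whose zero hamming lies within a tolerance of the target and then take the lowest one hamming among them. This is only a sketch: target and tol are hypothetical values chosen for the random data here, and the fallback when nothing qualifies is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
allLabelPredict = rng.integers(0, 2, (5000, 190))
myLabel = rng.integers(0, 2, 5000)

# Per-column zero and one hamming distances, computed as mismatch fractions
ham0 = (allLabelPredict[myLabel == 0] != 0).mean(0)
ham1 = (allLabelPredict[myLabel == 1] != 1).mean(0)

target, tol = 0.5, 0.02  # hypothetical target and tolerance for zero hamming

# Columns whose zero hamming is close enough to the target
candidates = np.flatnonzero(np.abs(ham0 - target) <= tol)

if candidates.size:
    # Among those, the column with the lowest one hamming
    best_col = candidates[ham1[candidates].argmin()]
    best = allLabelPredict[:, best_col]
else:
    best_col = None  # nothing within tolerance; widen tol or change target
```

The tolerance matters because the distances are floating-point fractions, so an exact equality test against the target would rarely match.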

9 Comments

I don't quite follow your solution. The point of my question is that I have two different Hamming distances and I want to sort based on both of them (it's important that they stay separate, since each one tells me something different). I want to minimize both at the same time or, if that's impossible, fix a value for one and minimize the other (for example, give me the vectors that have a zero Hamming distance of 0.2 and the minimum one Hamming distance).
It's fixed now, sorry I misunderstood at first. If you want to minimize both, you may as well take the hamming of the entire vectors, since the total hamming distance is just the two hamming distances combined (each weighted by the fraction of labels in its group).
I'm not sure whether I don't understand your solution or you didn't understand my question. For example, why allLabelPredict[myLabel==0], when I am working on each vector, which is allLabelPredict[:,i]? Why are you not just using the scipy Hamming distance function to calculate the Hamming distances? And where, in the end, do you give me the vectors that have the best zero and one Hamming distances (or, as I said earlier, oneHam = 0.25 and the minimum value of zeroHam)? Sorry for any inconvenience caused.
scipy.spatial.distance.hamming seems to only take one vector at a time. Maybe there's another function (like pdist) that allows me to give a matrix, but mine should give the hamming distance as defined here
You wanted to have the part of each vector that aligned with the parts of myLabel that were equal to 0 (or 1), that's what allLabelPredict[myLabel==0] (or ==1) gives. I don't look at one vector at a time because I can do all of them at once :)
1

The crux of the answer should include this: use sorted(allLabelPredict, key=<criteria>)

It lets you sort the list by criteria that you define as a function and pass as the key argument.

To do this, let's first convert your 190 vectors into pairs of (0-H Dist, 1-H Dist). Then you'll have something like this:

(0.10, 0.15)
(0.12, 0.09)
(0.25, 0.03)
(0.14, 0.16)
(0.14, 0.11)
...

Next, we need to clarify what you meant by "both criteria for this vector be the lowest amongst others". In the above case, should we choose (0.25, 0.03)? Or is it (0.10, 0.15)? How about (0.14, 0.11)? Fortunately you already said that in this case, we need to prioritize 1-H Dist first. So we will choose (0.25, 0.03), is this correct? From your comments in @askewchan's answer it seems that you want the sort criteria to be flexible.

If that's so, then your first criterion that "both criteria for this vector be the lowest amongst others" is actually part of your second criterion, which is "sort based on One Hamming Distance, then by Zero Hamming Distance", since after the sorting the vector with lowest distance on both scores will be at the top anyway.

Hence we just need to sort by 1-H Dist, and then by 0-H Dist when the 1-H Dist scores are equal. These sort criteria can be changed flexibly, as long as you already have the pairs of scores.

Here is some sample code:

import numpy as np
from scipy.spatial.distance import hamming

def sort_criteria(pair_of_scores):
    score0, score1 = pair_of_scores
    return (score1, score0)  # Sort by 1-H, then by 0-H

    # The following will sort by Euclidean distance
    #return score0**2 + score1**2

    # The following puts the vectors with score0==0.5 first, sorted by score1,
    # and every other vector after them (also sorted by score1)
    #return (0, score1) if abs(score0 - 0.5) < 1e-7 else (1, score1)

def main():
    allLabelPredict = np.asarray(np.random.randint(0, 2, (5, 10)), dtype=np.float64)
    myLabel = np.asarray(np.random.randint(0, 2, 10), dtype=np.float64)
    print(allLabelPredict)
    print(myLabel)

    # Each row of allLabelPredict is one vector here, so select columns by label value
    allSub0 = allLabelPredict[:, myLabel==0]
    allSub1 = allLabelPredict[:, myLabel==1]
    # hamming() compares two equal-length arrays, so build all-zeros / all-ones targets
    all_scores = [(hamming(sub0, np.zeros_like(sub0)), hamming(sub1, np.ones_like(sub1)))
                  for sub0, sub1 in zip(allSub0, allSub1)]
    print(all_scores)  # The (0-H, 1-H) score pairs

    all_scores = sorted(all_scores, key=sort_criteria)  # The sorting
    #all_scores = np.array([pair for pair in all_scores if pair[0]==0.5])  # For filtering

    print(all_scores)

if __name__ == '__main__':
    main()

Result:

[[ 1.  0.  0.  0.  0.  1.  1.  0.  1.  1.]
 [ 1.  0.  0.  0.  1.  0.  1.  0.  0.  1.]
 [ 0.  1.  1.  0.  1.  1.  1.  1.  1.  0.]
 [ 0.  0.  1.  1.  1.  1.  1.  0.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  0.  0.  0.  0.]]
[ 1.  1.  1.  1.  1.  0.  1.  1.  0.  1.]
[(1.0, 0.625), (0.0, 0.5), (1.0, 0.375), (1.0, 0.375), (0.5, 0.375)]
[(0.5, 0.375), (1.0, 0.375), (1.0, 0.375), (0.0, 0.5), (1.0, 0.625)]

You just need to change the sort_criteria function to change your criteria.

2 Comments

Thanks for your answer. The thing that is missing in your answer, and in askewchan's answer too, is that I want to be very flexible about both of my criteria. For example, I might want to select the vectors that have 0.5 for Zeroham and minimize OneHam. I thought about it and reached the conclusion that what I want (I couldn't explain it very well) is not really possible with a single criterion, so I need multiple functions with different objectives. Anyway, thanks for your help.
I included that case with the last return in my sort_criteria. Or do you want to output only those with Zeroham==0.5, discarding other vectors with Zeroham!=0.5? In that case you can just do a filter after the sort. See my updated answer.
0

If you sort first by one criterion and then by another, the first entry in that sort is the only one that could simultaneously minimize both criteria.

You can do that operation with numpy using argsort. This requires you to make a numpy array that has keys (a structured array). I will assume that you have two arrays called zeroHamming and oneHamming.

# make an array of the distances with keys
# these must be input as pairs (tuples), not as columns
hammingDistances = np.array([(one, zero) for one, zero in zip(oneHamming, zeroHamming)],
    dtype=[("one","float"),("zero","float")])
# to see how the keys work, try:
print(hammingDistances['zero'])
# do a sort by oneHamming, then by zeroHamming (used only to break ties)
sortedIndsOneFirst = np.argsort(hammingDistances,order=['one','zero'])
# do a sort by zeroHamming, then by oneHamming (used only to break ties)
sortedIndsZeroFirst = np.argsort(hammingDistances,order=['zero','one'])
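To illustrate how the second key behaves, here is a small self-contained sketch with made-up distances for five vectors. The second key only decides the order among entries that tie on the first key. np.lexsort is shown as an alternative that avoids the structured array; note that it takes the primary key last.

```python
import numpy as np

# Hypothetical distances for five vectors (illustrative values only)
oneHamming = np.array([0.15, 0.09, 0.03, 0.09, 0.03])
zeroHamming = np.array([0.10, 0.12, 0.25, 0.08, 0.30])

# Structured array with one (one, zero) record per vector
hammingDistances = np.array(list(zip(oneHamming, zeroHamming)),
                            dtype=[("one", float), ("zero", float)])

# Sort by 'one'; 'zero' is consulted only where 'one' values are equal
sortedIndsOneFirst = np.argsort(hammingDistances, order=["one", "zero"])

# Same ordering without a structured array: the last key is the primary one
sameWithLexsort = np.lexsort((zeroHamming, oneHamming))

print(sortedIndsOneFirst)  # [2 4 3 1 0]
```

Vectors 2 and 4 tie on the one hamming distance (0.03), so vector 2 comes first because its zero hamming distance is lower; the same happens for vectors 3 and 1.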

1 Comment

I have 2 questions: 1- What is "(one,zero) for one,zero"? It gives me an error. 2- When you say "do a sort by oneHamming, then by zeroHamming", how exactly does it work? Does it try to sort by both criteria at the same time, or does it use the second criterion only when two or more elements have the same value for the first?
-1

It's easier to work with as1 = allLabelPredict.T, because then as1[0] will be your first vector, as1[1] your second, etc. Then your hamming distance function is simply:

def ham(a1, b1): return sum(map(abs, a1-b1))

So, if you want the vectors that match your criterion, you can use composition:

vects = numpy.array( [ a for a in as1 if ham(a, myLabel) < 2 ] )

where myLabel is the vector you want to compare with.
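For a large array, the same selection can be sketched without a Python-level loop by counting mismatches for every column at once. This is an illustrative variant on small made-up data, not the code above; like the ham function, it counts total mismatches and does not separate the zero and one groups.

```python
import numpy as np

rng = np.random.default_rng(0)
myLabel = rng.integers(0, 2, 20)
allLabelPredict = rng.integers(0, 2, (20, 6))

# Made-up data: force column 2 to match exactly and column 4 to differ in one place
allLabelPredict[:, 2] = myLabel
allLabelPredict[:, 4] = myLabel
allLabelPredict[0, 4] ^= 1

# Number of positions where each column disagrees with myLabel
counts = (allLabelPredict != myLabel[:, None]).sum(axis=0)

# Keep the columns with fewer than 2 mismatches, one vector per row
vects = allLabelPredict[:, counts < 2].T
```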

5 Comments

For such a large array, using numpy would probably speed things up significantly. Also don't recommend using as as a variable name. In fact, it might be illegal.
I don't see how your solution will solve my problem?
I want to select those vectors which have the minimum "One Hamming Distance" and "zero Hamming Distance". That is exactly what the above lines do. Also as is a keyword, so I changed it in the edit.
And the previous vect is myLabel in your question.
Down vote was presumably due to the as syntax error or the fact that you don't take into account the 'zero' vs. 'one' hamming distance. Also, this doesn't select the vectors with minimum one and zero hamming distance, it selects all vectors that have total hamming distance < 2.
