2

How to compare the similarity between two arrays? Say I have:

Base Array: [.5,0,0,0,.25,0,0,.25,0,0,0,0]

Array 1: [1,0,0,0,1,0,0,1,0,0,0,0]
Array 2: [0,0,1,0,0,0,1,0,0,1,0,0]
Array 3: [1,0,0,0,0,0,0,0,0,0,0,0]

Regarding the arrays above, the answer should be Array 1. The answer is Array 1 because, the array elements are 'closer' in structure to the array elements of the base array. Differing from Array 3, .25 is closer to 1 than 0. Another example:

Base Array: [.75,0,0,0,0,0,0,0,.25,0,0,0]

Array 1: [1,0,0,0,1,0,0,1,0,0,0,0]
Array 2: [0,0,1,0,0,0,1,0,0,1,0,0]
Array 3: [1,0,0,0,0,0,0,0,0,0,0,0]

Which in this case, Array 3 should then be the answer.

However, using my current algo (which I will give later), the answer becomes Array 3. Here is what I am using:

for (int i = 0; i < basearray.Length; i++)
{
  temp = (basearray[i] - arrayX[i]);
  dist += temp * temp;
}

So, I think there is something wrong with my algo? Or maybe, I need to use a 'different' kind of algorithm and not distance (since essentially, .25 IS closer to 0 than 1, but what I want is otherwise).

Thanks!

UPDATE:

I found the answer! Thanks for all those for the help. Here it is:

float[] pbaseArrX = new float[3];
float[] pcompArrX = new float[3];

float dist1 = 0, dist2 = 0;

for (int i = 0; i < baseArrX.Count; i++)
{
  pbaseArrX[i] = baseArrX[i] / (baseArrX[0] + baseArrX[1] + baseArrX[2]);
}

//Do the following for both compArr1 and compArr2;
for (int i = 0; i < compArrX.Count; i++)
{
  pcompArrX[i] = pcompArrX[i] / (pcompArrX[0] + pcompArrX[1] + pcompArr[2]);
}

//Get distance for both
for (int i = 0; i < pcompArrX.Count; i++)
{
  distX = distX + ((pcompArrX[i] - pbaseArrX[i])^2);
}

//Then just use conditional to determine which is 'closer'
6
  • 2
    So what exactly do you want, if not distance? I.e. why is Array 1 the correct answer? Commented Jun 17, 2011 at 7:15
  • Please expand on "otherwise." Commented Jun 17, 2011 at 7:17
  • Please give an example of arrayX values that would produce a result of Array 1. Commented Jun 17, 2011 at 7:20
  • Your algorithm computes (square of) euclidian distance (I guess dist initialized to 0 just before the loop). This is a perfectly reasonable choice and what would be done most of the time and with this measure, Array3 is indeed the closer one. It is of course possible to define a distance many other way, but you as you did not explain why distance to Array1 should be shorter, it is difficult to help you. Commented Jun 17, 2011 at 7:21
  • Sorry, what I meant to say was, the 'closer' an element is to the 'desired' element (disregarding 0), then additively, they should add to the similarity measure for that particular array. I.E. .25 is 'closer' to 1 than 0 is to 1. Am I making sense? Sorry. Commented Jun 17, 2011 at 7:23

4 Answers 4

5

It seems like you want to compare the arrays as rays (just direction), but you're comparing them as vectors (direction and magnitude). I'd suggest comparing the arrays with cosine similarity, which is just the cosine of the angle between the vectors and thus comparison of only their directions. For the arrays presented, the cosine similarity between the base array and array 1 is 0.94 while that with array 2 is 0.82, matching your expectations.

Sign up to request clarification or add additional context in comments.

2 Comments

+1 Interesting! Hadn't heard of cosine similarity before and makes sense why that might be used. Could well be the solution to the question.
Hi! Thanks for this! This Cosine Similarity algorithm is good for the time being! I tested this and there are sometimes cases in which I come up with a wrong answer, but while I do not yet have an algo, Ill use this. Thanks!
4

Array 3 is the correct answer. The algorithm you're using is giving you the right result.

Basically, for me Array 3 is more similar to the base Array than Array1. What's the pattern that you're looking for? You say Array1 should be the result... why?

Distance is just a way to compare two arrays by an arbitrary mathematical asumption, there has no real "logic" behind it, but that we give it to it.

If you want the result to be Array1 then:

  • Define WHY Array1 shall be the result from logical terms.
  • Translate WHY Array1 shall be the result to a mathematical formulation
  • Implement that formulation

Comments

4

The problem here is that your concept of "similarity" is not clearly defined. Depending on the use case of the data, there are infinitely many ways to define similarity. Leaving your array aside there is an simple example for that:

  • Glasses and a Binocular are similar, because you use both of them to look at things.
  • Glasses and a Bicycle are similar, because both consist of two circles linked with each other
  • Glasses and Grass are similary, because both start with "G" and end with "S"

As you can see unless you define exactly what you need anything can be similar to anything. Humans are good to use the right kind of similarity for the right task, but a computer wont be able to do that, unless you tell it explicitly what you want.

Leaving this point aside, there is one common case of similarity, that is quite often used for sequence data in data mining. This is called the cosine distance, and it is not that different from what you are using. It is called the cosine distance. Here is the algorithm:

for (int i = 0; i < basearray.Length; i++)
{
  temp += (basearray[i] * arrayX[i]);
  f_base += (basearray[i] * basearray[i]);
  f_array += (array[i] * array[i]);
}
dist = 1 - (temp / sqrt( f_base * f_array ));

This is basically just computing the "Angle" between both array depicted as points in n-Dimensional space. Works just fine in most cased and can be adopted easily to other needs (when other kinds of similarity are needed).

5 Comments

I think you're missing a square root for the denominator of the fraction in the last line.
@Michael thanks for the hint... yes I accidentally forgot that part. Will edit it in
I see another couple of bugs, too. I've made the fixes, but have insufficient reputation for those to show up without review.
@Michael Thanks for the help. I guess I shouldn't be posting without enough coffee in the morning. I definitely need to fetch some, while my code is compiling.
Another hint on this side: Keep your arrays normalized when doing data mining tasks. This will safe you from bugs at other places (normally if I wrote the above code, the sqrt and the division wont be there).
2

Mathematically, each array is a point and the distance measure is called a norm. You are using a version of the Euclidean norm which is our standard measure of spatial distance in three dimensions. It's just missing the square root out because all you're interested in which one is closest as opposed to measuring actual distance so it would still work for you.

In your example, the third array is definitely closest in Euclidean distance, because your base array is a heck of a lot closer to a zero array than your first array is. They may have "similar structure" but you're looking at it in the wrong way. Your distance measure is interested in numerical distance, and 0 (in array 3) is a lot closer to 0.25 than 1 is (in array 1).

If you're looking at "structure" it means you think 0 is a lot more significant than any other number. i.e. you want to reward a matching array for having non-zero's in the same place, rather than being numerically close to 0.

I'm not sure what sort of norm you want for that and, to be honest, this gives me the impression we're missing out on what you need to achieve at the end of the day - it's a bit difficult to make suggestions on what we know so far.

4 Comments

Hamming distance would match the structural difference (using "structure" loosely, here, no formal definition in mind) you're giving.
@Michael J. Barber - I'm not sure how one implements Hamming Distance using real numbers. I tend to feel your suggestion of cosine similarity is more likely to be useful, but I think the question is a little vague on the final destination and get the feeling user488792 doesn't have a formal requirement either, but someone can tell him, on a case-by-case basis, whether something looks OK/NOT OK.
Easiest way for the Hamming distance would be to just make a new base array with binary values. I don't know how to do it, otherwise.
He really needs to refine what he means by similar. Maybe he really wants to use something like Mahalanobis distance.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.