0

I have available to me hundreds of JSON strings. Each of these contains an array of 15-20 words sorted by some predetermined weight. This weight, if it's worth noting, is the amount of times these words are found in some chunk of text. What's the best way of finding similarity between arrays of words that are structured like this?

First idea that came to my head was to create a numerical hash of all the words together and basically compare these values to determine similarity. I wasn't very successful with this, since the resulting hash values of very similar strings were not very close. After some research regarding string comparison algorithms, I come to Stackoverflow in hopes of receiving more guidance. Thanks in advance, and please let me know if you need more details of the problem.

Edit 1: Clarifying what I'm trying to do: I want to determine how similar two arrays are according to the words each of these have. I would also like to take into consideration the weight each word carries in each array. For example:

var array1 = [{"word":"hill","count":5},{"word":"head","count":5}];
var array2 = [{"word":"valley","count":7},{"word":"head","count":5}];
var array3 = [{"word":"head", "count": 6}, {"word": "valley", "count": 5}];
var array4 = [{"word": "valley", "count": 7}, {"word":"head", "count": 5}];

In that example, array 4 and array 2 are more similar than array 2 and array 3 because, even though both have the same words, the weight is the same for both of them in array 4 and 2. I hope that makes it a little bit easier to understand. Thanks in advance.

6
  • So you have N arrays with Nm words each, and you want to determine what exactly? Commented Sep 26, 2011 at 1:23
  • I edited my original post with some clarification. Hope that helps and thanks for interest. Commented Sep 26, 2011 at 1:34
  • what is more similar head and hade or head and cavesa? Commented Sep 26, 2011 at 1:38
  • Hi Itay, I'm not interested in how two words are similar, I'm interested in how two arrays are similar, in the fact that they share some words. Commented Sep 26, 2011 at 1:42
  • Are you looking for an algorithm to determine the overall similarity of two strings, or are you looking for how to loop through all the arrays to do this comparison? Or both? Commented Sep 26, 2011 at 1:51

4 Answers 4

3

I think that what you want is "cosine similarity", and you might also want to look at vector space models. If you are coding In Java, you can use the open source S-space package.

(added on 31 Oct) Each element of the vector is the count of one particular string. You just need to transform your arrays of strings into such vectors. In your example, you have three words - "hill", "head", "valley". If your vector is in that order, the vectors corresponding to the arrays would be

// array: #hill, #head, #valley
array1:  {5,     5,     0}
array2:  {0,     5,     7}
array3:  {0,     6,     5}
array4:  {0,     5,     7}
Sign up to request clarification or add additional context in comments.

5 Comments

Thank you for the suggestion. Even though this is very helpful and interesting material, in this scenario I am not interested in comparing the similarity of the strings themselves. I only care if they're the same or not. In this case, I'm comparing similarity of arrays of strings.
@Xavier - Yes, that is what cosine similarity does. Each element of the vector is the count of one particular string. You just need to transform your array of strings into such a vector. In your example, you have three words - "hill", "head", "valley". If your vector is in that order, the vector corresponding to array1 would be {5, 5, 0}.
Interesting, kc2001. Thanks for getting back to me. I still don't understand completely, I have to admit. In the case you explained, how would a vector containing only counts helps me to compare arrays? In other words, where in that vector is the information that contains what the actual string is and not just the count of the string? I saw some examples researching the web where they make an alphabet of strings [abcde] and then a vector based on the union of characters between two strings. These two vectors are then compared using cosine similarity Are you suggesting a similar approach here?
I think I understand now. I have to perform an union operation between arrays, where each element is a word. Once I have a single vector with all the words in both arrays, I then convert each word array into it by replacing words that aren't in the array with a 0, and using the count otherwise. Once I'm done with that, I should have two vectors on which I can perform cosine similarity. Is this the right approach?
@Xavier - Yes, that's basically it.
1

Given that each array has to be compared to every other array, you are looking at a serious amount of processing along the lines of ∑(n-1) times the average number of "words" in each array. You'll need to store the score for each comparison, then make some sense of it.

e.g.

var array1 = [{"word":"hill","count":5},{"word":"head","count":5}];
var array2 = [{"word":"valley","count":7},{"word":"head","count":5}];
var array3 = [{"word":"head", "count": 6}, {"word": "valley", "count": 5}];
var array4 = [{"word": "valley", "count": 7}, {"word":"head", "count": 5}];

// Comparison score is summed product of matching word counts
function compareThings() {

  var a, b, i = arguments.length,
      j, m, mLen, n, nLen;
  var word, score, result = [];

  if (i < 2) return;

  // For each array
  while (i--) {
    a = arguments[i];
    j = i;

    // Compare with every other array
    while (j--) {
      b = arguments[j];
      score = 0;

      // For each word in array
      for (m=0, mLen = b.length; m<mLen; m++) {
        word = b[m].word

        // Compare with each word in other array
        for (n=0, nLen=a.length; n<nLen; n++) {

          // Add to score
          if (a[n].word == word) {
            score += a[n].count * b[m].count;
          }
        }
      }

      // Put score in result
      result.push(i + '-' + j + ':' + score);
    }
  }
  return result;
}

var results = compareThings(array1, array2, array3, array4);

alert('Raw results:\n' + results.join('\n'));
/*
Raw results:
3-2:65
3-1:74
3-0:25
2-1:65
2-0:30
1-0:25
*/

results.sort(function(a, b) {
  a = a.split(':')[1];
  b = b.split(':')[1];
  return b - a;
});

alert('Sorted results:\n' + results.join('\n'));
/*
Sorted results:
3-1:74
3-2:65
2-1:65
2-0:30
3-0:25
1-0:25
*/

So 3-1 (array4 and array2) have the highest score. Fortunately the comparison need only be one way, you don't have to compare a to b and b to a.

3 Comments

Thanks RobG. Any specific reason why you're calculating similarities by multiplying the weights instead of subtracting them like in other suggestions provided here? I like it because it does what I want for the cases I tested, but it's as if this number was arbitrary and unpredictable. For example, if you have two arrays with one identical word but with huge weights in one of the arrays, it will result to be more similar to one that has more similar words with less weights. Still, this is a good start and I thank you for your effort.
I suppose whether "weights" are added or multiplied depends on you background. In the statistical analysis work I've done, weights are like probabilities so values are multiplied by them. Some real world examples are sailing handicaps (where races are of varying lengths and conditions so elapsed times are multiplied by a handicap) and adjusting survey control networks, where each measurement has a different accuracy (e.g. +-10mm) and so has a different weight in the adjustment.
I understand, it certainly depends on the approach I want to take. Thanks, RobG.
1

Here is an attempt. The algorithm is not very smart (a difference > 20 is the same as not having the same words), but could be a useful start:

var wordArrays = [
    [{"word":"hill","count":5},{"word":"head","count":5}]
  , [{"word":"valley","count":7},{"word":"head","count":5}]
  , [{"word":"head", "count": 6}, {"word": "valley", "count": 5}]
  , [{"word": "valley", "count": 7}, {"word":"head", "count": 5}]
]

function getSimilarTo(index){
    var src = wordArrays[index]
      , values

    if (!src) return null;

    // compare with other arrays
    weighted = wordArrays.map(function(arr, i){
        var diff = 0
        src.forEach(function(item){
            arr.forEach(function(other){
                if (other.word === item.word){
                    // add the absolute distance in count
                    diff += Math.abs(item.count - other.count)
                } else {
                    // mismatches
                    diff += 20
                }
            })
        })
        return {
            arr   : JSON.stringify(arr)
          , index : i
          , diff  : diff
        }
    })

    return weighted.sort(function(a,b){
        if (a.diff > b.diff) return 1
        if (a.diff < b.diff) return -1
        return 0
    })
}

/*
getSimilarTo(3)
[ { arr: '[{"word":"valley","count":7},{"word":"head","count":5}]',
    index: 1,
    diff: 100 },
  { arr: '[{"word":"valley","count":7},{"word":"head","count":5}]',
    index: 3,
    diff: 100 },
  { arr: '[{"word":"head","count":6},{"word":"valley","count":5}]',
    index: 2,
    diff: 103 },
  { arr: '[{"word":"hill","count":5},{"word":"head","count":5}]',
    index: 0,
    diff: 150 } ]
*/

Comments

1

Sort the arrays by word before attempting comparison. Once this is complete, comparing two arrays will require exactly 1 pass through each array.

After sorting the arrays, here is a compare algorithm (psuedo-java):


int compare(array1, array2)
{
  returnValue = 0;
  array1Index = 0
  array2Index = 0;

  while (array1Index < array1.length)
  {
    if (array2Index < array2.length)
    {
      if (array1[array1Index].word == array2[array2Index].word) // words match.
      {
        returnValue += abs(array1[array1Index].count - array2[array2Index].count);
        ++array1Index;
        ++array2Index;
      }
      else // account for the unmatched array2 word.
      {
        // 100 is just a number to give xtra weight to unmatched numbers.
        returnValue += 100 + array2[array2Index].count;
        ++array2Index;
      }
    }
    else // array2 empty and array1 is not empty.
    {
      // 100 is just a number to give xtra weight to unmatched numbers.
      returnValue += 100 + array1[array1Index].count;
    }
  }

  // account for any extra unmatched array 2 values.
  while (array2Index < array2.length)
  {
      // 100 is just a number to give xtra weight to unmatched numbers.
      returnValue += 100 + array2[array2Index].count;
  }

  return returnValue;
}

1 Comment

DwB, thanks for your answer! Your method is intriguing in the sense that it allows the algorithm to traverse each array only once. But what I don't see in this implementation is, what happens when you don't find a word in array2? You will keep going to the inner else statement until the first if condition fails and you get out of the while loop without finding a match, even though you haven't tried any of the other words in array1. In fact, this comparison fails in that case because it will stay in an infinite loop. Thanks for your suggestion at this point though, it's a very helpful start.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.