Comparing arrays of strings for similarity

Question

I have available to me hundreds of JSON strings. Each of these contains an array of 15-20 words sorted by some predetermined weight. This weight, if it's worth noting, is the amount of times these words are found in some chunk of text. What's the best way of finding similarity between arrays of words that are structured like this?

First idea that came to my head was to create a numerical hash of all the words together and basically compare these values to determine similarity. I wasn't very successful with this, since the resulting hash values of very similar strings were not very close. After some research regarding string comparison algorithms, I come to Stackoverflow in hopes of receiving more guidance. Thanks in advance, and please let me know if you need more details of the problem.

Edit 1: Clarifying what I'm trying to do: I want to determine how similar two arrays are according to the words each of these have. I would also like to take into consideration the weight each word carries in each array. For example:

var array1 = [{"word":"hill","count":5},{"word":"head","count":5}];
var array2 = [{"word":"valley","count":7},{"word":"head","count":5}];
var array3 = [{"word":"head", "count": 6}, {"word": "valley", "count": 5}];
var array4 = [{"word": "valley", "count": 7}, {"word":"head", "count": 5}];

In that example, array 4 and array 2 are more similar than array 2 and array 3 because, even though both have the same words, the weight is the same for both of them in array 4 and 2. I hope that makes it a little bit easier to understand. Thanks in advance.

So you have N arrays with Nm words each, and you want to determine what exactly? — Rusty Fausak
– Rusty Fausak, Commented Sep 26, 2011 at 1:23
I edited my original post with some clarification. Hope that helps and thanks for interest. — Xavier E. López
– Xavier E. López, Commented Sep 26, 2011 at 1:34
Hi Itay, I'm not interested in how two words are similar, I'm interested in how two arrays are similar, in the fact that they share some words. — Xavier E. López
– Xavier E. López, Commented Sep 26, 2011 at 1:42
Are you looking for an algorithm to determine the overall similarity of two strings, or are you looking for how to loop through all the arrays to do this comparison? Or both? — Keith Twombley
– Keith Twombley, Commented Sep 26, 2011 at 1:51

kc2001 · Accepted Answer · 2014-11-10 13:05:44Z

3

I think that what you want is "cosine similarity", and you might also want to look at vector space models. If you are coding In Java, you can use the open source S-space package.

(added on 31 Oct) Each element of the vector is the count of one particular string. You just need to transform your arrays of strings into such vectors. In your example, you have three words - "hill", "head", "valley". If your vector is in that order, the vectors corresponding to the arrays would be

// array: #hill, #head, #valley
array1:  {5,     5,     0}
array2:  {0,     5,     7}
array3:  {0,     6,     5}
array4:  {0,     5,     7}

edited Nov 10, 2014 at 13:05

answered Oct 3, 2011 at 0:27

kc2001

5,2495 gold badges58 silver badges103 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Xavier E. López Over a year ago

Thank you for the suggestion. Even though this is very helpful and interesting material, in this scenario I am not interested in comparing the similarity of the strings themselves. I only care if they're the same or not. In this case, I'm comparing similarity of arrays of strings.

kc2001 Over a year ago

@Xavier - Yes, that is what cosine similarity does. Each element of the vector is the count of one particular string. You just need to transform your array of strings into such a vector. In your example, you have three words - "hill", "head", "valley". If your vector is in that order, the vector corresponding to array1 would be {5, 5, 0}.

Xavier E. López Over a year ago

Interesting, kc2001. Thanks for getting back to me. I still don't understand completely, I have to admit. In the case you explained, how would a vector containing only counts helps me to compare arrays? In other words, where in that vector is the information that contains what the actual string is and not just the count of the string? I saw some examples researching the web where they make an alphabet of strings [abcde] and then a vector based on the union of characters between two strings. These two vectors are then compared using cosine similarity Are you suggesting a similar approach here?

Xavier E. López Over a year ago

I think I understand now. I have to perform an union operation between arrays, where each element is a word. Once I have a single vector with all the words in both arrays, I then convert each word array into it by replacing words that aren't in the array with a 0, and using the count otherwise. Once I'm done with that, I should have two vectors on which I can perform cosine similarity. Is this the right approach?

kc2001 Over a year ago

@Xavier - Yes, that's basically it.

RobG · Accepted Answer · 2011-09-26 03:16:34Z

1

Given that each array has to be compared to every other array, you are looking at a serious amount of processing along the lines of ∑(n-1) times the average number of "words" in each array. You'll need to store the score for each comparison, then make some sense of it.

e.g.

var array1 = [{"word":"hill","count":5},{"word":"head","count":5}];
var array2 = [{"word":"valley","count":7},{"word":"head","count":5}];
var array3 = [{"word":"head", "count": 6}, {"word": "valley", "count": 5}];
var array4 = [{"word": "valley", "count": 7}, {"word":"head", "count": 5}];

// Comparison score is summed product of matching word counts
function compareThings() {

  var a, b, i = arguments.length,
      j, m, mLen, n, nLen;
  var word, score, result = [];

  if (i < 2) return;

  // For each array
  while (i--) {
    a = arguments[i];
    j = i;

    // Compare with every other array
    while (j--) {
      b = arguments[j];
      score = 0;

      // For each word in array
      for (m=0, mLen = b.length; m<mLen; m++) {
        word = b[m].word

        // Compare with each word in other array
        for (n=0, nLen=a.length; n<nLen; n++) {

          // Add to score
          if (a[n].word == word) {
            score += a[n].count * b[m].count;
          }
        }
      }

      // Put score in result
      result.push(i + '-' + j + ':' + score);
    }
  }
  return result;
}

var results = compareThings(array1, array2, array3, array4);

alert('Raw results:\n' + results.join('\n'));
/*
Raw results:
3-2:65
3-1:74
3-0:25
2-1:65
2-0:30
1-0:25
*/

results.sort(function(a, b) {
  a = a.split(':')[1];
  b = b.split(':')[1];
  return b - a;
});

alert('Sorted results:\n' + results.join('\n'));
/*
Sorted results:
3-1:74
3-2:65
2-1:65
2-0:30
3-0:25
1-0:25
*/

So 3-1 (array4 and array2) have the highest score. Fortunately the comparison need only be one way, you don't have to compare a to b and b to a.

edited Sep 26, 2011 at 3:16

answered Sep 26, 2011 at 3:06

RobG

148k32 gold badges180 silver badges216 bronze badges

3 Comments

Xavier E. López Over a year ago

Thanks RobG. Any specific reason why you're calculating similarities by multiplying the weights instead of subtracting them like in other suggestions provided here? I like it because it does what I want for the cases I tested, but it's as if this number was arbitrary and unpredictable. For example, if you have two arrays with one identical word but with huge weights in one of the arrays, it will result to be more similar to one that has more similar words with less weights. Still, this is a good start and I thank you for your effort.

RobG Over a year ago

I suppose whether "weights" are added or multiplied depends on you background. In the statistical analysis work I've done, weights are like probabilities so values are multiplied by them. Some real world examples are sailing handicaps (where races are of varying lengths and conditions so elapsed times are multiplied by a handicap) and adjusting survey control networks, where each measurement has a different accuracy (e.g. +-10mm) and so has a different weight in the adjustment.

Xavier E. López Over a year ago

I understand, it certainly depends on the approach I want to take. Thanks, RobG.

Ricardo Tomasi · Accepted Answer · 2011-09-26 03:30:42Z

Here is an attempt. The algorithm is not very smart (a difference > 20 is the same as not having the same words), but could be a useful start:

var wordArrays = [
    [{"word":"hill","count":5},{"word":"head","count":5}]
  , [{"word":"valley","count":7},{"word":"head","count":5}]
  , [{"word":"head", "count": 6}, {"word": "valley", "count": 5}]
  , [{"word": "valley", "count": 7}, {"word":"head", "count": 5}]
]

function getSimilarTo(index){
    var src = wordArrays[index]
      , values

    if (!src) return null;

    // compare with other arrays
    weighted = wordArrays.map(function(arr, i){
        var diff = 0
        src.forEach(function(item){
            arr.forEach(function(other){
                if (other.word === item.word){
                    // add the absolute distance in count
                    diff += Math.abs(item.count - other.count)
                } else {
                    // mismatches
                    diff += 20
                }
            })
        })
        return {
            arr   : JSON.stringify(arr)
          , index : i
          , diff  : diff
        }
    })

    return weighted.sort(function(a,b){
        if (a.diff > b.diff) return 1
        if (a.diff < b.diff) return -1
        return 0
    })
}

/*
getSimilarTo(3)
[ { arr: '[{"word":"valley","count":7},{"word":"head","count":5}]',
    index: 1,
    diff: 100 },
  { arr: '[{"word":"valley","count":7},{"word":"head","count":5}]',
    index: 3,
    diff: 100 },
  { arr: '[{"word":"head","count":6},{"word":"valley","count":5}]',
    index: 2,
    diff: 103 },
  { arr: '[{"word":"hill","count":5},{"word":"head","count":5}]',
    index: 0,
    diff: 150 } ]
*/

DwB · Accepted Answer · 2011-10-03 00:58:45Z

1

Sort the arrays by word before attempting comparison. Once this is complete, comparing two arrays will require exactly 1 pass through each array.

After sorting the arrays, here is a compare algorithm (psuedo-java):


int compare(array1, array2)
{
  returnValue = 0;
  array1Index = 0
  array2Index = 0;

  while (array1Index < array1.length)
  {
    if (array2Index < array2.length)
    {
      if (array1[array1Index].word == array2[array2Index].word) // words match.
      {
        returnValue += abs(array1[array1Index].count - array2[array2Index].count);
        ++array1Index;
        ++array2Index;
      }
      else // account for the unmatched array2 word.
      {
        // 100 is just a number to give xtra weight to unmatched numbers.
        returnValue += 100 + array2[array2Index].count;
        ++array2Index;
      }
    }
    else // array2 empty and array1 is not empty.
    {
      // 100 is just a number to give xtra weight to unmatched numbers.
      returnValue += 100 + array1[array1Index].count;
    }
  }

  // account for any extra unmatched array 2 values.
  while (array2Index < array2.length)
  {
      // 100 is just a number to give xtra weight to unmatched numbers.
      returnValue += 100 + array2[array2Index].count;
  }

  return returnValue;
}

answered Oct 3, 2011 at 0:58

DwB

38.6k11 gold badges62 silver badges86 bronze badges

1 Comment

Xavier E. López Over a year ago

DwB, thanks for your answer! Your method is intriguing in the sense that it allows the algorithm to traverse each array only once. But what I don't see in this implementation is, what happens when you don't find a word in array2? You will keep going to the inner else statement until the first if condition fails and you get out of the while loop without finding a match, even though you haven't tried any of the other words in array1. In fact, this comparison fails in that case because it will stay in an infinite loop. Thanks for your suggestion at this point though, it's a very helpful start.

Collectives™ on Stack Overflow

Comparing arrays of strings for similarity

4 Answers 4

5 Comments

3 Comments

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

5 Comments

3 Comments

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related