1

I have a function which takes two arrays containing the tokens/words of two texts and gives out the cosine similarity value which shows the relationship between both texts.

The function takes an array $tokensA (0=>house, 1=>bike, 2=>man) and an array $tokensB (0=>bike, 1=>house, 2=>car) and calculates the similarity which is given back as a floating point value.

function cosineSimilarity($tokensA, $tokensB) {
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();
    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;
    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y;
        $b += $x;
        $c += $y;
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}

If I want to compare 75 texts with each other, I need to make 5,625 single comparisons to have all texts compared with each other.

Is it possible to use MySQL's spatial columns to reduce the number of comparisons?

I don't want to talk about my function or about ways to compare texts. Just about reducing the number of comparisons.

MySQL's spatial columns

  • You create spatial columns with: CREATE TABLE abc (clmnName TYPE)
  • possible types are listed here
  • here is how I select the data later [e.g. MultiPointFromText() or AsText()]
  • You insert values like this: INSERT INTO clmnName VALUES (GeomFromText('POINT(1 1)'))

But how do you use this for my problem?

PS: I'm looking for ways to reduce the number of comparisons with algorithms in this question. Vinko Vrsalovic told me that I should open another question for the spatial features.

2 Answers 2

1

While R-Trees in general can index data with arbitrary number of dimensions, MySQL spatial abilities are only limited to Geometry types (2 dimensions).

If your vectors are 2-dimensional and you can normalize them, then do the following:

  • Split the circle into twice the number of angles which fit your differences
  • Find the MBR of vectors with given cosine difference from the center of each sector
  • Find all vectors within the MBR
  • Do the fine filtering for exact difference.

In this case, however, it will be better just to precaculate the angle of the value and index it with a plain B-Tree index.

Sign up to request clarification or add additional context in comments.

8 Comments

I've added some details about my function and the vectors which the function takes. Do you think your approach is possible with these?
Since your vectors are located on the surface of the orthotope, it would be possible if you had a fixed number of dimensions (that is a fixed set of tokens) and MySQL would be able to build an R-Tree on this number of dimensions. Since neither of these is possible, this solution is not viable too.
So I can forget about this approach with MySQL's spatial features and look for another way? No possibility?
@marco92w: of course never say "never" but as for now I don't see a way to use MySQL's spatial abilities here.
You can certainly implement an R-Tree in code, either extending MySQL or in PHP (or a PECL extension). It's not a minor endeavour, for sure.
|
1

In fact you have only 75 * 74 / 2 = 2775 comparisons. You compare every word with 74 others, but you don't need to compare word1 with word2 and again word2 with word1. So it gives half of comparisons less.

1 Comment

Thanks, that's right. :) But it's still a lot. And I don't compare words but texts.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.