I'm building a website that should collect various news feeds and would like the texts to be compared for similarity. What i need is some sort of a news text similarity algorithm. I know that php has the similar_text function and am not sure how good it is + i need it for javascript. So if anyone could point me to an example or a plugin or any instruction on how this is possible or at least where to look and start investigating.
-
4Why do you specifically need it in JS? You realize this will have to happen when users visit the site, it wont be something that you could necessarily run a cron job on and save on the server (well not as easily as a serverside language)Loktar– Loktar2011-02-18 15:05:33 +00:00Commented Feb 18, 2011 at 15:05
-
3@Loktar - There is serverside javascript too. :) And ofcourse it could be done in (clientside) Javascript as well by just retreiving the feeds and merge them on the client. It will save you a server that supports server side scripting.GolezTrol– GolezTrol2011-02-18 15:09:15 +00:00Commented Feb 18, 2011 at 15:09
-
Yeah but something like this is traditionally done on the server, and would be much faster.. plus you could do it once, cache the results and serve to new users. I doubt the OP was referring to something like NodeJS :PLoktar– Loktar2011-02-18 15:14:49 +00:00Commented Feb 18, 2011 at 15:14
-
The point is that there is an admin panel where admins group similar news together. I have quite a lot of options to make it easier for them but i need the use of when they select a title it compares its text to all other news texts and stick's out the most likely to be similar.fswd– fswd2011-02-18 15:20:52 +00:00Commented Feb 18, 2011 at 15:20
3 Answers
There's a javascript implementation of the Levenshtein distance metric, which is often used for text comparisons. If you want to compare whole articles or headlines though you might be better off looking at intersections between the sets of words that make up the text (and frequencies of those words) rather than just string similarity measures.
Comments
The question whether two texts are similar is a philosophical one as long as you don't specify exactly what it should mean. Consider the Strings "house" and "mouse". Seen from a semantic level they are not very similar, but they are very similar regarding their "physical appearance", because only one letter is different (and in this case you could go by Levenshtein distance).
To decide about similarity you need an appropriate text representation. You could – for instance – extract and count all n-grams and compare the two resulting frequency-vectors using a similarity measure as e.g. cosine similarity. Or you could stem the words to their root form after having removed all stopwords, sum up their occurrences and use this as input for a similarity measure.
There are plenty approaches and papers about that topic, e.g. this one about short texts. In any case: The higher the abstraction level where you want to decide if two texts are similar the more difficult it will get. I think your question is a non-trivial one (and hence my answer rather abstract) ... ;-)
Comments
In case anyone needs it, here's a basic implementation of an n-gram-similarity using modern browser Set functions for simplicity (and hopefully performance). Written in TypeScript.
// generate n-grams
const nGrams = (value: string, n = 3) => {
const pad = " ".repeat(n - 1);
value = pad + value + pad;
return Array(value.length - n + 1)
.fill("")
.map((_, index) => value.slice(index, index + n));
};
// calculate similarity score
const nGramSimilarity = (stringA: string, stringB: string, n = 3) => {
if (stringA === stringB) return 1;
const a = new Set(nGrams(stringA, n));
const b = new Set(nGrams(stringB, n));
const common = a.intersection(b);
const total = a.union(b);
return common.size / (total.size || Infinity);
};
References:
- https://github.com/StephanGeorg/trigram-similarity/blob/main/src/index.js
- PostgreSQL, trigrams and similarity
- https://github.com/words/n-gram/blob/main/index.js
- https://doxygen.postgresql.org/trgm__op_8c.html#ae8a02db8ac133296c3c825c110175dd4
A few things not clear to me from reading these references:
- Why is sorting the n-grams necessary when we're only comparing the number of entries in each set? Only place I could see sorting coming into play is if you wanted to have an ordered list of most-to-least-similar terms.
- Why are (in the case of tri-gram) 2 spaces put at the start, and only 1 at the end? In my code I just made the padding equal on both sides.
I think this is also an improvement over the existing trigram-similarity javascript library, which does some unnecessary operations
and couldn't make use of the new Set methods. In my testing with filtering a list of 400k items, the library took ~4.5s and my simplified version took ~1.5s.