How can I measure similarity-percentage between two sequences of strings?

I have two text files, and in each file the sequences are written like this:

First file:

AAA BBB DDD CCC GGG MMM AAA MMM

Second file:

BBB DDD CCC MMM AAA MMM

How can I measure the similarity between these two files in terms of the order of the strings?

For example, the two files above are similar because the order of the strings is the same, although some strings are missing from the second file. Which algorithm is best suited to this problem, so that I can measure how similar the order of the strings is, rather than their frequency, in the two files?

2 Answers

You could use the Levenshtein distance algorithm. It counts the minimum number of edits (insertions, deletions, and substitutions) needed to transform one string into another. This article explains it pretty well, and a sample implementation is provided.

From CodeProject:

1.  Set n to be the length of s. ("GUMBO")
    Set m to be the length of t. ("GAMBOL")
    If n = 0, return m and exit.
    If m = 0, return n and exit.
    Construct two vectors, v0[m+1] and v1[m+1], each with indices 0..m.
2.  Initialize v0 to 0..m.
3.  Examine each character of s (i from 1 to n). At the start of each pass, set v1[0] to i.
4.  Examine each character of t (j from 1 to m).
5.  If s[i] equals t[j], the cost is 0.
    If s[i] is not equal to t[j], the cost is 1.
6.  Set cell v1[j] equal to the minimum of:
    a. The cell immediately above plus 1: v1[j-1] + 1.
    b. The cell immediately to the left plus 1: v0[j] + 1.
    c. The cell diagonally above and to the left plus the cost: v0[j-1] + cost.
7.  After each pass over t (steps 4, 5, 6), copy v1 into v0.
8.  After the iteration steps (3 to 7) are complete, the distance is found in the cell v1[m].
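
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the CodeProject implementation: it works on any sequence, so passing lists of whitespace-separated tokens compares whole strings such as AAA and BBB instead of single characters. The helper order_similarity and its normalization (one minus the distance divided by the longer length) are assumptions made here to get a percentage; other normalizations are equally reasonable.

```python
def levenshtein(s, t):
    """Edit distance between two sequences, keeping only two rows in memory."""
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    v0 = list(range(m + 1))  # distances computed for the previous element of s
    v1 = [0] * (m + 1)       # distances being computed for the current element
    for i in range(1, n + 1):
        v1[0] = i            # reducing s[:i] to an empty sequence takes i deletions
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            v1[j] = min(v1[j - 1] + 1,     # insertion
                        v0[j] + 1,         # deletion
                        v0[j - 1] + cost)  # match or substitution
        v0, v1 = v1, v0      # the current vector becomes the previous one
    return v0[m]             # after the swap, v0 holds the last computed vector


def order_similarity(a, b):
    """Hypothetical normalization: 100 * (1 - distance / longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


# Token lists built from the two example files in the question
file1 = "AAA BBB DDD CCC GGG MMM AAA MMM".split()
file2 = "BBB DDD CCC MMM AAA MMM".split()

print(levenshtein("GUMBO", "GAMBOL"))
print(levenshtein(file1, file2))
print(order_similarity(file1, file2))
```

For the two example files, the only edits needed are deleting the leading AAA and the GGG from the first file, so the token-level distance is 2.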

You can use Python's SequenceMatcher.ratio() method from difflib, which measures the similarity of two sequences as a float in the range [0, 1]. If T is the total number of elements in both sequences and M is the number of matches, the ratio is 2.0 * M / T. The main code is as follows:

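from difflib import SequenceMatcher

text1 = 'AAA BBB DDD CCC GGG MMM AAA MMM'
text2 = 'BBB DDD CCC MMM AAA MMM'
# Split into tokens so that whole strings are compared, not single characters
s = SequenceMatcher(None, text1.split(), text2.split())
similarity = s.ratio() * 100  # percentage in the range [0, 100]
print(similarity)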
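
To make the 2.0 * M / T formula concrete, here is a small sketch that recomputes the ratio by hand from the matching blocks that SequenceMatcher reports. It assumes the two lines are split into whitespace-separated tokens; the variable names are only illustrative. The factor 2.0 is there because every matched element occurs once in each sequence, so M matches cover 2 * M of the T elements, and two identical sequences score exactly 1.0.

```python
from difflib import SequenceMatcher

a = "AAA BBB DDD CCC GGG MMM AAA MMM".split()
b = "BBB DDD CCC MMM AAA MMM".split()

matcher = SequenceMatcher(None, a, b)

# M: total size of the matching blocks (the final block is a size-0 sentinel)
M = sum(block.size for block in matcher.get_matching_blocks())
# T: number of elements in both sequences taken together
T = len(a) + len(b)

manual_ratio = 2.0 * M / T
print(M, T, manual_ratio, matcher.ratio())
```

Here the matched tokens are BBB DDD CCC and MMM AAA MMM, so M is 6 and T is 14.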

I hope this helps!

1 Comment

What does the number 2.0 refer to?
