Substring Matching and Longest Common Subsequence as Variation of Edit Distance Problem -- Skiena

Question

In Algorithm Design Manual, edit distance is solved by the following algorithm

#define INSERT    1       /* enumerated type symbol for insert */
#define DELETE    2       /* enumerated type symbol for delete */

int string_compare(char *s, char *t, int i, int j)
{
        int k;                  /* counter */
        int opt[3];             /* cost of the three options */
        int lowest_cost;        /* lowest cost */

        if (i == 0) return(j * indel(' '));
        if (j == 0) return(i * indel(' '));

        opt[MATCH] = string_compare(s,t,i-1,j-1) + match(s[i],t[j]);
        opt[INSERT] = string_compare(s,t,i,j-1) + indel(t[j]);
        opt[DELETE] = string_compare(s,t,i-1,j) + indel(s[i]);

        lowest_cost = opt[MATCH];
        for (k=INSERT; k<=DELETE; k++)
                if (opt[k] < lowest_cost) lowest_cost = opt[k];

        return( lowest_cost );
}

I understand everything up to this point but am struggling to understand the following section where substring matching and longest common subsequence are solved as variations of the edit distance problem. I believe I kind of understand the intuition behind them, where the least amount of edits means preserving the "sequences of interest". In the case of substring matching, it is the substring; in the case of the longest common subsequence, it is that common subsequence. However, I don't understand how exactly each problem is solved.

For substring matching, following changes are made:

row_init(int i)
{
    m[0][i].cost = 0; /* note change */
    m[0][i].parent = -1; /* note change */
}
goal_cell(char *s, char *t, int *i, int *j)
{
    int k; /* counter */
    *i = strlen(s) - 1;
    *j = 0;
    for (k=1; k<strlen(t); k++)
        if (m[*i][k].cost < m[*i][*j].cost) *j = k;
    }
}

For longest common subsequence, the following change is made:

int match(char c, char d)
{
    if (c == d) return(0);
    else return(MAXLEN);
}

Would someone care to explain and help me understand this better?

For LCS: You can think of any edit distance computation as aligning the two strings, one on top of the other, by inserting special "gap" characters into both sequences in a way that maximises the score, which typically means in a way that causes as many columns as possible to contain "matching" characters. match() now penalises something so hard that it will never appear in any chosen solution. What does this imply about the resulting alignment? Where can gaps still appear? — j_random_hacker
– j_random_hacker, Commented Jan 4, 2022 at 0:28

ChenChen · Accepted Answer · 2022-06-03 10:37:10Z

1

There is an explanation of the substring matching problem in Section 21.4 of the book.

Explanation Screenshot -> https://i.sstatic.net/N56LP.png

answered Jun 3, 2022 at 10:37

ChenChen

111 bronze badge

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Substring Matching and Longest Common Subsequence as Variation of Edit Distance Problem -- Skiena

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related