1

In Algorithm Design Manual, edit distance is solved by the following algorithm

#define INSERT    1       /* enumerated type symbol for insert */
#define DELETE    2       /* enumerated type symbol for delete */

int string_compare(char *s, char *t, int i, int j)
{
        int k;                  /* counter */
        int opt[3];             /* cost of the three options */
        int lowest_cost;        /* lowest cost */

        if (i == 0) return(j * indel(' '));
        if (j == 0) return(i * indel(' '));

        opt[MATCH] = string_compare(s,t,i-1,j-1) + match(s[i],t[j]);
        opt[INSERT] = string_compare(s,t,i,j-1) + indel(t[j]);
        opt[DELETE] = string_compare(s,t,i-1,j) + indel(s[i]);

        lowest_cost = opt[MATCH];
        for (k=INSERT; k<=DELETE; k++)
                if (opt[k] < lowest_cost) lowest_cost = opt[k];

        return( lowest_cost );
}

I understand everything up to this point but am struggling to understand the following section where substring matching and longest common subsequence are solved as variations of the edit distance problem. I believe I kind of understand the intuition behind them, where the least amount of edits means preserving the "sequences of interest". In the case of substring matching, it is the substring; in the case of the longest common subsequence, it is that common subsequence. However, I don't understand how exactly each problem is solved.

For substring matching, following changes are made:

row_init(int i)
{
    m[0][i].cost = 0; /* note change */
    m[0][i].parent = -1; /* note change */
}
goal_cell(char *s, char *t, int *i, int *j)
{
    int k; /* counter */
    *i = strlen(s) - 1;
    *j = 0;
    for (k=1; k<strlen(t); k++)
        if (m[*i][k].cost < m[*i][*j].cost) *j = k;
    }
}

For longest common subsequence, the following change is made:

int match(char c, char d)
{
    if (c == d) return(0);
    else return(MAXLEN);
}

Would someone care to explain and help me understand this better?

1
  • For LCS: You can think of any edit distance computation as aligning the two strings, one on top of the other, by inserting special "gap" characters into both sequences in a way that maximises the score, which typically means in a way that causes as many columns as possible to contain "matching" characters. match() now penalises something so hard that it will never appear in any chosen solution. What does this imply about the resulting alignment? Where can gaps still appear? Commented Jan 4, 2022 at 0:28

1 Answer 1

1

There is an explanation of the substring matching problem in Section 21.4 of the book.

Explanation Screenshot -> https://i.sstatic.net/N56LP.png

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.