In Algorithm Design Manual, edit distance is solved by the following algorithm
#define INSERT 1 /* enumerated type symbol for insert */
#define DELETE 2 /* enumerated type symbol for delete */
int string_compare(char *s, char *t, int i, int j)
{
int k; /* counter */
int opt[3]; /* cost of the three options */
int lowest_cost; /* lowest cost */
if (i == 0) return(j * indel(' '));
if (j == 0) return(i * indel(' '));
opt[MATCH] = string_compare(s,t,i-1,j-1) + match(s[i],t[j]);
opt[INSERT] = string_compare(s,t,i,j-1) + indel(t[j]);
opt[DELETE] = string_compare(s,t,i-1,j) + indel(s[i]);
lowest_cost = opt[MATCH];
for (k=INSERT; k<=DELETE; k++)
if (opt[k] < lowest_cost) lowest_cost = opt[k];
return( lowest_cost );
}
I understand everything up to this point but am struggling to understand the following section where substring matching and longest common subsequence are solved as variations of the edit distance problem. I believe I kind of understand the intuition behind them, where the least amount of edits means preserving the "sequences of interest". In the case of substring matching, it is the substring; in the case of the longest common subsequence, it is that common subsequence. However, I don't understand how exactly each problem is solved.
For substring matching, following changes are made:
row_init(int i)
{
m[0][i].cost = 0; /* note change */
m[0][i].parent = -1; /* note change */
}
goal_cell(char *s, char *t, int *i, int *j)
{
int k; /* counter */
*i = strlen(s) - 1;
*j = 0;
for (k=1; k<strlen(t); k++)
if (m[*i][k].cost < m[*i][*j].cost) *j = k;
}
}
For longest common subsequence, the following change is made:
int match(char c, char d)
{
if (c == d) return(0);
else return(MAXLEN);
}
Would someone care to explain and help me understand this better?
match()now penalises something so hard that it will never appear in any chosen solution. What does this imply about the resulting alignment? Where can gaps still appear?