I've designed an algorithm to find the longest common subsequence. The steps are:
Start with i = 0
- Picks the first letter from the first string start from $i^{th}$ letter.
- Go to the second string looking for that picked letter.
- If not found return to the first string and picks the next letter and repeat 1 to 3 until it finds a letter that is in the second string.
- Now that found a common letter in the second string, adds that to
common_subsequence. - Store its position in
index. - Picks next letter from the first string and do step 2 but this time starts from
index. - Repeat 3 to 6 until reached end of string 1 or string 2.
- If length
common_subsequenceis greater than length of common subsequence so far add that changelcsto thecommon_subsequence. - Add 1 to the
iand repeat 1 to 9 whileiis less than the length of the first string.
Example
X = ABCBDAB Y = BDCABA
- First pick
A.- Look for
AinY.- Now that found
Aadd that to thecommon_subsequence.- Then pick
BfromX.- Look for
BinYbut this time start searching fromA.- Now pick
C. It isn't there in string 2, so pick the next letter inXthat isB.
The complexity of this algorithm is $\Theta(n\cdot m)$.
I implemented it using two methods. Here are my implementations.
- First
import time
def lcs(xstr, ystr):
if not (xstr and ystr): return # if string is empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
for i in xrange(len(xstr)):
cs = '' # common subsequence
start = 0 # start position in ystr
for item in xstr[i:]:
index = ystr.find(item, start) # position at the common letter
if index != -1: # if common letter has found
cs += item # add common letter to the cs
start = index + 1
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
- Second Algorithm using hash table
import time
from collections import defaultdict
def lcs(xstr, ystr):
if not (xstr and ystr): return # if strings are empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
location = defaultdict(list) # keeps track of items in the ystr
i = 0
for k in ystr:
location[k].append(i)
i += 1
for i in xrange(len(xstr)):
cs = '' # common subsequence
index = -1
reached_index = defaultdict(int)
for item in xstr[i:]:
for new_index in location[item][reached_index[item]:]:
reached_index[item] += 1
if index < new_index:
cs += item # add item to the cs
index = new_index
break
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
The second one uses a hash table, but after implementing I found that it's much slower compared to the first algorithm. Can someone elaborate why?