Skip to main content
2 of 10
added 23 characters in body

Finding the longest common subsequence algorithm using hash table Slow

I've designed an algorithm to find the longest common subsequence. These are the steps:

Starts with i = 0

  1. Picks the first letter from the first string start from ith letter.
  2. Goes to the second string looking for that picked letter.
  3. If not found returns to the first string and picks the next letter and repeats 1 to 3 until it finds a letter that is in the second string.
  4. Now that found a common letter in the second string, adds it to common_subsequence.
  5. Stores its position in index.
  6. Picks next letter from the first string and do step 2 but this time starts from $index.
  7. Repeats 3 to 6 until it reaches to the end of string 1 or string 2.
  8. If the length of common_subsequence is greater than the length of common subsequence that found so far, assigns common_subsequence to lcs.
  9. Adds 1 to i and repeats 1 to 9 until i is equal to the length of the first string.

Here is an example:
‫‪

X=A, B, C, B, D, A, B‬‬  
‫‪Y=B, D, C, A, B, A‬‬ 
  1. First picks A.
  2. Look for A in Y.
  3. Now that found A adds it to common_subsequence.
  4. Picks B from X.
  5. Looks for B in Y but this time starts searching from A.
  6. Picks C. It dosen't exist in string 2, so picks the next letter in X that is B.
    ...
    ...
    ...

The complexity of this algorithm is theta(n*m).

I implemented it on the two methods. The second one uses a hash table, but after implementing I found that it's much slower compared to the first algorithm. I cant undrestand why.

Here is my implementation:

First algorithm:

import time
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if string is empty
    lcs = [''] #  longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        start = 0 # start position in ystr
        for item in xstr[i:]:
            index = ystr.find(item, start) # position at the common letter
            if index != -1: # if common letter is found
                cs += item # add common letter to the cs
                start = index + 1
            if index == len(ystr) - 1: break # if reached to the end of ystr
        # updates lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs

file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()

start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

Second one using hash table:

import time
from collections import defaultdict
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if strings are empty
    lcs = [''] #  longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    location = defaultdict(list) # keeps track of items in the ystr
    i = 0
    for k in ystr:
        location[k].append(i)
        i += 1
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        index = -1
        reached_index = defaultdict(int)
        for item in xstr[i:]:
            for new_index in location[item][reached_index[item]:]:
                reached_index[item] += 1
                if index < new_index:
                    cs += item # add item to the cs
                    index = new_index
                    break
            if index == len(ystr) - 1: break # if reached to the end of ystr
        # update lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs

file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()

start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed