Algorithm for finding the longest common subsequence using hash table. Why is it slow?

Question

I've designed an algorithm to find the longest common subsequence. The steps are:

Start with i = 0

Picks the first letter from the first string start from $i^{th}$ letter.
Go to the second string looking for that picked letter.
If not found return to the first string and picks the next letter and repeat 1 to 3 until it finds a letter that is in the second string.
Now that found a common letter in the second string, adds that to common_subsequence.
Store its position in index.
Picks next letter from the first string and do step 2 but this time starts from index.
Repeat 3 to 6 until reached end of string 1 or string 2.
If length common_subsequence is greater than length of common subsequence so far add that change lcs to the common_subsequence.
Add 1 to the i and repeat 1 to 9 while i is less than the length of the first string.

Example‫‪
X = ABCBDAB‬‬  
‫‪Y = BDCABA‬‬ 
First pick A.

Look for A in Y.

Now that found A add that to the common_subsequence.

Then pick B from X.

Look for B in Y but this time start searching from A.

Now pick C. It isn't there in string 2, so pick the next letter in X that is B.

The complexity of this algorithm is $\Theta(n\cdot m)$.

I implemented it using two methods. Here are my implementations.

First

import time
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if string is empty
    lcs = [''] #  longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        start = 0 # start position in ystr
        for item in xstr[i:]:
            index = ystr.find(item, start) # position at the common letter
            if index != -1: # if common letter has found
                cs += item # add common letter to the cs
                start = index + 1
            if index == len(ystr) - 1: break # if reached end of the ystr
        # update lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs

file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()

start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

Second Algorithm using hash table

import time
from collections import defaultdict
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if strings are empty
    lcs = [''] #  longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    location = defaultdict(list) # keeps track of items in the ystr
    i = 0
    for k in ystr:
        location[k].append(i)
        i += 1
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        index = -1
        reached_index = defaultdict(int)
        for item in xstr[i:]:
            for new_index in location[item][reached_index[item]:]:
                reached_index[item] += 1
                if index < new_index:
                    cs += item # add item to the cs
                    index = new_index
                    break
            if index == len(ystr) - 1: break # if reached end of the ystr
        # update lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs

file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()

start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

The second one uses a hash table, but after implementing I found that it's much slower compared to the first algorithm. Can someone elaborate why?

This might be better suited to stackoverflow.com, or better yet, to Code Review. To me, their FAQ seems to allow such questions as on topic. — Paresh
– Paresh, Commented Jan 9, 2013 at 16:48
The algorithm described here is cubic - O(m^2 * n) - not quadratic. The net is choc full not only with descriptions of the standard O(m*n) algorithm but also with explanations how to derive the algorithm from first principles... The most important optimisation here is to consult an introductory algorithm text or any of the gazillion resources out there on the 'net (including Rosetta Code, which has LCS implementations in plenty of languages). — DarthGizka
– DarthGizka, Commented Mar 24, 2016 at 15:39

Yuval Filmus · Accepted Answer · 2013-01-09 20:29:18Z

1

One optimization is to replace the access to reached_index[item] by a local variable which is initialized from the hash table at the beginning of the for item loop, and stored back to the hash table at the end of the loop.

Another optimization which will speed up things is to replace the hash tables with arrays of size 256.

answered Jan 9, 2013 at 20:29

Yuval Filmus

281k27 gold badges320 silver badges518 bronze badges

Add a comment |

Stack Exchange Network

Algorithm for finding the longest common subsequence using hash table. Why is it slow?

1 Answer 1

Your Answer

Hot Network Questions

Algorithm for finding the longest common subsequence using hash table. Why is it slow?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Related

Hot Network Questions