Suffix search - Python

Question

Here's a the problem, provided a list of strings and a document find the shortest substring that contains all the strings in the list.

Thus for:

document = "many google employees can program because google is a technology company that can program"
searchTerms = ['google', 'program', 'can']

the output should be:

"can program because google"  # 27 chars

and not:

"google employees can program"  # 29 chars
"google is a technology company that can program"  # 48 chars

Here's my approach, Split the document into suffix tree, check for all strings in each suffix return the one of the shortest length,

Here's my code

def snippetSearch(document, searchTerms):
    doc = document.split()
    suffix_array = create_suffix_array(doc)
    current = None
    current_len = sys.maxsize
    for suffix in suffix_array:
        if check_for_terms_in_array(suffix, searchTerms):
            if len(suffix) < current_len:
                current_len = len(suffix)
                current = suffix    

    return ' '.join(map(str, current))


def create_suffix_array(document):
    suffix_array = []
    for i in range(len(document)):
        sub = document[i:]
        suffix_array.append(sub)
    return suffix_array


def check_for_terms_in_array(arr, terms):
    for term in terms:
        if term not in arr:
            return False

    return True

This is an online submission and it's not passing one test case. I have no idea what the test case is though. My question is, is there anything logically incorrect with the code. Also is there a more efficient way of doing this.

You are making the assumption that substrings are to end at a word boundary. Is that covered by the question? Eg. does uncanned ravioli contain can;? — dhke
– dhke, Commented Aug 18, 2016 at 15:13
"shortest substring that contains all the strings in the list" - I don't see what this has to do with suffixes. — machine yearning
– machine yearning, Commented Aug 18, 2016 at 15:13
@CodeMonkey if this is CodeWars you can modify your script to print the test cases that are fed to your script. — Ma0
– Ma0, Commented Aug 18, 2016 at 15:17
It seems like the example test case you gave doesn't match the rules you described: ` "... program can google ..."` is shorter than your example output. Am I missing something there? — Aaron
– Aaron, Commented Aug 18, 2016 at 15:27
@CodeMonkey "better" is relative, I can give you a working brute force method ;-) — dhke
– dhke, Commented Aug 18, 2016 at 15:42

machine yearning · Accepted Answer · 2016-08-19 15:05:57Z

You can break this into two parts. First, finding the shortest substring that matches some property. We'll pretend we already have a function that tests for the property:

def find_shortest_ss(document, some_property):
    # First level of looping gradually increases substring length
    for x in range(len(document)):
        # Second level of looping tests current length at valid positions
        for y in range(max(len(document), len(document)-x)):
            if some_property(document[y:x+y]):
                return document[y:x+y]
    # How to handle the case of no match is undefined
    raise ValueError('No matching value found')

Now the property we want to test for itself:

def contains_all_terms(terms):
    return (lambda s: all(term in s for term in terms))

This lambda expression takes some terms and will return a function which, when evaluated on a string, returns true if and only if all the terms are in the string. This is basically a more terse version of a nested function definition which you could write like this:

def contains_all_terms(terms):
    def string_contains_them(s):
        return all(term in s for term in terms)
    return string_contains_them

So we're actually just returning the handle of the function we create dynamically inside of our contains_all_terms function

To piece this together we do like so:

>>> find_shortest_ss(document, contains_all_terms(searchTerms))
'program can google'

Some efficiency advantages which this code has:

The any builtin function has short-circuit evaluation, meaning that it will return False as soon as it finds a non-contained substring
It starts by checking all the shortest substrings, then proceeds to increase substring length one extra character length at a time. If it ever finds a satisfying substring it will exit and return that value. So you can guarantee the returned value will never be longer than necessary. It won't even be doing any operations on substrings longer than necessary.
8 lines of code, not bad I think

dhke · Accepted Answer · 2016-08-18 15:45:16Z

0

Well, brute force is O(n³), so why not:

from itertools import product

def find_shortest(doc, terms):
    doc = document.split()
    substrings = (
        doc[i:j]
        for i, j in product(range(0, len(doc)), range(0, len(doc)))
        if all(search_term in doc[i:j] for search_term in search_terms)
    )
    shortest = doc
    for candidate in substrings:
        if len(candidate) < len(shortest):
            shortest = candidate
    return shortest.
document = 'many google employees can program can google employees because google is a technology company that writes program'
search_terms = ['google', 'program', 'can']

print find_shortest(document, search_terms)
>>>> ['program', 'can', 'google']

You can probably do this a lot faster, though. For example, any relevant substring can only end with one of the keywords

answered Aug 18, 2016 at 15:45

dhke

15.5k2 gold badges43 silver badges60 bronze badges

2 Comments

user6461129 Over a year ago

This isn't passing the following case, document = 'many google employees can program' searchTerms = ['google', 'program']

dhke Over a year ago

@CodeMonkey Inwhatsofar? I get ['google', 'employees', 'can']?

Aaron · Accepted Answer · 2016-08-18 16:21:27Z

0

Instead of brute forcing all possible sub-strings, I brute forced all possible matching word positions... It should be a bit faster..

import numpy as np
from itertools import product


document = 'many google employees can program can google employees because google is a technology company that writes program'
searchTerms = ['google', 'program']

word_lists = []

for word in searchTerms: 
    word_positions = []
    start = 0  #starting index of str.find()
    while 1:
        start = document.find(word, start, -1)
        if start == -1:  #no more instances
            break
        word_positions.append([start, start+len(word)])  #beginning and ending index of search term
        start += 1  #increment starting search postion
    word_lists.append(word_positions)  #add all search term positions to list of all search terms

minLen = len(document)
lower = 0
upper = len(document)
for p in product(*word_lists):  #unpack word_lists into word_positions
    indexes = np.array(p).flatten()  #take all indices into flat list
    lowerI = np.min(indexes)
    upperI = np.max(indexes)
    indexRange = upperI - lowerI  #determine length of substring
    if indexRange < minLen: 
        minLen = indexRange
        lower = lowerI
        upper = upperI

print document[lower:upper]

answered Aug 18, 2016 at 16:21

Aaron

11.2k1 gold badge27 silver badges43 bronze badges

1 Comment

machine yearning Over a year ago

Personally I wouldn't involve numpy here, it seems out of place. Numpy is built for fast operations on numerical data sets in matrices.

Collectives™ on Stack Overflow

Suffix search - Python

3 Answers 3

Comments

2 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

2 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related