
I have a list with many words (100,000+), and what I'd like to do is remove every word that is a substring of another word in the list.

So for simplicity, let's imagine that I have the following list:

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']

The desired output is the following:

['Hello', 'Apple', 'Banana', 'Peter']
  • 'Hell' was removed because it is a substring of 'Hello'
  • 'Ban' was removed because it is a substring of 'Banana'
  • 'P' was removed because it is a substring of 'Peter'
  • 'e' was removed because it is a substring of 'Hello', 'Hell', 'Apple', and so on.

What I've done

This is my code, but I am wondering if there is a more efficient way than these nested comprehensions.

to_remove = [x for x in words for y in words if x != y and x in y]
output = [x for x in words if x not in to_remove]

How can I improve the performance? Should I use regex instead?

  • You could use a lambda as a filter. See stackoverflow.com/questions/33944647/… Commented Mar 28, 2018 at 15:23
  • Iterate on words while updating a set of all (unique) substrings, then skip words when they are in this set. Commented Mar 28, 2018 at 15:34
  • Related: Ukkonen's algorithm. Answerers, please refrain from adding yet another answer with a slightly different way of doing this in O(n^2). Commented Mar 28, 2018 at 15:53
  • Since the OP already has an O(n^2) solution, and all the solutions on the proposed dupe are bad, that other question is unlikely to help them - this should not be closed. Commented Mar 28, 2018 at 16:34
  • @Aran-Fey The other question asks for a simple way to do it. This question asks for a more efficient approach, and has the algorithm tag. I don't think they are duplicates either way. Folks can golf their favorite O(n^2) over on the other question. Commented Mar 28, 2018 at 17:50

4 Answers


@wim is correct.

Given an alphabet of fixed size, the following algorithm is linear in the total length of the words. If the alphabet is of unbounded size, it will be O(n log(n)) instead. Either way it is better than O(n^2).

Create an empty suffix tree T
Create an empty list filtered_words
Sort words in decreasing order of length
For word in words:
    if word is not a substring in T:
        Build suffix tree S for word (using Ukkonen's algorithm)
        Merge S into T
        append word to filtered_words

(The sort matters: without it, a short word that appears before its superstring, like 'Hell' before 'Hello', would be added to T and wrongly survive the filter.)
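A full Ukkonen implementation is too long to inline here. As a concrete stand-in with the same substring-query capability, here is a minimal sketch using a generalized suffix automaton instead of a suffix tree (an editorial substitution: the SuffixAutomaton class and the filter_substrings helper are illustrative names, not part of this answer). Words are fed in longest-first and separated by a sentinel character so a match can never span two words:

class SuffixAutomaton:
    # Online suffix automaton: after feeding it a text character by
    # character via extend(), contains(s) reports whether s is a substring.
    def __init__(self):
        self.next = [{}]   # per-state transition maps
        self.link = [-1]   # suffix links
        self.length = [0]  # length of the longest string reaching each state
        self.last = 0      # state representing the whole text so far

    def extend(self, c):
        cur = len(self.next)
        self.next.append({})
        self.link.append(-1)
        self.length.append(self.length[self.last] + 1)
        p = self.last
        while p != -1 and c not in self.next[p]:
            self.next[p][c] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][c]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                clone = len(self.next)
                self.next.append(dict(self.next[q]))
                self.link.append(self.link[q])
                self.length.append(self.length[p] + 1)
                while p != -1 and self.next[p].get(c) == q:
                    self.next[p][c] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def contains(self, s):
        state = 0
        for c in s:
            state = self.next[state].get(c)
            if state is None:
                return False
        return True


def filter_substrings(words):
    sa = SuffixAutomaton()
    filtered_words = []
    # Longest-first: every potential superstring is indexed before any
    # of its substrings is tested. Duplicates are caught automatically,
    # since the second copy of a word is found as a substring of the first.
    for word in sorted(words, key=len, reverse=True):
        if not sa.contains(word):
            filtered_words.append(word)
            for ch in '\x00' + word:  # sentinel keeps matches inside one word
                sa.extend(ch)
    return filtered_words


words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
print(filter_substrings(words))  # ['Banana', 'Hello', 'Apple', 'Peter']

With dict-based transitions, construction is amortized linear in the total text length for a fixed alphabet, and each membership test is linear in the word being checked, matching the bounds claimed above.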

Build the set of all (unique) substrings first, then filter the words with it:

def substrings(s):
    # All substrings of s except s itself.
    length = len(s)
    return {s[i:j + 1] for i in range(length) for j in range(i, length)} - {s}


def remove_substrings(words):
    # Collect every substring of every word, then keep only the words
    # that never occur inside another word.
    subs = set()
    for word in words:
        subs |= substrings(word)

    return {w for w in words if w not in subs}
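
For example, with the sample list from the question (the result is a set, so iteration order is arbitrary):

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
print(remove_substrings(words))  # {'Hello', 'Apple', 'Banana', 'Peter'}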

4 Comments

You are on the right track but you have a failure mode (e.g. input ['ab', 'abc'] will collect 'ab' in the result).
Oh yes indeed. Then we must build the substrings set before looping on words again. Thanks.
You mean duplicates in the result? Then a set comprehension instead of a list comprehension haha :)
If one of the words is very long, there will be O(n^2) substrings, and then the algorithm won't be as fast as the question asks. However, this is a good solution for a very long list of small words.

Note that for loops are slow in Python in general (you could use NumPy arrays or an NLP package instead); that aside, how about this:

words = list(set(words))  # eliminate duplicates
str_words = str(words)    # repr of the whole list, e.g. "['Hello', 'Hell', ...]"
r = []
for x in words:
    # a word that is a substring of another word occurs more than once
    # in the repr, so its first and last occurrences differ
    if str_words.find(x) != str_words.rfind(x):
        continue
    r.append(x)
print(r)

And while I am answering here: I don't see a reason why C++ wouldn't be an option.
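
For reference, a quick run on the question's sample list (the set call shuffles the list, so the survivors can come out in any order):

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
words = list(set(words))
str_words = str(words)
r = [x for x in words if str_words.find(x) == str_words.rfind(x)]
print(r)  # ['Hello', 'Apple', 'Banana', 'Peter'], in some order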

7 Comments

My mind is blown. This is actually the fastest solution by a long shot. In my tests, this is showing to be 4 times as fast as the 2nd best solution posted here. (Of course it doesn't work well if any of the input words contain quotes or any other characters that'll be escaped by repr, but as long as the input is limited to letters, I don't see any reason why this wouldn't work.)
A solution that only works with a subset of input is not a solution at all. –1.
Also has a bug, try with input ['a', 'b', 'a'].
True, it doesn't work if there are duplicate words in the input. That's easily fixed with a set call though.
@Aran-Fey It worked when quotes are used (even if they appear like \' or \" it still gives True). I wonder if the OP's list really has cases where it breaks!

You can sort your data by length, and then use a list comprehension:

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
new_words = sorted(words, key=len, reverse=True)
# longest words first, so each word only has to be checked against the
# words that come before it (which are at least as long)
final_results = [a for i, a in enumerate(new_words) if not any(a in c for c in new_words[:i])]

Output:

['Banana', 'Hello', 'Apple', 'Peter']

2 Comments

Will the slicing at new_words[:i] not slow down the operation? It has to build a new list, I believe.
If the slicing is really a problem, it can be replaced with itertools.islice.
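
For what it's worth, here is a sketch of the islice variant suggested in the last comment; it walks the first i items lazily instead of building a new list each time, and produces the same output:

from itertools import islice

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
new_words = sorted(words, key=len, reverse=True)
final_results = [a for i, a in enumerate(new_words)
                 if not any(a in c for c in islice(new_words, i))]
print(final_results)  # ['Banana', 'Hello', 'Apple', 'Peter']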
