
I have a list with many words (100,000+), and what I'd like to do is remove every word that is a substring of another word in the list.

So for simplicity, let's imagine that I have the following list:

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']

The desired output is the following:

['Hello', 'Apple', 'Banana', 'Peter']
  • 'Hell' was removed because it is a substring of 'Hello'
  • 'Ban' was removed because it is a substring of 'Banana'
  • 'P' was removed because it is a substring of 'Peter'
  • 'e' was removed because it is a substring of 'Hello', 'Hell', 'Apple', and so on.

What I've done

This is my code, but I am wondering if there is a more efficient way than these nested comprehensions.

to_remove = [x for x in words for y in words if x != y and x in y]
output = [x for x in words if x not in to_remove]

How can I improve the performance? Should I use regex instead?

  • You could use a lambda as a filter. See stackoverflow.com/questions/33944647/… Commented Mar 28, 2018 at 15:23
  • Iterate on words while updating a set of all (unique) substrings, then skip words when they are in this set. Commented Mar 28, 2018 at 15:34
  • Related: Ukkonen's algorithm. Answerers, please refrain from adding yet another answer with a slightly different way of doing this in O(n^2). Commented Mar 28, 2018 at 15:53
  • Since the OP already has an O(n^2) solution, and all the solutions on the proposed dupe are bad, that other question is unlikely to help them - this should not be closed. Commented Mar 28, 2018 at 16:34
  • @Aran-Fey The other question asks for a simple way to do it. This question asks for a more efficient approach, and has the algorithm tag. I don't think they are duplicates either way. Folks can golf their favorite O(n^2) over on the other question. Commented Mar 28, 2018 at 17:50

4 Answers


@wim is correct.

Given an alphabet of fixed size, the following algorithm is linear in the total length of the words. If the alphabet is of unbounded size, it will be O(n log(n)) instead. Either way it is better than O(n^2).

Create an empty suffix tree T
Create an empty list filtered_words
Sort words in decreasing order of length
For word in words:
    if word is not a substring in T:
        Build suffix tree S for word (using Ukkonen's algorithm)
        Merge S into T
        append word to filtered_words

(The sort matters: without it, a short word that appears before its superstring, like 'Hell' before 'Hello', would be added to T and wrongly survive the filter.)
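A full Ukkonen implementation is too long to inline here. As a concrete stand-in with the same substring-query capability, here is a minimal sketch using a generalized suffix automaton instead of a suffix tree (an editorial substitution: the SuffixAutomaton class and the filter_substrings helper are illustrative names, not part of this answer). Words are fed in longest-first and separated by a sentinel character so a match can never span two words:

class SuffixAutomaton:
    # Online suffix automaton: after feeding it a text character by
    # character via extend(), contains(s) reports whether s is a substring.
    def __init__(self):
        self.next = [{}]   # per-state transition maps
        self.link = [-1]   # suffix links
        self.length = [0]  # length of the longest string reaching each state
        self.last = 0      # state representing the whole text so far

    def extend(self, c):
        cur = len(self.next)
        self.next.append({})
        self.link.append(-1)
        self.length.append(self.length[self.last] + 1)
        p = self.last
        while p != -1 and c not in self.next[p]:
            self.next[p][c] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][c]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                clone = len(self.next)
                self.next.append(dict(self.next[q]))
                self.link.append(self.link[q])
                self.length.append(self.length[p] + 1)
                while p != -1 and self.next[p].get(c) == q:
                    self.next[p][c] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def contains(self, s):
        state = 0
        for c in s:
            state = self.next[state].get(c)
            if state is None:
                return False
        return True


def filter_substrings(words):
    sa = SuffixAutomaton()
    filtered_words = []
    # Longest-first: every potential superstring is indexed before any
    # of its substrings is tested. Duplicates are caught automatically,
    # since the second copy of a word is found as a substring of the first.
    for word in sorted(words, key=len, reverse=True):
        if not sa.contains(word):
            filtered_words.append(word)
            for ch in '\x00' + word:  # sentinel keeps matches inside one word
                sa.extend(ch)
    return filtered_words


words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
print(filter_substrings(words))  # ['Banana', 'Hello', 'Apple', 'Peter']

With dict-based transitions, construction is amortized linear in the total text length for a fixed alphabet, and each membership test is linear in the word being checked, matching the bounds claimed above.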

Build the set of all (unique) substrings first, then filter the words with it:

def substrings(s):
    # All substrings of s except s itself.
    length = len(s)
    return {s[i:j + 1] for i in range(length) for j in range(i, length)} - {s}


def remove_substrings(words):
    # Collect every substring of every word, then keep only the words
    # that never occur inside another word.
    subs = set()
    for word in words:
        subs |= substrings(word)

    return {w for w in words if w not in subs}
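
For example, with the sample list from the question (the result is a set, so iteration order is arbitrary):

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
print(remove_substrings(words))  # {'Hello', 'Apple', 'Banana', 'Peter'}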

4 Comments

You are on the right track but you have a failure mode (e.g. input ['ab', 'abc'] will collect 'ab' in the result).
Oh yes indeed. Then we must build the substrings set before looping on words again. Thanks.
You mean duplicates in the result? Then a set comprehension instead of a list comprehension haha :)
If one of the words is very long, there will be O(n^2) substrings, and then the algorithm won't be as fast as the question asks. However, this is a good solution for a very long list of small words.

Note that for loops are slow in Python in general (you could use NumPy arrays or an NLP package instead); that aside, how about this:

words = list(set(words))  # eliminate duplicates
str_words = str(words)    # repr of the whole list, e.g. "['Hello', 'Hell', ...]"
r = []
for x in words:
    # a word that is a substring of another word occurs more than once
    # in the repr, so its first and last occurrences differ
    if str_words.find(x) != str_words.rfind(x):
        continue
    r.append(x)
print(r)

And while I am answering here: I don't see a reason why C++ wouldn't be an option.
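
For reference, a quick run on the question's sample list (the set call shuffles the list, so the survivors can come out in any order):

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
words = list(set(words))
str_words = str(words)
r = [x for x in words if str_words.find(x) == str_words.rfind(x)]
print(r)  # ['Hello', 'Apple', 'Banana', 'Peter'], in some order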

7 Comments

My mind is blown. This is actually the fastest solution by a long shot. In my tests, this is showing to be 4 times as fast as the 2nd best solution posted here. (Of course it doesn't work well if any of the input words contain quotes or any other characters that'll be escaped by repr, but as long as the input is limited to letters, I don't see any reason why this wouldn't work.)
A solution that only works with a subset of input is not a solution at all. –1.
Also has a bug, try with input ['a', 'b', 'a'].
True, it doesn't work if there are duplicate words in the input. That's easily fixed with a set call though.
@Aran-Fey It worked when quotes are used (even if they appear like \' or \" it still gives True). I wonder if the OP's list really has cases where it breaks!

You can sort your data by length, and then use a list comprehension:

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
new_words = sorted(words, key=len, reverse=True)
# longest words first, so each word only has to be checked against the
# words that come before it (which are at least as long)
final_results = [a for i, a in enumerate(new_words) if not any(a in c for c in new_words[:i])]

Output:

['Banana', 'Hello', 'Apple', 'Peter']

2 Comments

Will the slicing at new_words[:i] not slow down the operation? It has to build a new list, I believe.
If the slicing is really a problem, it can be replaced with itertools.islice.
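
For what it's worth, here is a sketch of the islice variant suggested in the last comment; it walks the first i items lazily instead of building a new list each time, and produces the same output:

from itertools import islice

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
new_words = sorted(words, key=len, reverse=True)
final_results = [a for i, a in enumerate(new_words)
                 if not any(a in c for c in islice(new_words, i))]
print(final_results)  # ['Banana', 'Hello', 'Apple', 'Peter']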
