I have a list with many words (100,000+), and I'd like to remove every word that is a substring of another word in the list.
So for simplicity, let's imagine that I have the following list:
words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
The desired output is:
['Hello', 'Apple', 'Banana', 'Peter']
'Hell' was removed because it is a substring of 'Hello'.
'Ban' was removed because it is a substring of 'Banana'.
'P' was removed because it is a substring of 'Peter'.
'e' was removed because it is a substring of 'Hello', 'Hell', 'Apple', and so on.
What I've done
This is my code, but I am wondering if there is a more efficient way than these nested comprehensions.
to_remove = [x for x in words for y in words if x != y and x in y]
output = [x for x in words if x not in to_remove]
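For anyone who wants to reproduce or time this, here is a minimal self-contained sketch of the same quadratic idea. The function name `remove_substrings` is hypothetical, and the `any()` early exit and the `set` are small assumptions that are not in my snippet above; they do not change the result, and the approach is still O(n²) comparisons.

```python
def remove_substrings(words):
    """Return the words that are not a substring of any other, different word."""
    # Same pairwise comparison as above; any() stops at the first match
    # instead of scanning the whole list for every word.
    to_remove = {x for x in words if any(x != y and x in y for y in words)}
    # Set membership makes this filtering step a constant-time lookup per word.
    return [x for x in words if x not in to_remove]


words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
output = remove_substrings(words)
print(output)  # ['Hello', 'Apple', 'Banana', 'Peter']
```

Note that identical duplicates are never treated as substrings of each other here, because of the `x != y` check.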
How can I improve the performance? Should I use regex instead?