What is the fastest algorithm: in a string list, remove all the strings which are substrings of another string [Python (or other language)]

Question

I have a list of strings, for example ["abc", "ab", "ad", "cde", "cde", "de", "def"]. I would like the output to be ["abc", "ad", "cde", "def"].

"ab" was removed because it is a substring of "abc".
"cde" was removed because it is a substring of another "cde" (only one copy of a duplicate is kept).
"de" was removed because it is a substring of "def".

What is the fastest algorithm?

I have a brute-force method, which is O(n^2) in the number of strings, as follows:

def keep_long_str(str_list):
    cleaned_str_list = []
    # Visit the longest strings first, so every possible superstring
    # is already kept by the time a shorter string is tested.
    # sorted() is used so the caller's list is not reordered.
    for element in sorted(str_list, key=len, reverse=True):
        element = element.lower()  # comparison is case-insensitive
        # Keep the element only if no kept string contains it.
        if not any(element in kept for kept in cleaned_str_list):
            cleaned_str_list.append(element)
    return cleaned_str_list

* Remove. Sorry for the typo; I don't know how to modify the question.
– Grace, Commented Apr 16, 2020 at 22:46
Click the edit link under the question. The title is in a separate text box at the top.
– user3386109, Commented Apr 16, 2020 at 22:53
Please repeat the intro tour. You'll get a better response if you show your effort: post your code, describe the complexity (such as O(n^2)), and perhaps suggest, in general, how it might be improved.
– Prune, Commented Apr 16, 2020 at 22:54
"What is the fastest way" often translates to "I don't know how to do this; give me some code?"
– Prune, Commented Apr 16, 2020 at 22:54
If the input list was ["cde", "de"], would "de" be removed?
– user3386109, Commented Apr 16, 2020 at 22:56

Jack Moody · Accepted Answer · 2020-04-18 15:32:31Z

1

strings = ["abc", "ab", "ad", "cde", "cde", "de", "def"]
unique_strings = []

for s in strings:
    if all(s not in uniq for uniq in unique_strings):
        unique_strings.append(s)

After running this code, unique_strings equals ['abc', 'ad', 'cde', 'def'].

Note: This is probably not the fastest way to do this, but it is a simple solution.
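As a hedged sketch of one way to avoid comparing every pair of strings: when the strings are short, each string's proper substrings can be enumerated and looked up in a set of the inputs. The work then scales with the total number of substrings rather than with the number of pairs. The function name `remove_substrings_by_lookup` is hypothetical, the comparison is case-sensitive (unlike the question's code), and this is not claimed to be the fastest possible method.

```python
def remove_substrings_by_lookup(strings):
    """Drop every string that is a substring of another input string."""
    # Remove exact duplicates while keeping first-occurrence order.
    unique = list(dict.fromkeys(strings))
    candidates = set(unique)
    covered = set()

    # The empty string is a substring of every other string.
    if "" in candidates and len(candidates) > 1:
        covered.add("")

    for s in unique:
        n = len(s)
        for i in range(n):
            for j in range(i + 1, n + 1):
                if j - i == n:
                    continue  # skip s itself; only proper substrings count
                piece = s[i:j]
                if piece in candidates:
                    covered.add(piece)

    return [s for s in unique if s not in covered]


print(remove_substrings_by_lookup(["abc", "ab", "ad", "cde", "cde", "de", "def"]))
# ['abc', 'ad', 'cde', 'def']
```

For long strings the number of substrings grows quadratically with length, so the pairwise loop may well be the better choice there.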

edited Apr 18, 2020 at 15:32

answered Apr 16, 2020 at 22:59

Jack Moody


3 Comments

Chris Charley Over a year ago

If the shorter string comes before the longer one, strings = ["ab", "abc", "ad", "cde", "cde", "de", "def"], the result is {'abc', 'ad', 'def', 'ab', 'cde'}. This could be corrected by sorting the longer strings first, strings.sort(key=len, reverse=True).
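The length-based sort suggested in this comment can be sketched as follows. The function name `remove_substrings` is hypothetical; the loop itself is the one from the answer, with only the iteration order changed.

```python
def remove_substrings(strings):
    """Keep only strings that are not substrings of an already kept string."""
    kept = []
    # Longest first: a string can only be contained in a string of
    # equal or greater length, so all its superstrings are seen first.
    for s in sorted(strings, key=len, reverse=True):
        if not any(s in longer for longer in kept):
            kept.append(s)
    return kept


# The shorter "ab" comes before "abc" in the input, yet it is still removed.
print(remove_substrings(["ab", "abc", "ad", "cde", "cde", "de", "def"]))
# ['abc', 'cde', 'def', 'ad']
```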

kaya3 Over a year ago

Why use a set instead of a list here? You only use the set for iterating over and adding to, so a list should be faster. By the way, this solution is also O(n^2).

Jack Moody Over a year ago

Thanks @kaya3! Updated the answer with your suggestion.

Paddy3118 · Accepted Answer · 2020-04-17 16:22:48Z

0

I looked at the answer by Jack Moody and Chris Charley and still didn't like the use of all when any could break out of the loop on the first occurrence of a super-string, so came up with this alteration:

strings = ["abc", "ab", "ad", "cde", "cde", "de", "def"]
unique_strings = []
for s in sorted(strings, reverse=True):  # Reverse lexicographic order
    if not any(s in uniq for uniq in unique_strings):
        unique_strings.append(s)
print(unique_strings)  # ['def', 'cde', 'ad', 'abc']

I didn't think there was a need to sort explicitly on string length, as it is part of string comparison anyway. Cheers :-)
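One caveat about relying on plain string comparison: lexicographic order places a string after its own prefixes, but it gives no such guarantee for a substring that is not a prefix. A small check illustrates this, using a hypothetical helper `dedupe` that wraps the loop above and optionally takes a sort key:

```python
def dedupe(strings, **sort_kwargs):
    """Run the filtering loop over strings sorted in reverse order."""
    kept = []
    for s in sorted(strings, reverse=True, **sort_kwargs):
        if not any(s in k for k in kept):
            kept.append(s)
    return kept


sample = ["ab", "b"]  # "b" is a substring of "ab", but not a prefix of it

print(dedupe(sample))           # ['b', 'ab']  ("b" sorts after "ab", so it is kept)
print(dedupe(sample, key=len))  # ['ab']
```

Sorting with `key=len` handles both cases, since a substring is never longer than the string that contains it.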

answered Apr 17, 2020 at 16:22

Paddy3118


4 Comments

kaya3 Over a year ago

all is short-circuiting too.
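A quick way to see this is to count how many items each built-in actually consumes. The generator `probe` below is a hypothetical helper written only for this illustration:

```python
def probe(values, seen):
    """Yield each value, recording it in `seen` when it is consumed."""
    for v in values:
        seen.append(v)
        yield v


seen_by_all = []
result_all = all(probe([True, False, True, True], seen_by_all))

seen_by_any = []
result_any = any(probe([False, True, False, False], seen_by_any))

print(result_all, seen_by_all)  # False [True, False]
print(result_any, seen_by_any)  # True [False, True]
```

Both stop at the first item that decides the result, so `all(s not in u ...)` and `not any(s in u ...)` do the same amount of work.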

Chris Charley Over a year ago

If the test list is constructed like strings = ["ab", "abc", "ad", "cde", "cde", "de", "def"], then the results will be {'abc', 'ad', 'def', 'ab', 'cde'}. ab is in the results and shouldn't be. So, I think it is still necessary to sort the bits by length.

Paddy3118 Over a year ago

Hi Chris, I used your initial order and got the same result as before with my code, as the sort makes the code not depend on the original order. When working through the loop, string "abc" is considered before string "ab" due to the sort and because "ab" is a substring, it will not be added to the result.

Chris Charley Over a year ago

Ok, I missed your sort in your post.
