0

There is a string list, for example ["abc", "ab", "ad", "cde", "cde", "de", "def"] I would like the output to be ["abc", "ad", "cde", "def"]

"ab" was removed because it is the substring of "abc" "cde" was removed because it is the substring of another "cde" "de" was removed because it is the substring of "def"

What is the fastest algorithm?

I have a brute-force method, which is O(n^2) as follows:

def keep_long_str(str_list):
    str_list.sort(key = lambda x: -len(x))
    cleaned_str_list = []
    for element in str_list:
        element = element.lower()
        keep_element = 1
        for cleaned_element in cleaned_str_list:
            if element in cleaned_element:
                keep_element = 0
                break
            else:
                keep_element = 1
        if keep_element:
            cleaned_str_list.append(element)
    return cleaned_str_list
6
  • * Remove, sorry for the typo, I don't know all to modify the question Commented Apr 16, 2020 at 22:46
  • Click the edit link under the question. The title is in a separate text box at the top. Commented Apr 16, 2020 at 22:53
  • Please repeat the intro tour. You'll get better response if you show your effort: post your code, describe the complexity (such as O(n^2)), and perhaps suggest -- in general -- how it might be improved. Commented Apr 16, 2020 at 22:54
  • "What is the fastest way" often translates to "I don't know how to do this; give me some code?" Commented Apr 16, 2020 at 22:54
  • If the input list was ["cde", "de"], would "de" be removed? Commented Apr 16, 2020 at 22:56

2 Answers 2

1
strings = ["abc", "ab", "ad", "cde", "cde", "de", "def"]
unique_strings = []

for s in strings: 
     if all(s not in uniq for uniq in unique_strings):
         unique_strings.append(s)

After running this code, unique_strings equals ['abc', 'cde', 'def', 'ad'].

Note: This is probably not the fastest way to do this, but it is a simple solution.

Sign up to request clarification or add additional context in comments.

3 Comments

If the shorter string comes before the longer one, strings = ["ab", "abc", "ad", "cde", "cde", "de", "def"], the result is {'abc', 'ad', 'def', 'ab', 'cde'}. This could be corrected by sorting the longer strings first, strings.sort(key=len, reverse=True).
Why use a set instead of a list here? You only use the set for iterating over and adding to, so a list should be faster. By the way, this solution is also O(n^2).
Thanks @kaya3! Updated the answer with your suggestion.
0

I looked at the answer by Jack Moody and Chris Charley and still didn't like the use of all when any could break out of the loop on the first occurrence of a super-string, so came up with this alteration:

strings = ["abc", "ab", "ad", "cde", "cde", "de", "def"]
unique_strings = []
for s in sorted(strings, reverse=True):  # Largest first 
    if not any(s in uniq for uniq in unique_strings):
        unique_strings.append(s)
print(unique_strings)  # ['def', 'cde', 'ad', 'abc']

I didn't think there was a need to sort explicitely on string len as it is part of string compares anyway. Cheers :-)

4 Comments

all is short-circuiting too.
If the test list is constructed like strings = ["ab", "abc", "ad", "cde", "cde", "de", "def"], then the results will be {'abc', 'ad', 'def', 'ab', 'cde'}. ab is in the results and shouldn't be. So, I think it is still necessary to sort the bits by length.
Hi Chris, I used your initial order and got the same result as before with my code, as the sort makes the code not depend on the original order. When working through the loop, string "abc" is considered before string "ab" due to the sort and because "ab" is a substring, it will not be added to the result.
Ok, I missed your sort in your post.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.