6

I have a list with strings as below.

candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]

And I want the list to be filtered as ["HelloWorld", "Foo", "Bar"], because others are substrings. I can do it like this, but don't think it's fast or elegant.

def filter_not_substring(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            if a in b:
                break
        else:
            survive.append(a)
    return survive

Is there any fast way to do it?

8
  • i dont think there is a faster way then looping though the list twice. There are more elegant ways to write it tho. Commented Jan 5, 2022 at 2:57
  • First, presort the input list by increasing length. Testing the shorter substrings against the rest of the list will quikcly get rid of "Hello", "World", "ar" Commented Jan 5, 2022 at 4:05
  • 1
    Can there be duplicates in the input? Commented Jan 5, 2022 at 15:03
  • 1
    And is that a realistic input? You have such tiny lists, which take almost no time, and still need something faster? Commented Jan 5, 2022 at 15:07
  • 1
    Without describing the larger cases, and with accepting an answer that doesn't scale well and that emphasizes benchmarks with such small input, it's not "of course" that you have much bigger input. Commented Jan 5, 2022 at 16:55

1 Answer 1

7

How about:

candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]
result = [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]
print(result)

Counter to what was suggested in the comments:

from timeit import timeit


def filter_not_substring(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            if a in b:
                break
        else:
            survive.append(a)
    return survive


def filter_not_substring2a(candidates):
    return [c for c in candidates if not any(len(o) > len(c) and c in o for o in candidates)]


def filter_not_substring2b(candidates):
    return [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]


xs = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar", "bar"]
print(filter_not_substring(xs), filter_not_substring2a(xs), filter_not_substring2b(xs))
print(timeit(lambda: filter_not_substring(xs)))
print(timeit(lambda: filter_not_substring2a(xs)))
print(timeit(lambda: filter_not_substring2b(xs)))

Result:

['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar']
1.5163685
4.6516653
3.8334089999999996

So, OP's solution is substantially faster, but filter_not_substring2b is still about 20% faster than 2a. So, putting the len comparison first doesn't save time.

For any production scenario, OP's function is probably optimal - a way to speed it up might be to bring the whole problem into C, but I doubt that would show great gains, since the logic is pretty straightforward already and I'd expect Python to do a fairly good job of it as well.

User @ming noted that OP's solution can be improved a bit:

def filter_not_substring_b(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a in b and a != b:
                break
        else:
            survive.append(a)
    return survive

This version of the function is somewhat faster, for me about 10-15%

Finally, note that this is only just faster than 2b, even though it is very similar to the optimised solution by @ming, but almost 3x slower than their solution. It's unclear to me why that would be - if anyone has fairly certain thoughts on that, please share in the comments:

def filter_not_substring_c(candidates):
    return [a for a in candidates if all(a not in b or a == b for b in candidates)]
Sign up to request clarification or add additional context in comments.

14 Comments

i doubt this would be any faster.
I think you're right - might be considered more "elegant", for whatever that's worth.
Shouldn't you perform the length check before performing the substring check if you're trying to use short-circuit evaluation?
I found that using OP's solution filter_not_substring(), replacing the content inside for loop with if a in b and a != b: break, will be ~18% even faster
@Ming Another reason/improvement for slow any.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.