Is there a Pythonic way of filtering substrings of strings in a list?

Question

I have a list with strings as below.

candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]

And I want the list to be filtered as ["HelloWorld", "Foo", "Bar"], because others are substrings. I can do it like this, but don't think it's fast or elegant.

def filter_not_substring(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            if a in b:
                break
        else:
            survive.append(a)
    return survive

Is there any fast way to do it?

i dont think there is a faster way then looping though the list twice. There are more elegant ways to write it tho. — Anu
– Anu, Commented Jan 5, 2022 at 2:57
First, presort the input list by increasing length. Testing the shorter substrings against the rest of the list will quikcly get rid of "Hello", "World", "ar" — smci
– smci, Commented Jan 5, 2022 at 4:05
And is that a realistic input? You have such tiny lists, which take almost no time, and still need something faster? — Kelly Bundy
– Kelly Bundy, Commented Jan 5, 2022 at 15:07
Without describing the larger cases, and with accepting an answer that doesn't scale well and that emphasizes benchmarks with such small input, it's not "of course" that you have much bigger input. — Kelly Bundy
– Kelly Bundy, Commented Jan 5, 2022 at 16:55

Grismar · Accepted Answer · 2022-01-05 04:09:01Z

7

How about:

candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]
result = [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]
print(result)

Counter to what was suggested in the comments:

from timeit import timeit


def filter_not_substring(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            if a in b:
                break
        else:
            survive.append(a)
    return survive


def filter_not_substring2a(candidates):
    return [c for c in candidates if not any(len(o) > len(c) and c in o for o in candidates)]


def filter_not_substring2b(candidates):
    return [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]


xs = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar", "bar"]
print(filter_not_substring(xs), filter_not_substring2a(xs), filter_not_substring2b(xs))
print(timeit(lambda: filter_not_substring(xs)))
print(timeit(lambda: filter_not_substring2a(xs)))
print(timeit(lambda: filter_not_substring2b(xs)))

Result:

['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar']
1.5163685
4.6516653
3.8334089999999996

So, OP's solution is substantially faster, but filter_not_substring2b is still about 20% faster than 2a. So, putting the len comparison first doesn't save time.

For any production scenario, OP's function is probably optimal - a way to speed it up might be to bring the whole problem into C, but I doubt that would show great gains, since the logic is pretty straightforward already and I'd expect Python to do a fairly good job of it as well.

User @ming noted that OP's solution can be improved a bit:

def filter_not_substring_b(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a in b and a != b:
                break
        else:
            survive.append(a)
    return survive

This version of the function is somewhat faster, for me about 10-15%

Finally, note that this is only just faster than 2b, even though it is very similar to the optimised solution by @ming, but almost 3x slower than their solution. It's unclear to me why that would be - if anyone has fairly certain thoughts on that, please share in the comments:

def filter_not_substring_c(candidates):
    return [a for a in candidates if all(a not in b or a == b for b in candidates)]

edited Jan 5, 2022 at 4:09

answered Jan 5, 2022 at 2:57

Grismar

32.4k6 gold badges43 silver badges69 bronze badges

Sign up to request clarification or add additional context in comments.

14 Comments

Anu Over a year ago

i doubt this would be any faster.

Grismar Over a year ago

I think you're right - might be considered more "elegant", for whatever that's worth.

BrokenBenchmark Over a year ago

Shouldn't you perform the length check before performing the substring check if you're trying to use short-circuit evaluation?

Ming Over a year ago

I found that using OP's solution filter_not_substring(), replacing the content inside for loop with if a in b and a != b: break, will be ~18% even faster

Kelly Bundy Over a year ago

@Ming Another reason/improvement for slow any.

|

Collectives™ on Stack Overflow

Is there a Pythonic way of filtering substrings of strings in a list?

1 Answer 1

14 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

14 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related