I have a list of strings, from which I want to locate every line that has 'http://' in it, but does not have 'lulz', 'lmfao', '.png', or any other items in a list of strings in it. How would I go about this?

My instincts tell me to use regular expressions, but I have a moral objection to witchcraft.

3 Answers

Here is an option that is fairly extensible if the list of strings to exclude is large:

exclude = ['lulz', 'lmfao', '.png']
filter_func = lambda s: 'http://' in s and not any(x in s for x in exclude)

matching_lines = filter(filter_func, string_list)

List comprehension alternative:

matching_lines = [line for line in string_list if filter_func(line)]
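
To try the filter end to end, here is a minimal sketch that uses a plain def in place of the lambda. The contents of string_list are made-up sample data (the question does not show its input), and the list() wrapper is only needed on Python 3, where filter returns a lazy iterator:

```python
# Hypothetical sample data; the original question does not show its input.
string_list = [
    'see http://example.com/page',
    'http://example.com/lulz.html',
    'no link here',
    'http://example.com/cat.png',
]
exclude = ['lulz', 'lmfao', '.png']


def filter_func(s):
    """Keep lines that contain 'http://' and none of the excluded substrings."""
    return 'http://' in s and not any(x in s for x in exclude)


# On Python 3, filter() returns an iterator, so wrap it in list() to see the results.
matching_lines = list(filter(filter_func, string_list))
print(matching_lines)  # ['see http://example.com/page']
```

Adding another excluded substring only requires appending it to exclude; the filter itself does not change.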

3 Comments

Awesome! I get to use lambda! I knew it existed for some reason!
You don't have to. lambda allows you to define the function inline instead of setting up a variable filter_func; but you could just as easily write def filter_func(s): return 'http://' in s and not any(x in s for x in exclude). Remember, functions are objects.
I would even say this is an inappropriate use of lambda. There is no reason to prefer it to a def here.

This is almost equivalent to F.J's solution, but uses generator expressions instead of lambda expressions and the filter function:

haystack = ['http://blah', 'http://lulz', 'blah blah', 'http://lmfao']
exclude = ['lulz', 'lmfao', '.png']

http_strings = (s for s in haystack if s.startswith('http://'))
result_strings = (s for s in http_strings if not any(e in s for e in exclude))

print(list(result_strings))

When I run this it prints:

['http://blah']

1 Comment

+1 for generators. But, note that you can do this as a(n almost) one-liner: result_strings = [s for s in haystack if s.startswith('http://') and not any(e in s for e in exclude)]. It needs a line break to fit 80 columns (per most style guides), but I would argue it is slightly easier to follow than the two-generator version. timeit also reports that this is a fair bit faster, and also slightly faster than F.J's filter version (which, IMO, is the hardest to follow of the three).
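
Written out with the line break, the single-comprehension version described in that comment might look like the sketch below. It reuses the haystack and exclude lists from the answer; note that startswith only matches strings that begin with 'http://', which is narrower than the 'http://' in s test asked for in the question:

```python
haystack = ['http://blah', 'http://lulz', 'blah blah', 'http://lmfao']
exclude = ['lulz', 'lmfao', '.png']

# One list comprehension: keep strings that start with 'http://'
# and contain none of the excluded substrings.
result_strings = [
    s for s in haystack
    if s.startswith('http://') and not any(e in s for e in exclude)
]

print(result_strings)  # ['http://blah']
```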

Try this:

for s in strings:
    if 'http://' in s and 'lulz' not in s and 'lmfao' not in s and '.png' not in s:
        # found it
        pass

Another option, if you need the list of words to be more flexible:

words = ('lmfao', '.png', 'lulz')
for s in strings:
    if 'http://' in s and all(w not in s for w in words):
        # found it
        pass
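
If the matches need to be collected rather than handled inside the loop, the same check can be wrapped in a small generator function. This is only a sketch: find_links and the sample strings are hypothetical names and data, not part of the original answer.

```python
def find_links(strings, stop_words):
    """Yield each string that contains 'http://' and none of the stop words."""
    for s in strings:
        if 'http://' in s and all(w not in s for w in stop_words):
            yield s


# Hypothetical sample input.
strings = [
    'check http://example.com',
    'http://example.com/lulz',
    'plain text',
    'http://example.com/pic.png',
]
words = ('lmfao', '.png', 'lulz')

found = list(find_links(strings, words))
print(found)  # ['check http://example.com']
```

Because the stop words are a parameter, extending the list does not make the if line any longer.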

2 Comments

That was my first approach. But as my list grew and the line became unwieldy, I was hoping there was a better way.
That could get out of hand if he ever wanted to extend the list of stop words. How would you change your approach? But still, +1 for simple solutions.