
When I use a regular expression to find several keywords inside a string, it does not work as I expected. Here is an example:

import re
message = 'I really like beer, but my favourite beer is German beer.'
keywords = ['beer', 'german beer', 'german']

regex = re.compile("|".join(keywords))
regex.findall(message.lower())

Result:

['beer', 'beer', 'german beer']

But the expected result would be:

['beer', 'beer', 'german beer', 'german']

Another way to do that could be:

results = []
for k in keywords:
    regex = re.compile(k)
    for r in regex.findall(message.lower()):
        results.append(r)

Result:

['beer', 'beer', 'beer', 'german beer', 'german']

This works the way I want, but I doubt it is the best way to do it. Can somebody help me?

3 Answers

re.findall cannot find overlapping matches. If you want to use regular expressions you will have to create separate expressions and run them in a loop as in your second example.

Note that your second example can also be shortened to the following, though it's a matter of taste whether you find this more readable:

results = [r for k in keywords for r in re.findall(k, message.lower())] 

Your specific example doesn't require the use of regular expressions. You should avoid using regular expressions if you just want to find fixed strings.
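
As a rough sketch of the fixed-string approach, the loop below uses only str.find. The helper name find_all is hypothetical (it is not part of the standard library), and the search advances one character at a time so that overlapping occurrences are kept.

```python
def find_all(text, keyword):
    """Yield the start index of every occurrence of keyword in text, overlaps included."""
    start = text.find(keyword)
    while start != -1:
        yield start
        # Advance by one character, not len(keyword), so overlapping hits are kept.
        start = text.find(keyword, start + 1)


message = 'I really like beer, but my favourite beer is German beer.'
keywords = ['beer', 'german beer', 'german']

lowered = message.lower()
# One entry per occurrence of each keyword, in keyword order.
results = [k for k in keywords for _ in find_all(lowered, k)]
print(results)  # ['beer', 'beer', 'beer', 'german beer', 'german']
```

If only the number of non-overlapping occurrences matters, str.count may be enough.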


4 Comments

The questioner doesn't only want to test whether a particular substring is part of a string but he wants all occurrences of a particular substring. In this case, the use of re.findall() is the best way to accomplish that. Avoiding regular expressions would make this solution more laborious than necessary.
Thank you for your replies. Now I know I am using the wrong function (findall), so what do you recommend for finding matches, including overlapping ones?
@Adrián: Do you need the power of regular expressions or do you just want to find fixed strings?
I would like to find fixed strings, but I asked about regular expressions because I thought they were the best (most efficient) way.

re.findall is described in http://docs.python.org/2/library/re.html

"Return all non-overlapping matches of pattern in string..."

Non-overlapping means that for "german beer" it will not find "german beer" AND "german", because those matches are overlapping.
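
A small illustration of what "non-overlapping" means in practice. The strings are made up for the demonstration, and the lookahead in the last call is a common idiom for one pattern overlapping itself, not something the documentation quoted above prescribes.

```python
import re

text = 'german beer'

# Once 'german beer' has matched, its characters are consumed,
# so 'german' cannot be reported from the same stretch of text.
first = re.findall('german beer|german', text)
print(first)   # ['german beer']

# Swapping the alternatives only changes which one wins.
second = re.findall('german|german beer', text)
print(second)  # ['german']

# The same effect with a single pattern: 'aaaa' holds three 'aa' substrings,
# but only two of them do not overlap.
plain = re.findall('aa', 'aaaa')
print(plain)   # ['aa', 'aa']

# A zero-width lookahead consumes nothing, so every start position is tried.
overlapping = re.findall('(?=(aa))', 'aaaa')
print(overlapping)  # ['aa', 'aa', 'aa']
```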

4 Comments

Thanks for your reply, Omri Barel. What do you recommend for finding matches, including overlapping ones?
In general you have to do what you've done: one keyword at a time. But for a better solution you'll have to describe what you're really trying to do (i.e. what is the actual situation without simplifying to trivial examples).
Omri, as I wrote in my comment on the other answer, I asked about regular expressions because I thought they were the best and most efficient way to do that. The strings to find will always be fixed (word1|word2|word3...), I mean no complex regex.
If you have a lot of text to search, it may be worth looking at the Aho-Corasick string matching algorithm (en.wikipedia.org/wiki/…) which looks for a set of strings simultaneously (including overlapping matches). Otherwise, looking for one string at a time should do the trick.
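
To make the Aho-Corasick suggestion in the last comment concrete, here is a minimal pure-Python sketch. The function names build_automaton and find_keywords are hypothetical, and the code is meant to show the idea (a trie with failure links, scanned once over the text) rather than to be a tuned implementation.

```python
from collections import deque


def build_automaton(keywords):
    """Build trie transitions, failure links and output lists for the keywords."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> keywords that end at this state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass, so shallower states are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_keywords(text, keywords):
    """Return (start_index, keyword) for every occurrence, overlaps included."""
    goto, fail, out = build_automaton(keywords)
    state = 0
    matches = []
    for end, ch in enumerate(text):
        # Follow failure links until some state accepts ch (or we hit the root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((end - len(word) + 1, word))
    return matches


message = 'I really like beer, but my favourite beer is German beer.'
keywords = ['beer', 'german beer', 'german']
found = [word for _, word in find_keywords(message.lower(), keywords)]
print(found)  # ['beer', 'beer', 'german', 'german beer', 'beer']
```

Matches come out in the order their last character is reached in the text, so the order differs from the one-keyword-at-a-time loop even though the same five matches are found.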

A cleaner (to my eye) version of your last solution:

results = []
for key in keywords:
    results.extend(re.findall(key, message, re.IGNORECASE))
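
Because the keywords are fixed strings rather than patterns, it may be safer to pass them through re.escape first. The sketch below adds a hypothetical fourth keyword, 'beer.', purely to show the difference: unescaped, its dot would match any character.

```python
import re

message = 'I really like beer, but my favourite beer is German beer.'
# 'beer.' is a hypothetical extra keyword containing a regex metacharacter.
keywords = ['beer', 'german beer', 'german', 'beer.']

results = []
for key in keywords:
    # re.escape makes '.' match a literal dot instead of any character.
    results.extend(re.findall(re.escape(key), message, re.IGNORECASE))

print(results)
# ['beer', 'beer', 'beer', 'German beer', 'German', 'beer.']

# Without escaping, 'beer.' also matches 'beer,' and 'beer '.
unescaped = re.findall('beer.', message, re.IGNORECASE)
print(unescaped)  # ['beer,', 'beer ', 'beer.']
```

Note that with re.IGNORECASE the matches keep the capitalisation of the original message ('German'), unlike the message.lower() variants.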
