1

I have a regex matcher that inspects web traffic in Python. The issue is that sometimes, if the page is larger than 1MB and consists of a single string, the regex takes forever to complete. I'm wondering if there is a way to set a max execution timeout?

My script reads the regexes from a file and then processes them one at a time:

def match_keywords_add_to_es(pastes):
    match_list = get_re_match_list()
    for paste in pastes:
        log.info("matching %s"%paste["key"])
        for match in match_list:
            matched = match[1].findall(paste["raw"].lower())
            if len(matched) > 0:
                try:
                    paste["keywords"] = match[0]
                    res_paste = Paste(dictionary=paste)
                    Paste.add_paste(res_paste)
                except Exception,e:
                    log.error("Failed to add the paste "+str(paste)+" with error %s"%e)

Example regexes:

secret_key:
  match: .*secret(_)?key\s*=\s*(.*)?[A-Za-z0-9\/]{40}(.*)?.*

access_key:
  match: .*access(_)?key\*=\*(.*)?[A-Z0-9]{20}(.*)?.*

example.com:
    match: .*example\.com.*
4
  • show us your code, to see how you tried :) Commented Jul 18, 2016 at 3:02
  • @MarkoMackic included Commented Jul 18, 2016 at 3:23
  • We can't help you make the regexes faster if we can't look at the regexes. But it would also help if you figured out which regexes are taking too much time. Commented Jul 18, 2016 at 3:25
  • @Kevin I added some example regexes, there are about 30 different regexes it matches at the moment, but that might expand continuously. Commented Jul 18, 2016 at 3:30

1 Answer 1

3

I'm going to focus on this part of one of your regexen:

\s*(.*)?[A-Za-z0-9\/]{40}(.*)?.*

Firstly, \s*(.*)?. The problem with this is that the \s* can regurgitate whitespace and then have it consumed by the (.*)? (and the question mark is unnecessary, too). The result is that you get a lot of completely useless backtracking if your regex gets this far but then fails to match on the rest of the line. You probably want to use something more restrictive than ., to force the \s to match as much whitespace as possible and refuse to give them up again:

\s*(\S.*|)

This matches one non-whitespace character followed by anything, or the empty string. It will not be capable of consuming whitespace which is regurgitated by the \s, so that your backtracking will be less painful.

Skipping ahead, this construction is utterly baffling: (.*)?.*. The first dot star will consume all of the remaining characters on the line, so you don't need the second one. Star can also match zero occurrences, so you don't need the question mark in this case either.

You also don't need to escape /, as Wiktor Stribiżew points out in the comments.

So the revised fragment looks like this:

\s*(\S.*|)[A-Za-z0-9/]{40}(.*)

As a final note, I would suggest making one (and only one) of the (.*) sequences lazy, because I find it difficult to reason about whether you will grab the first part of the line that matches [A-Za-z0-9\/]{40} or the last part of the line that so matches. I believe that making the second one lazy will not change the meaning of the regex, but I am not 100% certain of this, so be sure to test it.

Sign up to request clarification or add additional context in comments.

5 Comments

Python re does not support possessive quantifiers like ?+, *+, ++, {1,250}+, and there is no need escaping /.
@WiktorStribiżew: Good point. I could've sworn it did, but I can't find them anywhere on the re docs page...
@Kevin if the second .* is made lazy, it will always match an empty string, as it is not followed by any construct that forces it to match more. The greedy .* will match till the end of line.
@Sebastian: True, though we could drop a $ at the end to fix that.
Sure you could, but .* is more efficient than .*?$, as the first takes one step, the second takes one step for each character that's left in the line. Also the results might vary depending on the use of multiline and singleline flags.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.