
So, I have a list of regex patterns and a list of strings. What I want to do is determine whether any of the strings match none of the regexes.

At present, I'm pulling the regexes and the values to be matched out of two dictionaries, building one list of patterns and one of keys:

import re

patterns = []
keys = []
for pattern, schema in patternproperties.items():
    patterns.append(pattern)
for key, value in value_obj.items():
    keys.append(key)

# Now work out if there are any non-matching keys

for key in keys:
    matches = 0
    for pattern in patterns:
        if re.match(pattern, key):
            matches += 1
    if matches == 0:
        print 'Key %s matches no patterns' % key

But this seems horribly inefficient. Does anyone have pointers to a better solution?

4 Comments
  • A simple improvement is to break out of the loop once you've found a regex that matches the key. Commented Jul 12, 2013 at 18:31
  • Are you sure you want to use re.match? search() vs. match() Commented Jul 12, 2013 at 18:35
  • Your patterns list is completely useless. Simply iterate over the patternproperties dictionary. Commented Jul 12, 2013 at 18:39
  • Similarly: for pattern, schema in patternproperties.items(): patterns.append(pattern) does exactly the same thing as patterns = patternproperties.keys(), just less obviously, more verbosely, and probably slower to boot. Likewise for keys: it's just value_obj.keys(). And, as Bakuriu points out, looping over a dictionary is the same as looping over its keys. Commented Jul 12, 2013 at 18:46

3 Answers


Regexps are optimized for searching large blocks of text, not sequences of small blocks. So, you may want to consider searching '\n'.join(keys) instead of searching each one separately.
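As a sketch of that first idea (the sample data and the `[^\n]*` tail are my additions, not from the answer): anchoring the combined pattern at each line start with re.MULTILINE mimics re.match against every key in a single scan.

```python
import re

# Hypothetical sample data for illustration.
patterns = [r"foo\d+", r"bar.*"]
keys = ["foo1", "bar_x", "baz", "foo2", "quux"]

# One big scan: anchor the combined alternation at each line start and
# let [^\n]* extend the match to the end of the line, so group(0) is
# the whole key that matched.
line_re = re.compile(
    r"^(?:%s)[^\n]*" % "|".join("(?:%s)" % p for p in patterns),
    re.MULTILINE,
)

matched = {m.group(0) for m in line_re.finditer("\n".join(keys))}
unmatched = [key for key in keys if key not in matched]
print(unmatched)  # ['baz', 'quux']
```

This assumes no key contains a newline, and that duplicate keys can be treated identically.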

Or, alternatively, instead of moving the loops from Python to regexp, move the implicit "or"/"any" bit from Python to regexp:

pattern = re.compile('|'.join('(?:{})'.format(p) for p in patterns))
for key in keys:
    if not pattern.match(key):
        print 'Key %s matches no patterns' % key

Also, note that I used re.compile. This may not help, because of the automagic regexp caching… but it never hurts, and it often makes the code easier to read, too.


From a quick timeit test, with a shortish list of keys, and different numbers of simple patterns:

patterns   original   alternation
2          76.1 us    42.4 us
3          109 us     42.5 us
4          143 us     43.3 us

So, we've gone from linear in the number of patterns, to nearly constant.

Of course that won't hold up with much more complex patterns, or too many of them.


6 Comments

Good idea. This will be way faster, reducing the number of iterations from inputCount * patternCount to inputCount + patternCount.
@fatcat1111: I assume you mean the second alternative. That's not really true. Even in the best case, where all of the patterns are completely distinct, there is an O(N log M) term, it's just got a very small coefficient compared to the O(M) term. But if the patterns start sharing possible prefixes, it can become superlinear on M. And if you throw in lookbehinds and the like, it can even push a non-exponential expression into exponential. So, it's often not too much worse than N+M, but the actual complexity is… well, unbounded, I guess.
thanks for the analysis, I truly appreciate your taking the time to explain that. Unfortunately I don't completely follow (sorry). My thinking was that you need to walk through all of the patterns once (your list comprehension), and walk through all of the keys once (your for loop), so the total complexity would be O(n+m). But you're saying that there's a logarithmic term? Is that from the evaluation of the patterns themselves? Thanks again for taking to time to explain this.
The O(m) to concatenate the patterns (and compile the result) is pretty small; it's the match part that's dependent (in a complicated way) on both n and m. If you look at the NFA that the combined regex results in, it's easier to understand, but still not trivial (unless your alternations are completely independent simple patterns).
I think I see. So your analysis considers not only the number of patterns and the number of input strings, but also the number and complexity of "sub-patterns" that constitute the compiled pattern, correct? If so, that seems like a more meaningful analysis than my naive one. Thank you!
[key for key in keys if not any(re.match(pattern, key) for pattern in patterns)]

5 Comments

Can you explain why this is more efficient?
@EmilSit any short-circuits when it finds a pattern that matches. OP's code doesn't.
Also, looping in a listcomp is a bit faster than an explicit Python loop. But that's a micro-optimization that isn't worth worrying about until you've eliminated the bigger problems.
And I guess this can be tidied up massively with avoiding either of the first two list-creations: [key for key in value_obj if not any(re.match(pattern, key) for pattern in patternproperties)]. Thanks for the tip!
@Ignacio Please try to explain code rather than just giving answers so that others may learn from it and it'll be more applicable.

You can optimize this in a number of ways. The basic algorithm is reasonable, so you have some choices:

  • Break out of the loop early if something matches (instead of counting the number of matches, which you don't care about).
  • Cache the compilation of the regular expression (if you have a lot of patterns).
  • Order the regular expressions so that the ones that will tend to match quickly come first. That way will give your early termination the most bang for the buck.
  • Use a list comprehension, which may be faster than manually iterating (and may one day allow the Python interpreter to parallelize, though it probably doesn't today). It's not necessarily easier to read though. (See is it better to use list comprehensions or for each loops for some opinions.)
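The first two bullets can be sketched together like this (sample data is hypothetical); Python's for/else clause gives the "matched nothing" case for free:

```python
import re

# Hypothetical sample data for illustration.
patterns = [r"foo\d+", r"bar.*"]
keys = ["foo1", "baz", "bar_x"]

# Compile each pattern once up front, then stop at the first match.
compiled = [re.compile(p) for p in patterns]

unmatched = []
for key in keys:
    for regex in compiled:
        if regex.match(key):
            break  # early termination: we only care whether *any* pattern matches
    else:
        # for/else: the inner loop finished without a break, so nothing matched.
        unmatched.append(key)

print(unmatched)  # ['baz']
```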

A different algorithm might be to iterate first over the patterns and remove things from the set of potential keys as soon as one pattern matches. Something like:

remainder = set(keys)
for pattern in patterns:
    toremove = set()
    for key in remainder:
        if re.match(pattern, key):
            toremove.add(key)
    remainder -= toremove

which might be helpful if you have a pattern that matches a lot of keys.
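Under the same idea (sample data is hypothetical), the inner loop can be folded into a set comprehension, with an early exit once every key has matched something:

```python
import re

# Hypothetical sample data for illustration.
patterns = [r"foo\d+", r"bar.*"]
keys = ["foo1", "baz", "bar_x", "quux"]

remainder = set(keys)
for pattern in patterns:
    # Discard every key this pattern matches in one set comprehension.
    remainder -= {key for key in remainder if re.match(pattern, key)}
    if not remainder:
        break  # every key matched something; no need to try more patterns

print(sorted(remainder))  # ['baz', 'quux']
```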

You should of course measure for your situation and inputs to determine what optimizations are most appropriate.

2 Comments

Ordering the patterns may be a really complex task (think of patterns entered by the user, or read from a config file, etc.). Self-organizing lists may be useful here (when a pattern matches, it's moved to the top of the list), but this matters only if there are a lot of patterns.
@johnthexiii: No it doesn't. Here is some evidence. (That's with CPython 3.3.1. With PyPy 2.0b, the times are 2.26us and 2.28us.) See PySequence_Fast and PySequence_Fast_GET_ITEM for a hint at why, or look at the source for full details.
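The self-organizing-list suggestion might be sketched like this (sample data is hypothetical): a pattern that just matched is moved to the front, so frequently matching patterns are tried first.

```python
import re

# Hypothetical sample data for illustration.
patterns = [r"foo\d+", r"bar.*"]
keys = ["bar_1", "bar_2", "baz", "foo9"]

compiled = [re.compile(p) for p in patterns]
unmatched = []
for key in keys:
    for i, regex in enumerate(compiled):
        if regex.match(key):
            # Move-to-front: a pattern that just matched is likely to match again.
            compiled.insert(0, compiled.pop(i))
            break
    else:
        unmatched.append(key)

print(unmatched)  # ['baz']
```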
