
So, I have a list of regex patterns and a list of strings. What I want to do is determine whether any of the strings match none of the regexes.

At present, I'm pulling the regexes and the values to be matched out of two dictionaries, building one list of patterns and one of keys:

import re

patterns = []
keys = []
for pattern, schema in patternproperties.items():
    patterns.append(pattern)
for key, value in value_obj.items():
    keys.append(key)

# Now work out if there are any non-matching keys

for key in keys:
    matches = 0
    for pattern in patterns:
        if re.match(pattern, key):
            matches += 1
    if matches == 0:
        print 'Key %s matches no patterns' % key

But this seems horribly inefficient. Does anyone have pointers to a better solution?

4 Comments
  • A simple improvement is to break out of the loop once you've found a regex that matches the key. Commented Jul 12, 2013 at 18:31
  • Are you sure you want to use re.match? search() vs. match() Commented Jul 12, 2013 at 18:35
  • Your patterns list is completely useless. Simply iterate over the patternproperties dictionary. Commented Jul 12, 2013 at 18:39
  • Similarly: for pattern, schema in patternproperties.items(): patterns.append(pattern) does exactly the same thing as patterns = patternproperties.keys(), just less obviously, more verbosely, and probably slower to boot. Likewise for keys: it's just value_obj.keys(). And, as Bakuriu points out, looping over a dictionary is the same as looping over its keys. Commented Jul 12, 2013 at 18:46

3 Answers


Regexps are optimized for searching large blocks of text, not sequences of small blocks. So, you may want to consider searching '\n'.join(keys) instead of searching each one separately.
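As a sketch of that first idea (the sample data and the `[^\n]*` tail are my additions, not from the answer): anchoring the combined pattern at each line start with re.MULTILINE mimics re.match against every key in a single scan.

```python
import re

# Hypothetical sample data for illustration.
patterns = [r"foo\d+", r"bar.*"]
keys = ["foo1", "bar_x", "baz", "foo2", "quux"]

# One big scan: anchor the combined alternation at each line start and
# let [^\n]* extend the match to the end of the line, so group(0) is
# the whole key that matched.
line_re = re.compile(
    r"^(?:%s)[^\n]*" % "|".join("(?:%s)" % p for p in patterns),
    re.MULTILINE,
)

matched = {m.group(0) for m in line_re.finditer("\n".join(keys))}
unmatched = [key for key in keys if key not in matched]
print(unmatched)  # ['baz', 'quux']
```

This assumes no key contains a newline, and that duplicate keys can be treated identically.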

Or, alternatively, instead of moving the loops from Python to regexp, move the implicit "or"/"any" bit from Python to regexp:

pattern = re.compile('|'.join('(?:{})'.format(p) for p in patterns))
for key in keys:
    if not pattern.match(key):
        print 'Key %s matches no patterns' % key

Also, note that I used re.compile. This may not help, because of the automagic regexp caching… but it never hurts, and it often makes the code easier to read, too.


From a quick timeit test, with a shortish list of keys, and different numbers of simple patterns:

patterns   original   alternation
2          76.1 us    42.4 us
3          109 us     42.5 us
4          143 us     43.3 us

So, we've gone from linear in the number of patterns, to nearly constant.

Of course that won't hold up with much more complex patterns, or too many of them.


6 Comments

Good idea. This will be way faster, reducing the number of iterations from inputCount * patternCount to inputCount + patternCount.
@fatcat1111: I assume you mean the second alternative. That's not really true. Even in the best case, where all of the patterns are completely distinct, there is an O(N log M) term, it's just got a very small coefficient compared to the O(M) term. But if the patterns start sharing possible prefixes, it can become superlinear on M. And if you throw in lookbehinds and the like, it can even push a non-exponential expression into exponential. So, it's often not too much worse than N+M, but the actual complexity is… well, unbounded, I guess.
thanks for the analysis, I truly appreciate your taking the time to explain that. Unfortunately I don't completely follow (sorry). My thinking was that you need to walk through all of the patterns once (your list comprehension), and walk through all of the keys once (your for loop), so the total complexity would be O(n+m). But you're saying that there's a logarithmic term? Is that from the evaluation of the patterns themselves? Thanks again for taking to time to explain this.
The O(m) to concatenate the patterns (and compile the result) is pretty small; it's the match part that's dependent (in a complicated way) on both n and m. If you look at the NFA that the combined regex results in, it's easier to understand, but still not trivial (unless your alternations are completely independent simple patterns).
I think I see. So your analysis considers not only the number of patterns and the number of input strings, but also the number and complexity of "sub-patterns" that constitute the compiled pattern, correct? If so, that seems like a more meaningful analysis than my naive one. Thank you!
[key for key in keys if not any(re.match(pattern, key) for pattern in patterns)]

5 Comments

Can you explain why this is more efficient?
@EmilSit any short-circuits when it finds a pattern that matches. OP's code doesn't.
Also, looping in a listcomp is a bit faster than an explicit Python loop. But that's a micro-optimization that isn't worth worrying about until you've eliminated the bigger problems.
And I guess this can be tidied up massively with avoiding either of the first two list-creations: [key for key in value_obj if not any(re.match(pattern, key) for pattern in patternproperties)]. Thanks for the tip!
@Ignacio Please try to explain code rather than just giving answers so that others may learn from it and it'll be more applicable.

You can optimize this in a number of ways. The basic algorithm is reasonable, so you have some choices:

  • Break out of the loop early if something matches (instead of counting the number of matches, which you don't care about).
  • Cache the compilation of the regular expression (if you have a lot of patterns).
  • Order the regular expressions so that the ones that will tend to match quickly come first. That way will give your early termination the most bang for the buck.
  • Use a list comprehension, which may be faster than manually iterating (and may one day allow the Python interpreter to parallelize, though it probably doesn't today). It's not necessarily easier to read though. (See is it better to use list comprehensions or for each loops for some opinions.)
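The first two bullets can be sketched together like this (sample data is hypothetical); Python's for/else clause gives the "matched nothing" case for free:

```python
import re

# Hypothetical sample data for illustration.
patterns = [r"foo\d+", r"bar.*"]
keys = ["foo1", "baz", "bar_x"]

# Compile each pattern once up front, then stop at the first match.
compiled = [re.compile(p) for p in patterns]

unmatched = []
for key in keys:
    for regex in compiled:
        if regex.match(key):
            break  # early termination: we only care whether *any* pattern matches
    else:
        # for/else: the inner loop finished without a break, so nothing matched.
        unmatched.append(key)

print(unmatched)  # ['baz']
```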

A different algorithm might be to iterate first over the patterns and remove things from the set of potential keys as soon as one pattern matches. Something like:

remainder = set(keys)
for pattern in patterns:
    toremove = set()
    for key in remainder:
        if re.match(pattern, key):
            toremove.add(key)
    remainder -= toremove

which might be helpful if you have a pattern that matches a lot of keys.
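Under the same idea (sample data is hypothetical), the inner loop can be folded into a set comprehension, with an early exit once every key has matched something:

```python
import re

# Hypothetical sample data for illustration.
patterns = [r"foo\d+", r"bar.*"]
keys = ["foo1", "baz", "bar_x", "quux"]

remainder = set(keys)
for pattern in patterns:
    # Discard every key this pattern matches in one set comprehension.
    remainder -= {key for key in remainder if re.match(pattern, key)}
    if not remainder:
        break  # every key matched something; no need to try more patterns

print(sorted(remainder))  # ['baz', 'quux']
```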

You should of course measure for your situation and inputs to determine what optimizations are most appropriate.

2 Comments

Ordering the patterns may be a really complex task (think of patterns entered by the user, or read from a config file, etc.). Self-organizing lists may be useful here (when a pattern matches, it's moved to the top of the list), but this matters only if there are a lot of patterns.
@johnthexiii: No it doesn't. Here is some evidence. (That's with CPython 3.3.1. With PyPy 2.0b, the times are 2.26us and 2.28us.) See PySequence_Fast and PySequence_Fast_GET_ITEM for a hint at why, or look at the source for full details.
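The self-organizing-list suggestion might be sketched like this (sample data is hypothetical): a pattern that just matched is moved to the front, so frequently matching patterns are tried first.

```python
import re

# Hypothetical sample data for illustration.
patterns = [r"foo\d+", r"bar.*"]
keys = ["bar_1", "bar_2", "baz", "foo9"]

compiled = [re.compile(p) for p in patterns]
unmatched = []
for key in keys:
    for i, regex in enumerate(compiled):
        if regex.match(key):
            # Move-to-front: a pattern that just matched is likely to match again.
            compiled.insert(0, compiled.pop(i))
            break
    else:
        unmatched.append(key)

print(unmatched)  # ['baz']
```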
