Partial Regex matching in Python 3.2

Question

A regex answers a yes or no question - does the string match the pattern.

I want to separate the no's into two categories:

invalid strings that are prefixes of valid strings
invalid strings that are not prefixes of valid

Here's an example (regular expression 01+2):

012 is valid
12 is invalid; it is not a prefix of a valid string
01 is invalid; it is a prefix of a valid string: 012

Java can do this.

Can re do this? If not, is there a library that can make this distinction?

Define “perfectly matched”. The pattern foo matches the string The food is in the barn., for example, and it does so perfectly well. On the other hand, the pattern these fails to match that same string. Are you wanting something that stops partway through the pattern and tells you where it failed? — tchrist
– tchrist, Commented Apr 5, 2012 at 17:03
There I made it more precise. I want to determine whether or not a failed regex match, as I defined in my question, is a prefix of some string, or not, in the language generated by the regular expression. — Jesus is Lord
– Jesus is Lord, Commented Apr 5, 2012 at 17:11
Don’t use match. End of story. These things become trivial when you use the proper interface, and ridiculous if you don’t. And your idea of precision does not seem to correspond with my own. — tchrist
– tchrist, Commented Apr 5, 2012 at 17:14
I made it even more precise. I believe if I made it more precise than it is already, I think I would make the question inaccessible to those without a background in automata, but who might thoroughly understand the concept of regular expressions and variations thereof in the context of practical programming languages. — Jesus is Lord
– Jesus is Lord, Commented Apr 5, 2012 at 17:33
s/is in the language/matches/g; s/is not in the language/doesn’t match/g — tchrist
– tchrist, Commented Apr 5, 2012 at 17:54

georg · Accepted Answer · 2012-04-05 17:26:57Z

5

I second the recommendation of regex. The module is simply fantastic.

Here's an example of fuzzy matching with regex:

import regex

# traditional matching - three digits
r = '(?:\d\d\d)'
print regex.findall(r, '1xx22yy333zz')
## ['333']

# fuzzy matching - three digits, allow at most 2 deletions
r = '(?:\d\d\d){d<3}'
print regex.findall(r, '1xx22yy333zz')
## ['1', '22', '333']

The {d<3} part basically says "if we add one or two chars to that, that would be a match" - the same as point 3. in your question.

See http://pypi.python.org/pypi/regex for more info (look for 'Approximate "fuzzy" matching').

answered Apr 5, 2012 at 17:26

georg

216k57 gold badges324 silver badges401 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Jesus is Lord Over a year ago

This does what I want. However it isn't part of what's included in a standard Python distribution, I believe. If someone can give me an answer that only involves standard libraries, I will award that as the accepted answer. If not, then yours takes it :)

tchrist · Accepted Answer · 2012-04-05 18:03:58Z

2

I can’t tell what you really want. Are you trying to understand incremental matches?

Perhaps you need to learn to use anchors properly. Just as you should never use the Java misnamed and deceptive matches method instead of its find method, you should probably eschew Python’s match method in favor of search, and for precisely the same reason.

Another possibility is to rewrite your pattern with optional portions that you can then inspect the success of.

Or perhaps you should look into the fuzzy matching support in Matthew Barnett’s replacement regex library, which you really should be using instead of the crufty old re.

I can’t tell what you’re realy asking, because you haven’t given examples of desired input and output.

EDIT

Perhaps you need nothing more complex than (?=.*(?:ab|bc)).*a?b?c?, or spaced out:

 (?x)
 (?= .* (?: ab | bc) )
 .*
 a? b? c?

If you put a, b, and c into recursible subgroups, you wouldn’t even have to repeat yourself.

edited Apr 5, 2012 at 18:03

answered Apr 5, 2012 at 17:08

tchrist

80.7k31 gold badges135 silver badges186 bronze badges

4 Comments

tchrist Over a year ago

@thg435 No, it’s not alright, because that is not fuzzy matching. Fuzzy matching involves the Levenshtein distance. Please remove the inapplicable edit. You may supply your own answer if you wish.

Jesus is Lord Over a year ago

@tchrist: According to Wikipedia: "The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character." I believe the syntax {d<3} means allow up to 2 deletions. Therefore the answer was an example of fuzzy matching, limited to 0, 1 or 2 deletions. Did you mean that since it forbade the other operations it was not an instance of pure fuzzy matching?

tchrist Over a year ago

@WordsLikeJared Yes, the {d<3} part corresponds to fuzzy matching; I hadn’t seen it previously. However, you really need to disabuse yourself of all this ʀᴇɢᴜʟᴀʀ ʟᴀɴɢᴜᴀɢᴇ jazz, because that has nothing to do with almost any current regex syntax or system, including both Java’s and Python’s. You need to learn how to get the existing system to do whatever it is you really want to do. Modern pattern matching systems are infinitely more powerful than all this ʀᴇɢᴜʟᴀʀ ʟᴀɴɢᴜᴀɢᴇ business; just learn how to use them.

Jesus is Lord Over a year ago

At my university they do not teach pattern matching in the programming courses. Most of my senior-level peers do not know how to use Java's Pattern class, for example, even though Java is the primary instructional language at my university. My university teaches automata, though, which is how I conceptualize things and why I used all that jargon.

user250765 · Accepted Answer · 2012-10-20 14:36:56Z

2

Correct me if I am wrong, I believe the original question was: for some regex and input string if the input string is a possible match (ie. adding more characters to the end of the string could make it a match).

Thus fuzzy matching isn't an answer because:

rgx = regex.compile('(abcdef){d<5}')

str = 'bcd'

will be a match as well

Boost C++ library (Also for Java/Javascript?) provides such an option: http://www.boost.org/doc/libs/1_31_0/libs/regex/doc/partial_matches.html

So does anyone know what is the best way to achieve this in Python?

answered Oct 20, 2012 at 14:36

user250765

911 gold badge1 silver badge4 bronze badges

Collectives™ on Stack Overflow

Partial Regex matching in Python 3.2

3 Answers 3

1 Comment

EDIT

4 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

EDIT

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related