1

A regex answers a yes or no question - does the string match the pattern.

I want to separate the no's into two categories:

  • invalid strings that are prefixes of valid strings
  • invalid strings that are not prefixes of valid

Here's an example (regular expression 01+2):

  • 012 is valid

  • 12 is invalid; it is not a prefix of a valid string

  • 01 is invalid; it is a prefix of a valid string: 012

Java can do this.

Can re do this? If not, is there a library that can make this distinction?

8
  • 1
    Define “perfectly matched”. The pattern foo matches the string The food is in the barn., for example, and it does so perfectly well. On the other hand, the pattern these fails to match that same string. Are you wanting something that stops partway through the pattern and tells you where it failed? Commented Apr 5, 2012 at 17:03
  • There I made it more precise. I want to determine whether or not a failed regex match, as I defined in my question, is a prefix of some string, or not, in the language generated by the regular expression. Commented Apr 5, 2012 at 17:11
  • Don’t use match. End of story. These things become trivial when you use the proper interface, and ridiculous if you don’t. And your idea of precision does not seem to correspond with my own. Commented Apr 5, 2012 at 17:14
  • I made it even more precise. I believe if I made it more precise than it is already, I think I would make the question inaccessible to those without a background in automata, but who might thoroughly understand the concept of regular expressions and variations thereof in the context of practical programming languages. Commented Apr 5, 2012 at 17:33
  • s/is in the language/matches/g; s/is not in the language/doesn’t match/g Commented Apr 5, 2012 at 17:54

3 Answers 3

5

I second the recommendation of regex. The module is simply fantastic.

Here's an example of fuzzy matching with regex:

import regex

# traditional matching - three digits
r = '(?:\d\d\d)'
print regex.findall(r, '1xx22yy333zz')
## ['333']

# fuzzy matching - three digits, allow at most 2 deletions
r = '(?:\d\d\d){d<3}'
print regex.findall(r, '1xx22yy333zz')
## ['1', '22', '333']

The {d<3} part basically says "if we add one or two chars to that, that would be a match" - the same as point 3. in your question.

See http://pypi.python.org/pypi/regex for more info (look for 'Approximate "fuzzy" matching').

Sign up to request clarification or add additional context in comments.

1 Comment

This does what I want. However it isn't part of what's included in a standard Python distribution, I believe. If someone can give me an answer that only involves standard libraries, I will award that as the accepted answer. If not, then yours takes it :)
2

I can’t tell what you really want. Are you trying to understand incremental matches?

Perhaps you need to learn to use anchors properly. Just as you should never use the Java misnamed and deceptive matches method instead of its find method, you should probably eschew Python’s match method in favor of search, and for precisely the same reason.

Another possibility is to rewrite your pattern with optional portions that you can then inspect the success of.

Or perhaps you should look into the fuzzy matching support in Matthew Barnett’s replacement regex library, which you really should be using instead of the crufty old re.

I can’t tell what you’re realy asking, because you haven’t given examples of desired input and output.

EDIT

Perhaps you need nothing more complex than (?=.*(?:ab|bc)).*a?b?c?, or spaced out:

 (?x)
 (?= .* (?: ab | bc) )
 .*
 a? b? c?

If you put a, b, and c into recursible subgroups, you wouldn’t even have to repeat yourself.

4 Comments

@thg435 No, it’s not alright, because that is not fuzzy matching. Fuzzy matching involves the Levenshtein distance. Please remove the inapplicable edit. You may supply your own answer if you wish.
@tchrist: According to Wikipedia: "The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character." I believe the syntax {d<3} means allow up to 2 deletions. Therefore the answer was an example of fuzzy matching, limited to 0, 1 or 2 deletions. Did you mean that since it forbade the other operations it was not an instance of pure fuzzy matching?
@WordsLikeJared Yes, the {d<3} part corresponds to fuzzy matching; I hadn’t seen it previously. However, you really need to disabuse yourself of all this ʀᴇɢᴜʟᴀʀ ʟᴀɴɢᴜᴀɢᴇ jazz, because that has nothing to do with almost any current regex syntax or system, including both Java’s and Python’s. You need to learn how to get the existing system to do whatever it is you really want to do. Modern pattern matching systems are infinitely more powerful than all this ʀᴇɢᴜʟᴀʀ ʟᴀɴɢᴜᴀɢᴇ business; just learn how to use them.
At my university they do not teach pattern matching in the programming courses. Most of my senior-level peers do not know how to use Java's Pattern class, for example, even though Java is the primary instructional language at my university. My university teaches automata, though, which is how I conceptualize things and why I used all that jargon.
2

Correct me if I am wrong, I believe the original question was: for some regex and input string if the input string is a possible match (ie. adding more characters to the end of the string could make it a match).

Thus fuzzy matching isn't an answer because:

rgx = regex.compile('(abcdef){d<5}')

str = 'bcd'

will be a match as well

Boost C++ library (Also for Java/Javascript?) provides such an option: http://www.boost.org/doc/libs/1_31_0/libs/regex/doc/partial_matches.html

So does anyone know what is the best way to achieve this in Python?

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.