29

I have a string that I'm trying to validate against a few regex patterns. Since structural pattern matching is available in Python 3.10, I was hoping I could use that instead of writing an if-else block.

Consider a string 'validateString' with possible values 1021102, 1.25.32, string021.

The code I tried would be something like the following.

match validateString:
    case regex1:
        print('Matched regex1')
    case regex2:
        print('Matched regex2')
    case regex3:
        print('Matched regex3')

For regex 1, 2 and 3, I've tried string regex patterns and also re.compile objects but it doesn't seem to work.

I have been trying to find examples of this over the internet but can't seem to find any that cover regex pattern matching with the new python pattern matching.

Any ideas for how I can make it work?

Thanks!

6
  • 1
    Why do you think that the 3.10 structural pattern matching feature has anything to do with regex? python.org/dev/peps/pep-0634 - python.org/dev/peps/pep-0635 - python.org/dev/peps/pep-0636 Commented Jan 12, 2022 at 11:01
  • 3
    The re patterns do not have support for the match/case pattern matching. Commented Jan 12, 2022 at 11:02
  • 8
    @PatrickArtner Admittedly, it doesn't seem far-fetched that pattern matching would support regex (or generally str) patterns as well. In many languages it does work, and Python offers pattern matching support for other prominent sequence types. Commented Jan 12, 2022 at 11:06
  • 1
    @MisterMiyagi It does seem a bit strange that regex, which is used quite a lot for string patterns wouldn't be supported. I hope future updates include that though. Commented Jan 14, 2022 at 11:01
  • 3
    To me pattern matching and regex are in the same lexicon. There are many occurrences of "match" and "pattern" in regex documentation, so it makes sense that structural pattern matching could be related to regular expressions. Not far-fetched at all. Commented May 8, 2022 at 8:39

7 Answers

28

Update

I condensed this answer into a Python package to make matching as easy as pip install regex-spm:

import regex_spm

match regex_spm.fullmatch_in("abracadabra"):
  case r"\d+": print("It's all digits")
  case r"\D+": print("There are no digits in the search string")
  case _: print("It's something else")

Original answer

As Patrick Artner correctly points out in the other answer, there is currently no official way to do this. Hopefully the feature will be introduced in a future Python version and this question can be retired. Until then:

PEP 634 specifies that structural pattern matching compares the subject against literal and value patterns with the == operator. We can override that.

import re
from dataclasses import dataclass

# noinspection PyPep8Naming
@dataclass
class regex_in:
    string: str

    def __eq__(self, other: str | re.Pattern):
        if isinstance(other, str):
            other = re.compile(other)
        assert isinstance(other, re.Pattern)
        # TODO extend for search and match variants
        return other.fullmatch(self.string) is not None

Now you can do something like:

match regex_in(validated_string):
    case r'\d+':
        print('Digits')
    case r'\s+':
        print('Whitespaces')
    case _:
        print('Something else')

Caveat #1 is that you can't write a re.compile(...) call directly in a case, because Python parses it as a class pattern and tries to match based on class. You have to save the compiled pattern somewhere first.

Caveat #2 is that you can't refer to the saved pattern through a bare name (such as a local variable) either, because Python interprets a bare name as a capture target for the match subject. You need to use a dotted name, e.g. by putting the pattern into a class or enum:

class MyPatterns:
    DIGITS = re.compile(r'\d+')

match regex_in(validated_string):
    case MyPatterns.DIGITS:
        print('This works, it\'s all digits')

Groups

This could be extended even further to provide an easy way to access the re.Match object and the groups.

# noinspection PyPep8Naming
@dataclass
class regex_in:
    string: str
    match: re.Match | None = None

    def __eq__(self, other: str | re.Pattern):
        if isinstance(other, str):
            other = re.compile(other)
        assert isinstance(other, re.Pattern)
        # TODO extend for search and match variants
        self.match = other.fullmatch(self.string)
        return self.match is not None

    def __getitem__(self, group):
        return self.match[group]

# Note the `as m` in the case specification
match regex_in(validated_string):
    case r'\d(\d)' as m:
        print(f'The second digit is {m[1]}')
        print(f'The whole match is {m.match}')

1 Comment

@ahoff - I'd like to point out that python's re module caches the latest 512 patterns passed to it, so there's rarely a reason to bother with calling compile. If premature optimization was the only reason for creating your MyPatterns class example, it's premature and perhaps shouldn't be done. Having said that, maybe it's still useful to keep your code DRY if you're dealing with the same regex patterns in multiple spots...
22

Clean solution

There is a clean solution to this problem. Just hoist the regexes out of the case-clauses where they aren't supported and into the match-clause which supports any Python object.

The combined regex will also give you better efficiency than could be had by having a series of separate regex tests. Also, the regex can be precompiled for maximum efficiency during the match process.

Example

Here's a worked out example for a simple tokenizer:

import re
from collections import namedtuple

pattern = re.compile(r'(\d+\.\d+)|(\d+)|(\w+)|(".*)"')
Token = namedtuple('Token', ('kind', 'value', 'position'))
env = {'x': 'hello', 'y': 10}

for s in ['123', '123.45', 'x', 'y', '"goodbye"']:
    mo = pattern.fullmatch(s)
    match mo.lastindex:
        case 1:
            tok = Token('NUM', float(s), mo.span())
        case 2:
            tok = Token('NUM', int(s), mo.span())
        case 3:
            tok = Token('VAR', env[s], mo.span())
        case 4:
            tok = Token('TEXT', s[1:-1], mo.span())
        case _:
            raise ValueError(f'Unknown pattern for {s!r}')
    print(tok) 

This outputs:

Token(kind='NUM', value=123, position=(0, 3))
Token(kind='NUM', value=123.45, position=(0, 6))
Token(kind='VAR', value='hello', position=(0, 1))
Token(kind='VAR', value=10, position=(0, 1))
Token(kind='TEXT', value='goodbye', position=(0, 9))

Better Example

The code can be improved by writing the combined regex in verbose format for intelligibility and ease of adding more cases. It can be further improved by naming the regex sub patterns:

pattern = re.compile(r"""(?x)
    (?P<float>\d+\.\d+) |
    (?P<int>\d+) |
    (?P<variable>\w+) |
    (?P<string>".*")
""")

That can be used in a match/case statement like this:

for s in ['123', '123.45', 'x', 'y', '"goodbye"']:
    mo = pattern.fullmatch(s)
    match mo.lastgroup:
        case 'float':
            tok = Token('NUM', float(s), mo.span())
        case 'int':
            tok = Token('NUM', int(s), mo.span())
        case 'variable':
            tok = Token('VAR', env[s], mo.span())
        case 'string':
            tok = Token('TEXT', s[1:-1], mo.span())
        case _:
            raise ValueError(f'Unknown pattern for {s!r}')
    print(tok)

Comparison to if/elif/else

Here is the equivalent code written using an if-elif-else chain:

for s in ['123', '123.45', 'x', 'y', '"goodbye"']:
    if (mo := re.fullmatch(r'\d+\.\d+', s)):
        tok = Token('NUM', float(s), mo.span())
    elif (mo := re.fullmatch(r'\d+', s)):
        tok = Token('NUM', int(s), mo.span())
    elif (mo := re.fullmatch(r'\w+', s)):
        tok = Token('VAR', env[s], mo.span())
    elif (mo := re.fullmatch(r'".*"', s)):
        tok = Token('TEXT', s[1:-1], mo.span())
    else:
        raise ValueError(f'Unknown pattern for {s!r}')
    print(tok)

Compared to the match/case version, the if-elif-else chain is slower because it may run several regex matches per string and because each call goes through the re module's pattern cache instead of using a precompiled pattern. It is also less maintainable without the case names.

Because all the regexes are separate we have to capture all the match objects separately with repeated use of assignment expressions with the walrus operator. This is awkward compared to the match/case example where we only make a single assignment.

1 Comment

The main downside to the Better Example code is that match mo.lastgroup will throw an AttributeError if there isn't a match. It's an easy enough fix to check it with an if statement beforehand, but the example with only if-else statements doesn't have this problem.
12

The following example is based on R. Hettinger's talk discussing an approach similar to @ahoff's post.

Given

import re


class RegexEqual(str):
    def __eq__(self, pattern):
        return bool(re.search(pattern, self))

Code

def validate(s):
    """A naive string validator."""
    match RegexEqual(s):
        case r"\d+":
            return "Number found"
        case r"\w+":
            return "Letter found"
        case _:
            return "Unknown"

Demo

validate("123")
# 'Number found'
validate("hi")
# 'Letter found'
validate("...")
# 'Unknown'

Details

RegexEqual is a direct subclass of str that simply overrides the == operator.

RegexEqual("hello") == "h...o"
# True
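
Note that RegexEqual uses re.search, so a case matches as soon as the pattern occurs anywhere in the string. If whole-string validation is wanted, a stricter variant can be sketched with re.fullmatch; `RegexFullEqual` is a hypothetical name, not something from the talk:

```python
import re


class RegexEqual(str):
    """Same class as above: == performs re.search."""

    def __eq__(self, pattern):
        return bool(re.search(pattern, self))


class RegexFullEqual(str):
    """Hypothetical stricter variant: == requires the whole string to match."""

    def __eq__(self, pattern):
        return bool(re.fullmatch(pattern, self))


print(RegexEqual("abc123") == r"\d+")      # True
print(RegexFullEqual("abc123") == r"\d+")  # False
print(RegexFullEqual("123") == r"\d+")     # True
```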

See Also

  • R. Hettinger's toolkit on common match-case workarounds.

5

It is not possible to use regex-patterns to match via structural pattern matching (at this point in time).

From: PEP 634: structural-pattern-matching

PEP 634: Structural Pattern Matching
Structural pattern matching has been added in the form of a match statement and case statements of patterns with associated actions. Patterns consist of sequences, mappings, primitive data types as well as class instances. Pattern matching enables programs to extract information from complex data types, branch on the structure of data, and apply specific actions based on different forms of data. (emphasis mine)

Nothing in this gives any hint that invoking the match / search functions of the re module on the provided pattern is intended to be used for matching.


You can find out more about the reasoning behind structural pattern matching by reading the actual PEPs (PEP 634, PEP 635 and PEP 636); they also include ample examples of how to use it.

4

It's a little obvious and not at all fancy, but pretty readable -- just put it in the guard.

import re, logging
logger = logging.getLogger(__name__)
def singularize(plural:str) -> str:
    '''
    If I can make a reasonable guess at a singular form, return it

    :param plural: Candidate plural like 'Philadelphia 76ers'
    :return: Singular form if calculated, or argument
    '''
    match plural:
        case ies if form := re.fullmatch(r'(?i)(\w+)ies', ies):
            # gravies
            singular = f"{form.group(1)}y"
        case oes if form := re.fullmatch(r'(?i)(\w+o)es', oes):
            # potatoes
            singular = form.group(1)
        case xim if form := re.fullmatch(r'(?i)(\w+)im', xim):
            # chaverim
            singular = form.group(1)
        case xes if form := re.fullmatch(r'(?i)(\w+)es', xes):
            # glasses
            singular = form.group(1)
        case xxs if form := re.fullmatch(r'(?i)(\w*[A-RT-Z])s', xxs):
            # books
            singular = form.group(1)
        case ata if form := re.fullmatch(r'(?i)(\w+at)a', ata):
            # data
            singular = f"{form.group(1)}um"
        case xxi if form := re.fullmatch(r'(?i)(\w+)i', xxi):
            # illuminati
            singular = f"{form.group(1)}us"
        case xae if form := re.fullmatch(r'(?i)(\w+a)e', xae):
            # alumnae
            singular = form.group(1)
        case default:
            # all sorts of plural forms not covered, but I'm really just trying to convert labels to label
            singular = default
    logger.debug(f"{singular} singularized from {plural}")
    return singular

1

Just a small snippet from some WIP of mine. Works like a charm.

pattern = re.compile(r"""
(?P<empty>^$)|                      # empty line
(?P<bigcomment>^(\#\#\#))|          # comment block
(?P<comment>^(\#))|                 # single line comment
(?P<newdoc>^((\=\=\=)|(\-\-\-)))|   # new document
(?P<catchall>.*)
""",
re.VERBOSE,
)
for line in lines.splitlines():
    mo = pattern.search(line)  # Match object
    assert mo is not None, f"Line: {line}"  # otherwise the typechecker complains that mo could be None (which logically can't happen with the catchall)

    match mo.lastgroup:
        case "empty": continue
        case "bigcomment": ...

0

Although the OP asked for a solution using structural pattern matching, some readers may simply be looking for neat code, or for something in a functional style. In that case:

import re

patterns = ['^hof.*', r'a\w+c.*', '^[1-3].*']

def search(text):
    # any() with a generator stops at the first pattern that matches
    return 'found' if any(re.search(p, text) for p in patterns) else 'not found'

Usage:

search('hof!')      #found
search('abc')       #found
search('1234')      #found
search('spaghetti') #not found
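
If you also need to know which pattern matched, the same idea can be sketched with next() over a generator. `first_match` is a hypothetical helper, not part of the answer above:

```python
import re

patterns = ['^hof.*', r'a\w+c.*', '^[1-3].*']


def first_match(text: str):
    """Return the first pattern that matches `text`, or None if none does."""
    return next((p for p in patterns if re.search(p, text)), None)


print(first_match('hof!'))       # ^hof.*
print(first_match('spaghetti'))  # None
```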
