I'm trying to match dates with several regular expressions that use named groups, so that each regex returns the same group names into the DataFrame. The idea is to search with the first regex; if there is no match, use the second regex and send the result to the same groups/columns, and so forth. Every regex has at most 3 groups (month, day, year). Sometimes the order is different, sometimes only some of the groups are present (for example, only month and year), etc. Don't worry about the correctness of the regexes; I just want to figure out the groups problem. Sample regexes:

regex1 = r'(?P<extracted>(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4}))'
regex2 = r'(?P<extracted>(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?P<year>[1|2]\d{3}))'
regex3 = r'(?P<extracted>(?P<year>[1|2]\d{3}))'
full_regex = f'({regex1}|{regex2}|{regex3})'
df_captured = df['original'].str.extract(full_regex)

The problem is that named groups can't be repeated. Is there a solution that doesn't use nested if statements or something uglier?
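For illustration, the limitation can be reproduced with the standard `re` module alone (which pandas uses by default, as the answer below notes). This is a minimal sketch with a made-up two-alternative pattern, not the patterns above; the pattern and the variable names are assumptions chosen only to trigger the error.

```python
import re

# Hypothetical pattern: both alternatives define a group called "year".
pattern = r'(?P<year>\d{4})-\d{2}|\d{2}/(?P<year>\d{4})'

try:
    re.compile(pattern)
    error_message = None
except re.error as exc:
    # The standard library rejects the duplicate name at compile time.
    error_message = str(exc)

print(error_message)
```

The pattern never gets as far as matching anything: compilation itself fails, so the same failure shows up when the combined pattern is passed to `str.extract`.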

1 Answer

You may use the PyPI regex module, since it allows any number of identically named capturing groups in one pattern. It will require the use of apply though, since the default regex library used by pandas is re.

Example solution:

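import numpy as np
import pandas as pd
import regex  # third-party module from PyPI: pip install regex

df = pd.DataFrame({'original': ['Oct 2019', 'Some 12-04-2002', '2021']})

# Raw strings keep the backslashes intact. All three patterns reuse the same group names.
# [12] is used instead of [1|2], which would also match a literal '|'.
regex1 = r'(?P<extracted>(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4}))'
regex2 = r'(?P<extracted>(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?P<year>[12]\d{3}))'
regex3 = r'(?P<extracted>(?P<year>[12]\d{3}))'
full_regex = f'(?:{regex1}|{regex2}|{regex3})'

def extract_regex(text, pattern):
    # Return day, month and year from the first match, or NaN values if nothing matches.
    m = regex.search(pattern, text)
    if not m:
        return pd.Series([np.nan, np.nan, np.nan])
    # Groups that did not take part in the match come back as None.
    return pd.Series([m.group("day"), m.group("month"), m.group("year")])

df_captured = df['original'].apply(lambda x: extract_regex(x, full_regex))
df_captured.columns = ['Day', 'Month', 'Year']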

Output:

>>> df_captured
    Day Month  Year
0  None   Oct  2019
1    04    12  2002
2  None  None  2021
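If installing a third-party module is not an option, a rough alternative is to keep the patterns separate and try them in order with the standard `re` module. The sketch below is not part of the approach above; `PATTERNS` and `extract_date_parts` are hypothetical names, and the outer `extracted` group is left out for brevity. Because each pattern is compiled on its own, the group names can repeat across patterns.

```python
import re

# Patterns are tried in order; names may repeat because each is compiled separately.
PATTERNS = [
    re.compile(r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4})'),
    re.compile(r'(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?P<year>[12]\d{3})'),
    re.compile(r'(?P<year>[12]\d{3})'),
]

def extract_date_parts(text):
    """Return day, month and year from the first pattern that matches."""
    for pattern in PATTERNS:
        m = pattern.search(text)
        if m:
            found = m.groupdict()
            # Groups a pattern does not define are reported as None.
            return {key: found.get(key) for key in ('day', 'month', 'year')}
    return {'day': None, 'month': None, 'year': None}

for sample in ['Oct 2019', 'Some 12-04-2002', '2021']:
    print(sample, extract_date_parts(sample))
```

With pandas, something like `df['original'].apply(lambda s: pd.Series(extract_date_parts(s)))` should give the same three columns. Note one behavioral difference: this version prefers an earlier pattern anywhere in the text, whereas a single alternation returns the leftmost match of any alternative. For the sample rows the results are the same.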

2 Comments

Thank you for the great answer. My only doubt is about the ?:. My understanding is that those two symbols indicate to not capture that group(), but what is it used for here? I tried with and without ?: and didn't see any difference in the output. Regards
@ELECE Since you only need the values from capturing groups, we need no extra group in the resulting match data object. It saves a tiny bit of computational resources, and makes the match data object "lighter".
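To make the difference concrete, here is a small sketch using the standard `re` module (the behavior of `?:` is the same in `regex`). The `inner` pattern is a simplified stand-in, not one of the patterns from the answer. The named groups come out identical either way; the plain parentheses only add one extra numbered group.

```python
import re

# Simplified stand-in pattern with two named groups.
inner = r'(?P<month>[A-Z][a-z]{2})\s(?P<year>\d{4})'

capturing = re.compile(f'({inner})')        # outer group is captured
non_capturing = re.compile(f'(?:{inner})')  # outer group only groups

m1 = capturing.search('Oct 2019')
m2 = non_capturing.search('Oct 2019')

print(capturing.groups, m1.groups())
print(non_capturing.groups, m2.groups())
print(m1.groupdict() == m2.groupdict())
```

Since the answer reads values only by name, both variants give the same DataFrame, which is why no difference shows up in the output.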