
I am having trouble understanding the output of this regular expression. I am using the following regex to find dates in text:

^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$

It appears to be matching the pattern within text correctly, but I'm confused by the return values.

For this test string:

TestString = "10-20-2015"

It's returning this:

[('10', '20', '', '')]

If I put () around the entire regex, I get this returned:

[('10-20-2015', '10', '20', '', '')]

I would expect it to simply return the full date string, but it appears to be breaking the results up and I don't understand why. Wrapping my regex in () returns the full date string, but it also returns 4 extra values.

How do I make this ONLY match the full date string and not small parts of the string?

From my console:

Python 3.4.2 (default, Oct  8 2014, 10:45:20) 
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = "^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"
>>> TestString = "10-20-2015"
>>> re.findall(pattern, TestString, re.I)
[('10', '20', '', '')]
>>> pattern = "(^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$)"
>>> re.findall(pattern, TestString, re.I)
[('10-20-2015', '10', '20', '', '')]
>>> 
>>> TestString = "10--2015"
>>> re.findall(pattern, TestString, re.I)
[]
>>> pattern = "^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"
>>> re.findall(pattern, TestString, re.I)
[]

Based on the response, here is the pattern I ended up with: ((?:(?:1[0-2]|0[1-9])-(?:3[01]|[12][0-9]|0[1-9])|(?:3[01]|[12][0-9]|0[1-9])-(?:1[0-2]|0[1-9]))-(?:[0-9]{2})?[0-9]{2})

2 Answers


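Every pair of () is a capturing group: (1[0-2]|0?[1-9]) captures 10, (3[01]|[12][0-9]|0?[1-9]) captures 20, and so on. The last two groups belong to the alternative that did not match, so they come back as empty strings. When a pattern contains capturing groups, re.findall returns the groups instead of the full match. When you wrapped everything in (), that outer group opened before all the others, so it became the first group and captured the whole date. To group without capturing, use a non-capturing group: (?:...) instead of (...).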
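
As a small sketch of the difference, here is the pattern from the question next to the same pattern with every group made non-capturing. The variable names are illustrative and not from the original post.

```python
import re

test_string = "10-20-2015"

# Pattern from the question: four capturing groups.
capturing = r"^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"

# Same pattern with every group rewritten as (?:...).
non_capturing = r"^(?:(?:1[0-2]|0?[1-9])-(?:3[01]|[12][0-9]|0?[1-9])|(?:3[01]|[12][0-9]|0?[1-9])-(?:1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"

# With capturing groups, findall returns one tuple of groups per match.
with_groups = re.findall(capturing, test_string)

# With no capturing groups, findall returns the full matched strings.
without_groups = re.findall(non_capturing, test_string)

print(with_groups)     # [('10', '20', '', '')]
print(without_groups)  # ['10-20-2015']
```

The two empty strings in the first result are the groups from the day-month alternative, which did not take part in the match.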


1 Comment

Thank you! I knew I was missing something. Modified to this and it's working now: ((?:(?:1[0-2]|0[1-9])-(?:3[01]|[12][0-9]|0[1-9])|(?:3[01]|[12][0-9]|0[1-9])-(?:1[0-2]|0[1-9]))-(?:[0-9]{2})?[0-9]{2})

You can get both the full date and its parts with re.search(). This function scans through a string looking for the first location where the pattern matches, and returns a match object (or None if nothing matches).

import re

text = "10-20-2015"

# Raw string, so the backslashes reach the regex engine unchanged
date_regex = r'(\d{1,2})-(\d{1,2})-(\d{4})'

# \d in the pattern above matches a decimal digit such as 0-9.
# The numbers in curly brackets {} give how many digits are allowed:
# {1,2} means one or two, {4} means exactly four.
# Parentheses create capturing groups, so each part of the date
# can be retrieved on its own.

search_date = re.search(date_regex, text)

# for entire match
print(search_date.group())
# also print(search_date.group(0)) can be used
 
# for the first parenthesized subgroup
print(search_date.group(1))
 
# for the second parenthesized subgroup
print(search_date.group(2))
 
# for the third parenthesized subgroup
print(search_date.group(3))
 
# for a tuple of all matched subgroups
print(search_date.group(1, 2, 3))

Output of the print statements above:

10-20-2015
10
20
2015
('10', '20', '2015')
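
If the text contains several dates, one possible variation is re.finditer, which yields a match object per date, so group(0) is still the whole match even though the pattern has capturing groups. This is only a sketch: the sample sentence is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample text, invented for this example.
text = "Invoices dated 10-20-2015 and 1-5-2016 are overdue."

date_regex = r'(\d{1,2})-(\d{1,2})-(\d{4})'

# group(0) is the full match; groups() is the tuple of captured parts.
full_dates = [m.group(0) for m in re.finditer(date_regex, text)]
parts = [m.groups() for m in re.finditer(date_regex, text)]

print(full_dates)  # ['10-20-2015', '1-5-2016']
print(parts)       # [('10', '20', '2015'), ('1', '5', '2016')]
```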

Hope this answer clears your doubt :-)
