Regex/Python - why is non capturing group captured in this case?

Question

Each element of this raw data array is parsed by regex

['\r\n\t\t\t\t\t\t', 
 'Monday, Tuesday, Wednesday, Thursday, Friday, Saturday:', 
 ' 12:00 pm to 03:30 pm & 07:00 pm to 12:00 am\t\t\t\t\t',      
 '\r\n\t\t\t\t\t\t', 
 'Sunday:', 
 ' 12:00 pm to 03:30 pm & 07:00 pm to 12:30 am\t\t\t\t\t']

This is my regex (\\r|\\n|\\t)|(?:\D)(\:)

https://regex101.com/r/fV7wI2/1

Please note that I'm trying to match the : after Saturday but not the : in time formats, e.g. 12:00

Although the above image classifies capturing/non capturing groups properly

on running re.sub("(\\r|\\n|\\t)|(?:\D)(\:)",'',"Monday, Tuesday, Wednesday, Thursday, Friday, Saturday:")

returns

'Monday, Tuesday, Wednesday, Thursday, Friday, Saturda' (the 'y' at the end of Saturday is missing)

instead of

'Monday, Tuesday, Wednesday, Thursday, Friday, Saturday'

why is this so?

Wiktor Stribiżew · Accepted Answer · 2015-08-28 08:47:47Z

2

You need to use a look-behind instead of a non-capturing group if you want to check a substring for presence/absence, but exclude it from the match:

import re
s = "Monday, Tuesday, Wednesday, Thursday, Friday, Saturday:"
print(re.sub(r"[\r\n\t]|(?<!\d):",'',s))
#                         ^^^^^^^ 
# Result: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday

Here, (?<!\d) only checks that the character immediately before the colon is not a digit; it does not consume that character.
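To see how this behaves on the whole input, here is a minimal sketch that applies the same pattern to every element of the list from the question. The variable names (raw, pattern, cleaned) are illustrative only.

```python
import re

# Raw data copied from the question.
raw = [
    '\r\n\t\t\t\t\t\t',
    'Monday, Tuesday, Wednesday, Thursday, Friday, Saturday:',
    ' 12:00 pm to 03:30 pm & 07:00 pm to 12:00 am\t\t\t\t\t',
    '\r\n\t\t\t\t\t\t',
    'Sunday:',
    ' 12:00 pm to 03:30 pm & 07:00 pm to 12:30 am\t\t\t\t\t',
]

# Strip CR/LF/TAB, plus any colon that is not immediately preceded by a digit.
pattern = re.compile(r"[\r\n\t]|(?<!\d):")

cleaned = [pattern.sub('', item) for item in raw]
for item in cleaned:
    print(repr(item))
```

The colons inside 12:00, 03:30 and so on survive because each one follows a digit, while the colons after Saturday and Sunday are removed. The leading space in the time strings is left alone, since the pattern does not target spaces.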

Also, alternation involves additional overhead, so the character class [\r\n\t] is preferable. You do not need any capturing groups (round brackets) either, since you are not using them at all.

Also, please note that the regex is initialized with a raw string literal to avoid overescaping.
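As a rough check that the simplification does not change what gets removed, the sketch below compares the whitespace part of the original pattern with the character-class version. The sample string is made up for illustration, and this says nothing about speed.

```python
import re

# Non-raw string: each "\\r" becomes the two characters \r, which the
# regex engine then interprets as a carriage return.
original = re.compile("(\\r|\\n|\\t)")

# Raw string: the backslashes reach the regex engine unchanged.
simplified = re.compile(r"[\r\n\t]")

sample = "\r\n\t\t text\t"  # hypothetical input

print(original.pattern)                   # (\r|\n|\t)
print(repr(original.sub('', sample)))     # ' text'
print(repr(simplified.sub('', sample)))   # ' text'
```

Both patterns strip the same three control characters; the raw string simply makes the pattern easier to read.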

Some more details from Python Regular Expression Syntax regarding non-capturing groups and negative look-behinds:

(?<!...)
- Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length and shouldn’t contain group references. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?:...)
- A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

As look-behinds are zero-width assertions (=expressions returning true or false without moving the index any further in the string), they are exactly what you need in this case where you want to check but not match. A non-capturing group will consume part of the string and thus will be part of the match.
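The difference is easy to inspect on a match object. The following sketch uses a shortened input string ("Saturday:") and compares what each pattern actually consumes.

```python
import re

s = "Saturday:"

# Non-capturing group: \D is not stored as a group, but it is still consumed.
consuming = re.search(r"(?:\D)(:)", s)
print(consuming.group(0))   # y:
print(consuming.groups())   # (':',)
print(consuming.span())     # (7, 9)

# Negative lookbehind: the previous character is only inspected.
asserting = re.search(r"(?<!\d):", s)
print(asserting.group(0))   # :
print(asserting.span())     # (8, 9)
```

re.sub replaces the whole match, i.e. group(0), which is why the 'y' disappears with the non-capturing group but not with the lookbehind.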

edited Aug 28, 2015 at 8:47

answered Aug 28, 2015 at 8:26

Wiktor Stribiżew


1 Comment

Wiktor Stribiżew Over a year ago

I think you misunderstood the term "non-capturing". It just means it won't make a separate captured group for the subtext matched with the group, but the subtext will still be part of the match. I have updated my answer and refined the regex pattern. Please check and let me know if you need more clarifications. I guess Lookarounds Stand their Ground is a must-read for you.

vks · Accepted Answer · 2015-08-28 08:30:52Z

1

\D matches any non-digit. In "Saturday", the y is a non-digit, so it is matched and deleted along with the colon.

Use

import re
print(re.sub(r"(\r|\n|\t)|(?<=\D):", '', "Monday, Tuesday, Wednesday, Thursday, Friday, Saturday:"))

Using a lookbehind ensures you don't delete the extra character before the :
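Note that the positive lookbehind (?<=\D) here and the negative lookbehind (?<!\d) from the other answer are not fully interchangeable. A small sketch of the difference, where ":leading" is a made-up edge case that does not appear in the question's data:

```python
import re

samples = ["Saturday:", "12:00", ":leading"]

results = {}
for s in samples:
    positive = re.sub(r"(?<=\D):", '', s)   # colon must follow a non-digit character
    negative = re.sub(r"(?<!\d):", '', s)   # colon must not follow a digit
    results[s] = (positive, negative)
    print(repr(s), '->', repr(positive), repr(negative))
```

The two agree on the first two inputs. They differ only when the colon is the very first character: (?<=\D) needs an actual preceding character, so it leaves ":leading" untouched, whereas (?<!\d) also succeeds at the start of the string and removes the colon.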

answered Aug 28, 2015 at 8:30

vks


3 Comments

wolfgang Over a year ago

y is a non-digit. However, it's in a non-capturing group (?:\D), so why would it be matched?

vks Over a year ago

@wolfgang Non-capturing means it will not form a group and will not be stored in \1. Capturing or non-capturing, it will always be matched.

vks Over a year ago

@wolfgang "Capturing" is not about what the regex matches but about which groups the regex stores. It just enables or disables groups.

fronthem · Accepted Answer · 2015-08-28 08:44:08Z

1

I think you assumed that (?:\D) would not count as one character of the match. That is wrong: it still matches a character, it just doesn't capture \D into the variable $1. Every time you use (...), you have to realize that whatever the pattern inside (...) matches will be captured into a variable, $1, $2, and so on.
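A short sketch of group numbering in Python, where the replacement string uses \1 rather than $1. The last line shows a capture-and-restore variant that nobody in this thread proposed; it is included only to illustrate what a captured group is useful for.

```python
import re

s = "Saturday:"

# Two capturing groups: \D goes into group 1, the colon into group 2.
both = re.search(r"(\D)(:)", s)
print(both.groups())        # ('y', ':')

# With (?:...) the \D is still matched, but only the colon is numbered.
one = re.search(r"(?:\D)(:)", s)
print(one.groups())         # (':',)
print(one.group(0))         # y:

# Capture the non-digit and put it back in the replacement.
restored = re.sub(r"(\D):", r"\1", s)
print(restored)             # Saturday
```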

The best way to deal with this problem is to use a lookaround (here, a positive or negative lookbehind), as in the answers from @vks and @stribizhev. A lookaround is just an assertion that does not consume any characters, which is why it is called a "zero-width assertion".

edited Aug 28, 2015 at 8:44

answered Aug 28, 2015 at 8:38

fronthem
