Each element of this raw data array is parsed by regex

['\r\n\t\t\t\t\t\t', 
 'Monday, Tuesday, Wednesday, Thursday, Friday, Saturday:', 
 ' 12:00 pm to 03:30 pm & 07:00 pm to 12:00 am\t\t\t\t\t',      
 '\r\n\t\t\t\t\t\t', 
 'Sunday:', 
 ' 12:00 pm to 03:30 pm & 07:00 pm to 12:30 am\t\t\t\t\t']

This is my regex (\\r|\\n|\\t)|(?:\D)(\:)

https://regex101.com/r/fV7wI2/1

Please note that I'm trying to match the : after Saturday but not the : in time formats, e.g. 12:00.

Although regex101 (linked above) classifies the capturing/non-capturing groups properly,

on running re.sub("(\\r|\\n|\\t)|(?:\D)(\:)",'',"Monday, Tuesday, Wednesday, Thursday, Friday, Saturday:")

returns

'Monday, Tuesday, Wednesday, Thursday, Friday, Saturda' (missing 'y' after saturday)

instead of

'Monday, Tuesday, Wednesday, Thursday, Friday, Saturday'

Why is this so?

3 Answers

You need to use a look-behind instead of a non-capturing group if you want to check a substring for presence/absence, but exclude it from the match:

import re
s = "Monday, Tuesday, Wednesday, Thursday, Friday, Saturday:"
print(re.sub(r"[\r\n\t]|(?<!\d):",'',s))
#                         ^^^^^^^ 
# Result: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday

See IDEONE demo

Here, (?<!\d) only checks that the character immediately before the colon is not a digit; it also succeeds when the colon is at the very start of the string.

Also, alternation involves additional overhead, so the character class [\r\n\t] is preferable to (\r|\n|\t), and you do not need any capturing groups (round brackets) since you are not using them at all.

Also, please note that the regex is initialized with a raw string literal to avoid overescaping.
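To tie this back to the question, here is a minimal sketch that applies the same pattern to every element of the raw data array. The helper name clean_items, the extra .strip() call, and dropping elements that end up empty are assumptions for illustration, not part of the original answer.

```python
import re

# Raw data copied from the question.
raw = [
    '\r\n\t\t\t\t\t\t',
    'Monday, Tuesday, Wednesday, Thursday, Friday, Saturday:',
    ' 12:00 pm to 03:30 pm & 07:00 pm to 12:00 am\t\t\t\t\t',
    '\r\n\t\t\t\t\t\t',
    'Sunday:',
    ' 12:00 pm to 03:30 pm & 07:00 pm to 12:30 am\t\t\t\t\t',
]

# Remove CR/LF/TAB anywhere, plus any colon that is not preceded by a digit.
CLEANUP = re.compile(r"[\r\n\t]|(?<!\d):")


def clean_items(items):
    """Hypothetical helper: clean each element and drop those left empty."""
    # .strip() is an extra step (an assumption) to trim the leading spaces.
    cleaned = (CLEANUP.sub('', item).strip() for item in items)
    return [item for item in cleaned if item]


for line in clean_items(raw):
    print(line)
```

The colons inside the times survive because each one is preceded by a digit, while the colons after the day names are removed.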

Some more details from Python Regular Expression Syntax regarding non-capturing groups and negative look-behinds:

(?<!...)
- Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length and shouldn’t contain group references. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?:...)
- A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

As look-behinds are zero-width assertions (=expressions returning true or false without moving the index any further in the string), they are exactly what you need in this case where you want to check but not match. A non-capturing group will consume part of the string and thus will be part of the match.
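The difference is easy to see by inspecting what each pattern actually matches. The following is a small sketch; the sample string and the variable names are made up for illustration.

```python
import re

s = "Saturday: 12:00 pm"

# The non-capturing group still consumes the character before the colon.
consuming = re.search(r"(?:\D):", s)
# The lookbehind only inspects that character; it is not part of the match.
asserting = re.search(r"(?<!\d):", s)

print(repr(consuming.group(0)), consuming.span())  # 'y:' (7, 9)
print(repr(asserting.group(0)), asserting.span())  # ':' (8, 9)
```

Since re.sub replaces the whole match, the first pattern removes "y:" and the second removes only ":".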


1 Comment

I think you misunderstood the term "non-capturing". It just means it won't make a separate captured group for the subtext matched with the group, but the subtext will still be part of the match. I have updated my answer and refined the regex pattern. Please check and let me know if you need more clarifications. I guess Lookarounds Stand their Ground is a must-read for you.

\D matches any non-digit. The y in Saturday is a non-digit, so it is matched along with the colon and deleted.

Use

import re
print(re.sub(r"(\r|\n|\t)|(?<=\D):", '', "Monday, Tuesday, Wednesday, Thursday, Friday, Saturday:"))

Using a lookbehind ensures that you don't delete the extra character before the :
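One edge case is worth knowing about: the positive lookbehind (?<=\D) requires a character to exist before the colon, whereas a negative lookbehind such as (?<!\d) also succeeds at the very start of the string. The sample strings below are invented to illustrate this and do not come from the question's data.

```python
import re

samples = ["Sunday:", ":leading colon", "12:30 am"]

for text in samples:
    positive = re.sub(r"(?<=\D):", "", text)  # needs a non-digit before ':'
    negative = re.sub(r"(?<!\d):", "", text)  # needs "no digit" before ':'
    print(repr(text), "->", repr(positive), "|", repr(negative))
```

For the data in the question both forms behave the same, because every day-name colon has a letter in front of it.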

3 Comments

y is a non-digit. However, it's in a non-capturing group (?:\D), so why would it be matched?
@wolfgang Non-capturing means it will not form a group and will not be stored in \1. Capturing or non-capturing, it will always be matched.
@wolfgang "Capturing" does not refer to what the regex matches, but to the groups the regex stores. It just enables or disables groups.

I think you assumed that (?:\D) does not count as a character in the match, but that is wrong: it just doesn't capture \D into the variable $1. Every time you use (...), the pattern inside (...) is captured into a variable such as $1, $2, ... (\1, \2, ... in Python), whereas (?:...) still matches and consumes text but creates no such variable.
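A quick way to check this in Python is to compare the whole match with the numbered groups. This is only a sketch with an invented sample string; group(0) is the full match and groups() lists the captured groups.

```python
import re

s = "Saturday:"

capturing = re.search(r"(\D)(:)", s)
non_capturing = re.search(r"(?:\D)(:)", s)

# Both patterns match the same text, 'y:' ...
print(capturing.group(0), non_capturing.group(0))
# ... but only the capturing version stores 'y' as a numbered group.
print(capturing.groups())      # ('y', ':')
print(non_capturing.groups())  # (':',)
```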

The best way to deal with this problem is to use a positive or negative lookbehind, as in the answers from @vks and @stribizhev, because a lookaround is just an assertion that does not consume any characters. This is why they are called "zero-width assertions".
