python regex weirdness

Question

I thought I was ok with regex - but this has me confused - I have this line in python:

dependencies = re.findall( r"-- *depends *on *([^ ]*.*[^ ]) *$", script, re.MULTILINE)

which works really well with:

"-- depends on    b    "    -> ["b"]
"-- depends on b"           -> ["b"]
"--dependson  green things    \n-- depends on red things\nother stuff" -> ["green things", "red things"]
"-- depends on b \n-- depends on c" -> ["b", "c"]

but doesn't work on

"-- depends on b\n-- depends on c" -> ["b\n-- depends on c"]

I get that it's going to be some weirdness about the fact that $ matches before the newline - but what I don't get is how to fix the regex?

Results look weird if the specs are not set precisely. What are your pattern requirements? Did you mean to match any chunk of chars other than whitespace at the end of line after depends on? Then try -- *depends *on *(\S+) *$. Or, even --[^\S\r\n]*depends[^\S\r\n]*on[^\S\r\n]*(\S+)[^\S\r\n]*$ to support Unicode horizontal whitespace.
– Wiktor Stribiżew, Commented Aug 21, 2021 at 8:08
No, I want to match exactly as the results show. The dependency can quite happily have spaces in it, so "-- depends on purple people eaters " should return "purple people eaters", but of course throw away any whitespace at the beginning and end of the expression.
– Darren Oakey, Commented Aug 21, 2021 at 8:11
Then it must be something like -- *depends *on *(\S(?:.*\S)?) *$, or even -- *depends *on *(.*\S) *$
– Wiktor Stribiżew, Commented Aug 21, 2021 at 8:15
Although, going off what you suggest, this works: r"--[^\S\r\n]*depends[^\S\r\n]*on[^\S\r\n]+([^\s\r\n]*[^\r\n]*\S)\s*$" but it's ugly. Unicode aside, I still don't understand why the original one didn't work.
– Darren Oakey, Commented Aug 21, 2021 at 8:18
It is evident: [^ ] matches any char other than a space, so it matches a newline char.
– Wiktor Stribiżew, Commented Aug 21, 2021 at 8:19

Wiktor Stribiżew · Accepted Answer · 2021-08-21 08:58:32Z

In Python re, the re.MULTILINE option only redefines the behavior of two anchors, ^ and $, which then match at the start and end of every line, not just of the whole string:

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string. Corresponds to the inline flag (?m).
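To see the anchor behaviour described in that quote, here is a minimal sketch. The sample string and the variable names are made up for illustration; only re.findall and re.MULTILINE come from the question.

```python
import re

text = "one\ntwo\nthree\n"  # illustrative input, not from the question

# Default: $ matches only at the end of the string and just before
# a newline that ends the string.
default_matches = re.findall(r"\w+$", text)
print(default_matches)    # ['three']

# re.MULTILINE: $ additionally matches just before every newline.
multiline_matches = re.findall(r"\w+$", text, re.MULTILINE)
print(multiline_matches)  # ['one', 'two', 'three']
```

Note that the flag only moves where the anchors can match. It does not change which characters . or a character class will consume.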

Next, the [^ ] negated character class matches any char other than a literal regular space char (\x20, dec. code 32). Thus, [^ ]* matches any zero or more chars other than a space (including a newline, too).
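A short reproduction of that effect, using the pattern and two of the inputs from the question (the variable names are just for illustration):

```python
import re

original = r"-- *depends *on *([^ ]*.*[^ ]) *$"

# A newline is "not a space", so the negated class accepts it.
newline_is_matched = re.fullmatch(r"[^ ]", "\n") is not None
print(newline_is_matched)  # True

# No space before the newline: [^ ]* runs through "b", the newline and
# "--", and the match only ends at the end of the second line.
failing = re.findall(original, "-- depends on b\n-- depends on c", re.MULTILINE)
print(failing)             # ['b\n-- depends on c']

# With a space before the newline, [^ ]* stops at that space, which is
# why this input happened to work.
working = re.findall(original, "-- depends on b \n-- depends on c", re.MULTILINE)
print(working)             # ['b', 'c']
```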

You can use

-- *depends *on *(.*\S) *$

Or, if you can have non-breaking spaces or other horizontal Unicode spaces

--[^\S\r\n]*depends[^\S\r\n]*on[^\S\r\n]*(.*\S)[^\S\r\n]*$

In Python, you can use

h = r'[^\S\r\n]'
pattern = fr'--{h}*depends{h}*on{h}*(.*\S){h}*$'

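The {h}*(.*\S) part does the job: {h}* first matches and consumes zero or more horizontal whitespace chars, then Group 1 captures as many chars other than line break chars as possible (.*) followed by one non-whitespace char (\S). Since . does not match a line break and \S does not match whitespace, the capture cannot cross a line and always ends on a non-whitespace char.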
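Putting it together, here is a sketch that runs the pattern against the inputs from the question. The helper name find_dependencies is hypothetical, and precompiling the pattern with re.compile is a choice made here rather than part of the answer.

```python
import re

h = r"[^\S\r\n]"  # whitespace that is neither CR nor LF
pattern = re.compile(fr"--{h}*depends{h}*on{h}*(.*\S){h}*$", re.MULTILINE)

def find_dependencies(script):
    """Hypothetical helper: return every captured dependency name."""
    return pattern.findall(script)

print(find_dependencies("-- depends on    b    "))            # ['b']
print(find_dependencies("-- depends on b\n-- depends on c"))  # ['b', 'c']
print(find_dependencies(
    "--dependson  green things    \n-- depends on red things\nother stuff"
))  # ['green things', 'red things']
print(find_dependencies("-- depends on purple people eaters "))
# ['purple people eaters']
```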

sbingner · Accepted Answer · 2021-08-21 08:14:16Z

It's matching the "\n" newline as "not a space". For this example, you can fix it like so:

-- *depends *on *([^ \n]*.*[^ \n]) *$

You probably really wanted something like:

--\s*depends\s*on\s*(\S*.*\S)\s*$

\s matches any whitespace character (including newlines) and \S matches any non-whitespace character.
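A quick check of both patterns from this answer against the failing input from the question. The last input is an assumed edge case, not something from the question: because \s also matches a newline, the broader pattern can reach into the next line when nothing follows "depends on".

```python
import re

script = "-- depends on b\n-- depends on c"

narrow = r"-- *depends *on *([^ \n]*.*[^ \n]) *$"
broad = r"--\s*depends\s*on\s*(\S*.*\S)\s*$"

# Both patterns handle the input that broke the original regex.
narrow_result = re.findall(narrow, script, re.MULTILINE)
broad_result = re.findall(broad, script, re.MULTILINE)
print(narrow_result)  # ['b', 'c']
print(broad_result)   # ['b', 'c']

# Assumed edge case: an empty dependency followed by an unrelated line.
edge = "-- depends on\nnext line"
narrow_edge = re.findall(narrow, edge, re.MULTILINE)
broad_edge = re.findall(broad, edge, re.MULTILINE)
print(narrow_edge)    # []
print(broad_edge)     # ['next line'], since \s* stepped over the newline
```

If that edge case can occur in your scripts, a class that excludes line breaks, such as [^\S\r\n], is the safer replacement for \s.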

