I thought I was OK with regex, but this has me confused. I have this line in Python:

dependencies = re.findall(r"-- *depends *on *([^ ]*.*[^ ]) *$", script, re.MULTILINE)

which works really well with:

"-- depends on    b    "    -> ["b"]
"-- depends on b"           -> ["b"]
"--dependson  green things    \n-- depends on red things\nother stuff" -> ["green things", "red things"]
"-- depends on b \n-- depends on c" -> ["b", "c"]

but doesn't work on:

"-- depends on b\n-- depends on c" -> ["b\n-- depends on c"]

I get that it has something to do with the fact that $ matches before the newline, but what I don't get is how to fix the regex.

  • Results look weird if the specs are not set precisely. What are your pattern requirements? Did you mean to match any chunk of chars other than whitespace at the end of line after depends on? Then try -- *depends *on *(\S+) *$. Or, even --[^\S\r\n]*depends[^\S\r\n]*on[^\S\r\n]*(\S+)[^\S\r\n]*$ to support Unicode horizontal whitespace. Commented Aug 21, 2021 at 8:08
  • No, I want to match exactly as the results show. The dependency can quite happily have spaces in it, so "-- depends on purple people eaters " should return "purple people eaters", but of course throw away any whitespace at the beginning and end of the expression. Commented Aug 21, 2021 at 8:11
  • Then it must be something like -- *depends *on *(\S(?:.*\S)?) *$, or even -- *depends *on *(.*\S) *$ Commented Aug 21, 2021 at 8:15
  • Although, going off what you suggest, this works: r"--[^\S\r\n]*depends[^\S\r\n]*on[^\S\r\n]+([^\s\r\n]*[^\r\n]*\S)\s*$" but it's ugly. Unicode aside, I still don't understand why the original one didn't work. Commented Aug 21, 2021 at 8:18
  • It is evident: [^ ] matches any char other than a space, so it matches a newline char. Commented Aug 21, 2021 at 8:19

2 Answers

1

In Python re, the re.MULTILINE option only redefines the behavior of two anchors, ^ and $, so that they match at the start and end of every line, not just of the whole string:

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string. Corresponds to the inline flag (?m).
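
As a small illustration of the quoted behavior, the sketch below runs the same pattern with and without the flag. The sample text and the variable names are made up for this example.

```python
import re

text = "one\ntwo"

# Without the flag, $ only matches at the end of the string
# (or just before a trailing newline), so only the last word is found.
default_hits = re.findall(r"\w+$", text)

# With re.MULTILINE, $ also matches just before each newline,
# so the last word of every line is found.
multiline_hits = re.findall(r"\w+$", text, re.MULTILINE)

print(default_hits)
print(multiline_hits)
```

Note that the flag changes nothing about what character classes or the dot can match; it only affects the two anchors.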

Next, the [^ ] negated character class matches any char other than a literal regular space char (\x20, dec. code 32). Thus, [^ ]* matches any zero or more chars other than a space (including a newline, too).
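
To see this in isolation, here is a minimal reproduction using the pattern and the failing input from the question. The variable names are only for illustration.

```python
import re

# A negated class excludes only the characters it lists,
# so [^ ] accepts a newline just like any other non-space char.
newline_ok = re.fullmatch(r"[^ ]", "\n") is not None

original = r"-- *depends *on *([^ ]*.*[^ ]) *$"
script = "-- depends on b\n-- depends on c"

# [^ ]* runs across the line break ("b\n--"), then .* takes the rest
# of the second line, so both lines end up in a single capture.
result = re.findall(original, script, re.MULTILINE)

print(newline_ok)
print(result)
```

In the inputs that did work, a space sat between the dependency and the newline, which stopped [^ ]* before it could reach the line break.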

You can use

-- *depends *on *(.*\S) *$

Or, if you can have non-breaking spaces or other horizontal Unicode spaces

--[^\S\r\n]*depends[^\S\r\n]*on[^\S\r\n]*(.*\S)[^\S\r\n]*$

In Python, you can use

h = r'[^\S\r\n]'
pattern = fr'--{h}*depends{h}*on{h}*(.*\S){h}*$'

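The {h}*(.*\S) part does the job: zero or more horizontal whitespace chars are matched and consumed first; then .* matches as many chars other than line break chars as possible, followed by one non-whitespace char (\S), and both are captured into Group 1. Because the capture must end with \S, any trailing whitespace is left for the final {h}* to consume.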
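
A quick way to check the pattern is to run it over the inputs from the question, including the one that failed. This is only a sketch; the list of inputs and the names scripts and results are assumptions made for the example.

```python
import re

# Horizontal whitespace: any whitespace character except CR and LF.
h = r"[^\S\r\n]"
pattern = re.compile(fr"--{h}*depends{h}*on{h}*(.*\S){h}*$", re.MULTILINE)

# Inputs taken from the question; the last one is the case that failed.
scripts = [
    "-- depends on    b    ",
    "-- depends on b",
    "--dependson  green things    \n-- depends on red things\nother stuff",
    "-- depends on b \n-- depends on c",
    "-- depends on b\n-- depends on c",
]

results = [pattern.findall(s) for s in scripts]
for script, found in zip(scripts, results):
    print(repr(script), "->", found)
```

Since neither .* nor {h} can match a line break, no capture can span two lines.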

0

It's matching the "\n" newline as "not a space". You can fix it like so for this example:

-- *depends *on *([^ \n]*.*[^ \n]) *$

You probably really wanted something like:

--\s*depends\s*on\s*(\S*.*\S)\s*$

\s means "any whitespace character" and \S means "any non-whitespace character".
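
The sketch below compares the two patterns on the failing input from the question. It also tries one edge case that is not in the question (a "depends on" line with nothing after it), added here as an assumption to show a difference worth knowing: \s also matches newlines, so the \s version can continue onto the next line.

```python
import re

narrow = r"-- *depends *on *([^ \n]*.*[^ \n]) *$"  # first suggestion
broad = r"--\s*depends\s*on\s*(\S*.*\S)\s*$"       # second suggestion

# The input from the question that the original pattern got wrong.
failing = "-- depends on b\n-- depends on c"
narrow_result = re.findall(narrow, failing, re.MULTILINE)
broad_result = re.findall(broad, failing, re.MULTILINE)

# Hypothetical edge case: nothing follows "depends on" on its line.
empty_dep = "-- depends on\nnext line"
narrow_edge = re.findall(narrow, empty_dep, re.MULTILINE)
# Here \s* after "on" consumes the newline, so the next line is captured.
broad_edge = re.findall(broad, empty_dep, re.MULTILINE)

print(narrow_result, broad_result)
print(narrow_edge, broad_edge)
```

If that edge case can occur in your scripts, a whitespace class that excludes line breaks, such as [^\S\r\n], avoids the problem.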
