I am developing a simple regex to parse part of a URL. The regex must capture part of the URL in a named group, and only a few characters are allowed (a-z, 0-9 and -). If any other character is present, the regex should fail for the given string and capture nothing.

But as you can see in the screenshot, when the regex finds a % sign it stops and captures the part before it (if that part is at least two characters long). The result is the same without the word boundaries (\b).

I can't understand why % is acting like \n, with the engine capturing the preceding characters and stopping. The % is not in the list of allowed characters, so the match should fail for that string... or not?

I've tried it in the actual PHP code as well, with the very same result.

EDIT 1:

Actual PHP code:

// '#' is used as the delimiter because the pattern itself contains '/'.
if (preg_match('#/fixed_url_part/\b(?P<codename>[a-z0-9-]{2,})\b#', $url, $regs)) {
    return $regs['codename'];
}
  • Exact code in the question would be useful. It looks as though your placeholder simply looks for alphanumeric chars, which excludes %. Commented Aug 25, 2015 at 17:02
  • I edited the question to add the code, but the point is: why does it capture the preceding chars when the string contains %, yet fail when the string contains, for example, _? Why is it not failing with %? Commented Aug 25, 2015 at 17:16
  • Without the end anchor (as pointed out by @Halcyon) your pattern only matches "until" it finds no more matching characters. And the \b word boundary holds true when encountering %. Commented Aug 25, 2015 at 17:23
  • Thanks @mario, I did not know that \b matches before a %. Commented Aug 25, 2015 at 19:14

1 Answer

You didn't tell it to match the full line. Add $ to have it match the end.

^/fixed_url_part/\b(?P<codename>[a-z0-9\-]{2,})\b$
^-- match start of line                          ^-- match end of line
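
As a minimal sketch of how the anchored pattern could be used from PHP: the function name `extractCodename` and the sample URLs are hypothetical, and `#` is chosen as the delimiter only so the slashes in the path need no escaping. Note that in PCRE `$` also matches just before a trailing newline; `\z` or the `D` modifier would be stricter.

```php
<?php
// Hypothetical helper (name and sample URLs are illustrative, not from the question).
function extractCodename(string $url): ?string
{
    // ^ and $ force the whole string to match, so any disallowed
    // character anywhere in the codename makes the match fail.
    $pattern = '#^/fixed_url_part/\b(?P<codename>[a-z0-9-]{2,})\b$#';

    if (preg_match($pattern, $url, $regs) === 1) {
        return $regs['codename'];
    }
    return null;
}

var_dump(extractCodename('/fixed_url_part/my-code'));  // string(7) "my-code"
var_dump(extractCodename('/fixed_url_part/my%code'));  // NULL
```

With the anchors in place, the \b assertions no longer decide where the match ends; they only require the codename to start and end with a letter, digit, or underscore-class character, which here means not with a hyphen.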
4 Comments

Keep - in the group as well, as the OP wants.
I'd also add ^, just in case. (I guess that abc/fixed_url_part/def should fail.)
With the end-of-string anchor ($) it works fine, but what I want to know is why, with % in the string, the regex captures part of it when it should fail (as it does if the character is _ instead of %).
I think it's because of \b (word boundary). % is a non-word character, so there is a word boundary between the last letter and the %, whereas _ counts as a word character, so there is no boundary before it. With % the \b is satisfied and the match succeeds; with _ it is not.
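
The difference can be reproduced with a short sketch. The sample URLs are hypothetical, and `#` is used as the delimiter as an assumption for this example; the pattern is otherwise the unanchored one from the question.

```php
<?php
// Unanchored pattern from the question ('#' delimiter is an assumption of this sketch).
$pattern = '#/fixed_url_part/\b(?P<codename>[a-z0-9-]{2,})\b#';

// Hypothetical sample URLs differing only in the disallowed character.
$withPercent    = '/fixed_url_part/abc%def';
$withUnderscore = '/fixed_url_part/abc_def';

// "abc" is followed by '%', a non-word character, so \b holds and the match succeeds.
$percentResult = preg_match($pattern, $withPercent, $m1);

// "abc" is followed by '_', a word character, so there is no boundary after "c";
// backtracking to "ab" does not help either, and the match fails.
$underscoreResult = preg_match($pattern, $withUnderscore, $m2);

var_dump($percentResult, $m1['codename'] ?? null); // int(1), string(3) "abc"
var_dump($underscoreResult);                       // int(0)
```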