I'm using the PyPI module regex for regex matching. Its documentation says:

  • Default Unicode word boundary

    The WORD flag changes the definition of a ‘word boundary’ to that of a default Unicode word boundary. This applies to \b and \B.

But nothing seems to have changed:

>>> r1 = regex.compile(r".\b.", flags=regex.UNICODE)
>>> r2 = regex.compile(r".\b.", flags=regex.UNICODE | regex.WORD)
>>> r1.findall("русский  ελλανικα")
['й ', ' ε']
>>> r2.findall("русский  ελλανικα")
['й ', ' ε']

I didn't observe any difference...?

  • The way you can tell is to use a non-Unicode regex simulation (?:(?:^|(?<=[^a-zA-Z0-9_]))(?=[a-zA-Z0-9_])|(?<=[a-zA-Z0-9_])(?:$|(?=[^a-zA-Z0-9_]))) which has no match... obviously! Commented Sep 20, 2018 at 1:19
  • @sln No. Python's regex matches Unicode with \w correctly, and that flag only affects \b, as the docs say. I recommend you quit this argument. Commented Sep 20, 2018 at 1:22
  • Well, I guess WORD doesn't affect boundary correctly, unless you can prove it .. Commented Sep 20, 2018 at 1:25
  • For what it's worth, you can see the same behavior here regex101.com/r/0a0pfX/1 and note that the default state has no flags other than global. I assume it is using the re module, but there is a Unicode flag that does nothing, so it might be a holdover within the regex module so as not to disturb anything. Commented Sep 20, 2018 at 1:28
  • @sln regex101 isn't good for this. I specifically said I'm using a 3rd-party module instead of Python's stock re. There are differences. Commented Sep 20, 2018 at 1:32

1 Answer

The difference the WORD flag makes is in how word boundaries are defined.

Given this example:

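import regex

t = 'A number: 3.4 :)'

# Default \b: the period is a \W character, so "3" is bounded on both sides.
print(regex.search(r'\b3\b', t))
# WORD flag: \b follows the Unicode rules, which do not break inside "3.4".
print(regex.search(r'\b3\b', t, flags=regex.WORD))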

The first call prints a match, while the second prints None. Why? A “Unicode word boundary” is defined by a set of rules for deciding where words break, whereas the default Python word boundary is simply any position between a \w character and a \W character (or the start or end of the string). In both cases \w is still Unicode alphanumeric.

In the example, 3.4 is split by Python’s default word boundary because the period is a \W character, so the position between 3 and the period is a word boundary. Under the Unicode word boundary rules, one rule covers “Forbidden Breaks on “.””, with “3.4” as an example, so the position between 3 and the period is not considered a word boundary.

See all the Unicode word boundary rules here: https://unicode.org/reports/tr29/#Word_Boundary_Rules
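To see the two definitions side by side, a small sketch can list every offset at which \b matches. It assumes the third-party regex package is installed (pip install regex); the helper name boundary_positions is made up for illustration. The rules that seem to apply here are WB11 and WB12 of UAX #29, which keep a separator such as “.” attached to the digits on either side of it.

```python
import regex  # third-party package from PyPI, not the stdlib re module


def boundary_positions(text, flags=0):
    r"""Return every offset in text where \b matches (hypothetical helper)."""
    return [m.start() for m in regex.finditer(r"\b", text, flags=flags)]


sample = "3.4"

# Default \b: a boundary wherever a \w character meets a \W character
# or the start/end of the string.
default_positions = boundary_positions(sample)

# WORD flag: \b follows the default Unicode word boundary rules instead.
word_positions = boundary_positions(sample, flags=regex.WORD)

print("default:", default_positions)
print("WORD:   ", word_positions)
```

If the rules behave as described, the default list includes offsets 1 and 2 (either side of the period), while the WORD list does not.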

Conclusion:

Both modes work with Unicode or your LOCALE, but the WORD flag applies an additional set of rules for deciding where word boundaries fall, instead of relying only on the empty string next to a \W character, since by default “a word is defined as a sequence of word character [\w]”.
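Applied to the pattern from the question, this suggests why no difference showed up: in "русский  ελλανικα" the boundaries that .\b. reaches are all between a letter and a space, and both definitions agree on those. A minimal sketch, again assuming the third-party regex package; the text "3.4" is borrowed from the example above to give the two flag settings something to disagree about.

```python
import regex  # third-party package from PyPI

pattern = r".\b."
plain = regex.compile(pattern, flags=regex.UNICODE)
word = regex.compile(pattern, flags=regex.UNICODE | regex.WORD)

# The question's text: the boundaries this pattern reaches are between
# a letter and a space, so both definitions find the same matches.
question_text = "русский  ελλανικα"
same_plain = plain.findall(question_text)
same_word = word.findall(question_text)

# A decimal number: the default \b sees a boundary next to the period,
# while the Unicode rules do not break inside "3.4".
number_text = "3.4"
diff_plain = plain.findall(number_text)
diff_word = word.findall(number_text)

print(same_plain, same_word)
print(diff_plain, diff_word)
```

So to observe the flag's effect, the test text needs a position where the two definitions disagree, such as a period between digits.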

4 Comments

Are you sure a word boundary is defined there? I mean, it looks like a lot of word_break properties, not to be confused with the \b syntax.
So it is the word break property. I can tell you that, in my opinion, it is nearly impossible to implement that in the \b construct. In the non-Unicode implementation of \b done in C, it is really a string primitive without much overhead. In a Unicode implementation, once the word is defined (and it's more than the alnum property, it's all the minutiae that underscore represents), it is much more complicated.
Yeah, I see that. I can tell you that regex implementers will not try to implement this complexity via sentences at all. I can see the guy who wrote regex trying it, though. Have you seen some of the bizarre syntax he uses for regex? Omg...
Yeah, that’s probably why it’s not an option in the standard regex library in Python.
