Is there any difference in how the regular expression \b behaves in Java and JavaScript?
I tried the test below.
In JavaScript:

console.log(/\w+\b/.test("test中文"));//true  

In Java:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regEx = "\\w+\\b";
String text = "test中文";
Pattern pattern = Pattern.compile(regEx);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println("matched"); // never executed
}

Why are the results of the two examples above not the same?

Javascript regexes don't understand unicode. Commented May 24, 2015 at 15:37

2 Answers

That is because, by default, Java supports Unicode for \b but not for \w, while JavaScript supports Unicode for neither.

So \w can only match [a-zA-Z0-9_] characters (in our case, test), but \b does not accept the position marked with |

test|中文

as a boundary between a word character and a non-word character, because Unicode considers both t and 中 to be alphabetic.

If you want a \b that ignores Unicode, you can emulate it with look-arounds and write it as (?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)). In this example, a simple (?!\\w) in place of \\b will also work.
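To make the look-around suggestion concrete, here is a minimal sketch. The class name AsciiBoundaryDemo and the helper findAll are made up for illustration, and the input is written with Unicode escapes (\u4e2d\u6587 spells the two CJK characters from the question) so the source file stays ASCII-only. It relies only on \w being ASCII-only by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class AsciiBoundaryDemo {
    // ASCII-only word boundary built from look-arounds, as suggested above.
    static final String ASCII_BOUNDARY = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";

    // Collects the text of every match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // "test" followed by the two CJK characters from the question
        String text = "test\u4e2d\u6587";
        // Full emulation of an ASCII-only \b
        System.out.println(findAll("\\w+" + ASCII_BOUNDARY, text)); // [test]
        // Shortened form that is enough after \w+
        System.out.println(findAll("\\w+(?!\\w)", text)); // [test]
    }
}
```

Both patterns stop after test, because the first CJK character is not an ASCII \w character and so satisfies (?!\\w).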

If you want \w to support Unicode as well, compile your pattern with the Pattern.UNICODE_CHARACTER_CLASS flag (which can also be written as the embedded flag expression (?U)).
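A short sketch of both spellings of the flag, assuming the same input as in the question (again written with Unicode escapes). The class name UnicodeWordDemo and the helper firstMatch are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class UnicodeWordDemo {
    // Returns the first match of the regex (compiled with the given flags),
    // or null if the regex does not match anywhere in the input.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "test" followed by the two CJK characters from the question
        String text = "test\u4e2d\u6587";
        // Flag passed to Pattern.compile
        String viaFlag = firstMatch("\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS, text);
        // Same flag written inline as (?U)
        String viaInline = firstMatch("(?U)\\w+\\b", 0, text);
        // With the flag, \w also covers the CJK characters,
        // so each match should span the whole input.
        System.out.println(text.equals(viaFlag));   // true
        System.out.println(text.equals(viaInline)); // true
    }
}
```

With the flag set, \w+ consumes all six characters and \b then matches at the end of the string.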

2 Comments

+1. Note that you don't usually want or need to write all of (?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)), since in almost any context, some part of it will be redundant. In the OP's case, the non-Unicode version of \\b is just (?!\\w).
Additionally, in Java, \b is not in sync with the definition of \w. By default, \b treats _ plus anything accepted by Character.isLetterOrDigit() as a word character (which is Unicode-aware, but this implementation is incorrect); in UNICODE_CHARACTER_CLASS mode, it is in sync with the definition of \w.

The Java regex looks for a sequence of word characters, i.e. [a-zA-Z_0-9]+, followed by a word boundary. But 中文 doesn't fit \w. If you use \\b alone, you'll find two matches: the beginning and the end of the string.
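If you want to see where a bare \\b matches on your own JVM, a small sketch like the following lists the match offsets. BareBoundaryDemo and boundaryPositions are made-up names. The offsets depend on how the running JDK defines \b (as far as I know, newer releases align it with \w by default), so the sketch prints them without assuming particular values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class BareBoundaryDemo {
    // Returns the offsets at which a bare \b matches in the input.
    static List<Integer> boundaryPositions(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // "test" followed by the two CJK characters from the question
        String text = "test\u4e2d\u6587";
        // Offsets are printed as-is; which ones appear depends on the JDK.
        System.out.println(boundaryPositions(text));
    }
}
```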

As has been pointed out by georg, Javascript isn't interpreting characters the same way as Java's Regex engine.

2 Comments

That's actually strange, because a word boundary is supposed to be on the boundary between \w and \W. Since test matches \w+ and 中文 matches \W+ there should have been a \b there.
@RealSkeptic Java's regex engine has had problems in earlier releases as well. Here, text matches "\\w+\\W+", but not "\\w+\\b", which is contrary to elementary logic. (1.8.0_20)
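The comparison in the last comment can be reproduced with a sketch along these lines. WordThenNonWordDemo and its two helpers are illustrative names only. Only the \\w+\\W+ result is stated in the code; the \\w+\\b result is printed without a claim, since it may differ between JDK releases.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original thread.
class WordThenNonWordDemo {
    // True if the regex matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the regex matches somewhere in the input.
    static boolean foundAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "test" followed by the two CJK characters from the question
        String text = "test\u4e2d\u6587";
        // By default the CJK characters are \W, so this matches the whole input.
        System.out.println(matchesWhole("\\w+\\W+", text)); // true
        // Result depends on how the running JDK defines \b.
        System.out.println(foundAnywhere("\\w+\\b", text));
    }
}
```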
