Regular expression \b in Java and JavaScript

Question

Is there any difference in how the regular expression \b behaves in Java and JavaScript?
I tried the test below.
In JavaScript:

console.log(/\w+\b/.test("test中文"));//true

In Java:

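import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regEx = "\\w+\\b";
String text = "test中文"; // the original snippet omitted the type declaration
Pattern pattern = Pattern.compile(regEx);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println("matched"); // never executed
}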

Why are the results of the two examples above not the same?

JavaScript regexes don't understand Unicode.

– georg, Commented May 24, 2015 at 15:37

Pshemo · Accepted Answer · 2015-05-24 16:39:25Z

3

That is because, by default, Java supports Unicode for \b but not for \w, while JavaScript doesn't support Unicode for either.

So \w can only match [a-zA-Z0-9_] characters (in our case, test), but \b does not accept the position marked with | in

test|中文

as a boundary between an alphabetic and a non-alphabetic character, because Unicode considers both t and 中 to be alphabetic.
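To see the mismatch directly, you can check how each side classifies the two characters around that position. This is only a sketch: it assumes, as nhahtdh's comment on this answer describes, that the default \b relies on Character.isLetterOrDigit(). The class and method names are made up for illustration, and '\u4e2d' is the escape for 中.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class WordCharCheck {

    // True if the default (ASCII-only) \w accepts the character.
    static boolean matchesDefaultWordClass(char c) {
        return Pattern.matches("\\w", String.valueOf(c));
    }

    // True if Unicode classifies the character as a letter or digit.
    static boolean isUnicodeLetterOrDigit(char c) {
        return Character.isLetterOrDigit(c);
    }

    public static void main(String[] args) {
        // 't' and U+4E2D, the characters on either side of the marked position
        for (char c : new char[] {'t', '\u4e2d'}) {
            System.out.println(c + ": \\w=" + matchesDefaultWordClass(c)
                    + ", letterOrDigit=" + isUnicodeLetterOrDigit(c));
        }
    }
}
```

Both characters count as letters for Unicode, but only t is accepted by the default \w, which is the gap described above.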

If you want a \b that ignores Unicode, you can use look-arounds and rewrite it as (?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)); in this example, a simple (?!\\w) in place of \\b also works.
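A minimal sketch of both look-around variants applied to the string from the question. The class name, the ASCII_BOUNDARY constant and the findAll helper are illustrative only, and "test\u4e2d\u6587" is the escaped form of "test中文".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class AsciiBoundaryDemo {

    // ASCII-only stand-in for \b, built from look-arounds on \w.
    static final String ASCII_BOUNDARY = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";

    // Collects every match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "test\u4e2d\u6587"; // "test" followed by the two CJK characters
        System.out.println(findAll("\\w+" + ASCII_BOUNDARY, text)); // [test]
        System.out.println(findAll("\\w+(?!\\w)", text));           // [test]
    }
}
```

Because both variants are defined purely in terms of \w, they stay consistent with it regardless of how \b itself treats non-ASCII letters.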

If you want \w to support Unicode as well, compile your pattern with the Pattern.UNICODE_CHARACTER_CLASS flag (which can also be written as the embedded flag expression (?U)).
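For example, a small sketch that applies the flag both ways to the pattern from the question. The class and helper names are hypothetical, and the string literal uses escapes for 中文.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class UnicodeWordDemo {

    // Returns the first match of regex in text under the given flags, or null if none.
    static String firstMatch(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "test\u4e2d\u6587"; // "test" followed by the two CJK characters

        // Flag passed to Pattern.compile
        System.out.println(firstMatch("\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS, text));

        // Same flag written as an embedded expression
        System.out.println(firstMatch("(?U)\\w+\\b", 0, text));
    }
}
```

With the flag on, \w accepts the CJK characters too, so both calls should return the whole string rather than just test.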

edited May 24, 2015 at 16:39

answered May 24, 2015 at 16:10

Pshemo


2 Comments

ruakh Over a year ago

+1. Note that you don't usually want or need to write all of (?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)), since in almost any context, some part of it will be redundant. In the OP's case, the non-Unicode version of \\b is just (?!\\w).

nhahtdh Over a year ago

Additionally, in Java, \b is not in sync with the definition of \w. By default, \b treats _ plus anything accepted by Character.isLetterOrDigit() as a word character (which is Unicode-aware, but this implementation is incorrect); in UNICODE_CHARACTER_CLASS mode, it is in sync with the definition of \w.

laune · Accepted Answer · 2015-05-24 15:43:30Z

1

The Java regex looks for a sequence of word characters, i.e. [a-zA-Z_0-9]+, followed by a word boundary. But 中文 doesn't match \w. If you use \\b alone, you'll find two matches: at the beginning and at the end of the string.
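A quick way to inspect this yourself is to list the offsets at which \\b alone matches. The class and method names below are made up for illustration. The first boundary is at offset 0; where the second one lands depends on how the running JDK's \b classifies 中文, so the sketch prints the offsets rather than assuming them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class BoundaryPositions {

    // Returns the start offset of every (zero-length) \b match in text.
    static List<Integer> boundaryOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(text);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "test\u4e2d\u6587"; // "test" followed by the two CJK characters
        System.out.println(boundaryOffsets(text));
    }
}
```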

As georg pointed out, JavaScript doesn't interpret characters the same way Java's regex engine does.

answered May 24, 2015 at 15:43

laune


2 Comments

RealSkeptic Over a year ago

That's actually strange, because a word boundary is supposed to be on the boundary between \w and \W. Since test matches \w+ and 中文 matches \W+ there should have been a \b there.

laune Over a year ago

@RealSkeptic Java's regex engine has had problems in earlier releases as well. Here, text matches "\\w+\\W+", but not "\\w+\\b", which is contrary to elementary logic. (1.8.0_20)
