Java regex working differently on Android than in Java

Question

I'm using Java regex on Android and I'm seeing strange differences, such as the following:

Java: "COSÌ".replaceAll( "\\W", "" ) ----> "COS"

Android: "COSÌ".replaceAll( "\\W", "" ) ----> "COSÌ"

Anyone noticed similar differences between Java and Android String class?

@TheLostMind: Android uses ICU regex, the last time I checked the documentation. ICU and Java are quite similar, but not the same.
– nhahtdh, Commented May 12, 2015 at 8:17
@nhahtdh: Oh. "The regular expression implementation used in Android is provided by ICU. The notation for the regular expressions is mostly a superset of those used in other Java language implementations. This means that existing applications will normally work as expected, but in rare cases Android may accept a regular expression that is not accepted by other implementations." Well, this one might be one of those rare cases :)
– TheLostMind, Commented May 12, 2015 at 8:21

nhahtdh · Accepted Answer · 2015-05-13 09:40:41Z

Android

Straight from the Android documentation, right after the list of short-hand character classes (\d, \w, \s, etc.):

Note that these built-in classes don't just cover the traditional ASCII range. For example, \w is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].

This explains why Ì is not removed when the same code runs on Android: Ì is an uppercase letter (\p{Lu}), so \w matches it and \W does not.

While it is correct that the short-hand character classes also match Unicode characters, the sample definition of \w in the Android documentation is badly outdated. See the Appendix for more details.

Java SE

In contrast, in Java SE, by default, \w is equivalent to [a-zA-Z_0-9].

\w only matches Unicode word characters when the Pattern.UNICODE_CHARACTER_CLASS flag is specified. When the flag is specified:

In Java 7, \w has the same definition as [\p{IsAlphabetic}\p{M}\p{Nd}\p{Pc}]
In Java 8, \w is updated to [\p{IsAlphabetic}\p{M}\p{Nd}\p{Pc}\u200c\u200d]
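
The effect of the flag can be seen with a minimal sketch, assuming Java SE 7 or later (the class and method names below are made up for illustration). It applies \W to the string from the question twice: once with the default ASCII-only definition and once with UNICODE_CHARACTER_CLASS.

```java
import java.util.regex.Pattern;

public class WordClassFlagDemo {
    // Default Java SE behavior: \W is the complement of [a-zA-Z_0-9].
    static String stripDefault(String input) {
        return input.replaceAll("\\W", "");
    }

    // With UNICODE_CHARACTER_CLASS, \W is the complement of the Unicode word set.
    static String stripUnicodeAware(String input) {
        return Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(input)
                .replaceAll("");
    }

    public static void main(String[] args) {
        String input = "COS\u00CC"; // "COSÌ" written with the precomposed U+00CC
        System.out.println(stripDefault(input));      // Ì is removed
        System.out.println(stripUnicodeAware(input)); // Ì is kept
    }
}
```

On Java SE the first call drops Ì and the second keeps it, which mirrors the Java/Android difference described in the question.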

Workaround

Specify the character class explicitly. ICU regex doesn't support ASCII-only character classes:

[^a-zA-Z0-9_]
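
As a rough sketch of how the workaround might be packaged (the class and method names are hypothetical), the explicit class can be compiled once and reused. Because it does not rely on \w or \W, it should give the same result regardless of how the engine defines the short-hand classes.

```java
import java.util.regex.Pattern;

public class AsciiWordFilter {
    // Explicit ASCII class: no dependence on the engine's definition of \w.
    private static final Pattern NON_ASCII_WORD = Pattern.compile("[^a-zA-Z0-9_]");

    // Removes every character that is not an ASCII letter, ASCII digit, or underscore.
    static String keepAsciiWordChars(String input) {
        return NON_ASCII_WORD.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(keepAsciiWordChars("COS\u00CC")); // prints COS
    }
}
```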

Appendix

Definition of `\w` in ICU

Here is how the definition of \w has evolved over time:

- Up to ICU 3.0, the short-hand character class \w was defined as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}] (as shown in the documentation).
- From ICU 3.2 (released on 2006/02/24) up to and including ICU 4.8.1.1, [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}] (equivalent to [\p{Alphabetic}\p{M}\p{Nd}\p{Pc}] in the source code) is used instead. Changed at revision 16634.
- From ICU 49 (released on 2012/06/06), the current definition in the documentation is used: [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d] (equivalent to [\p{Alphabetic}\p{M}\p{Nd}\p{Pc}\u200c\u200d] in the source code). Changed at revision 31278.

The string above is used to construct URX_ISWORD_SET, which is used in regcmp.cpp in doBackslashW to compile the regex.
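
To get a feel for how far the documented set is from the later one, here is a small sketch that spells both out as explicit character classes in Java SE. This only approximates ICU: the property names are the Java SE spellings (\p{IsAlphabetic} instead of \p{Alphabetic}), the Unicode tables are the JDK's rather than ICU's, and the class and field names are made up for illustration.

```java
import java.util.regex.Pattern;

public class IcuWordSetComparison {
    // Set quoted in the Android documentation (ICU up to 3.0, per the list above).
    static final Pattern DOCUMENTED =
            Pattern.compile("[\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}]");

    // Set used from ICU 49 on (per the list above), in Java SE notation.
    static final Pattern CURRENT =
            Pattern.compile("[\\p{IsAlphabetic}\\p{M}\\p{Nd}\\p{Pc}\\u200c\\u200d]");

    // True if the whole string (here a single character) belongs to the set.
    static boolean inSet(Pattern set, String s) {
        return set.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Underscore (Pc), combining acute accent (Mn), and Ì (Lu).
        String[] samples = {"_", "\u0301", "\u00CC"};
        for (String s : samples) {
            System.out.printf("U+%04X documented=%b current=%b%n",
                    (int) s.charAt(0), inSet(DOCUMENTED, s), inSet(CURRENT, s));
        }
    }
}
```

The underscore and the combining mark fall outside the documented set but inside the later one, while Ì belongs to both, so the outdated documentation still happens to predict the behavior seen in the question.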

ICU version used by Android

Even at android-1.6_r1 (Donut), when the Pattern class documentation was still barren, Android was already using ICU 3.8. The source code shows that it uses the definition from the second bullet point.

The documentation probably still describes the behavior of the oldest version of Android.

Reference

If you want to navigate around the source code of Android yourself:

libcore (Java Class Library)
- From android-1.6_r1 up to android-2.2.3_r2.1, platform/dalvik repository. Pattern class can be located at libcore/regex/src/main/java/java/util/regex/Pattern.java
- From android-2.3_r1 to now, platform/libcore repository. Pattern class can be located at /luni/src/main/java/java/util/regex/Pattern.java
icu4c (ICU library for C)
- From android-1.6_r1 up to android-4.4.4_r2.0.1, platform/external/icu4c repository. Regex-related code can be found in i18n, Unicode-related code in common.
- From android-5.0.0_r1 to now, platform/external/icu. Enter icu4c/source, then follow the same paths as above.

Wiktor Stribiżew · Accepted Answer · 2015-05-12 11:10:19Z

Have a look at Android Regular expression syntax documentation:

Note that these built-in classes don't just cover the traditional ASCII range. For example, \w is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]. For more details see Unicode TR-18, and bear in mind that the set of characters in each class can vary between Unicode releases. If you actually want to match only ASCII characters, specify the explicit characters you want; if you mean 0-9 use [0-9] rather than \d, which would also include Gurmukhi digits and so forth.

Thus, use an explicit character class to make sure that only ASCII letters, digits, and the underscore are kept: replaceAll("[^a-zA-Z0-9_]", "").
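
The same advice applies to digits. The sketch below runs on Java SE, where plain \d is ASCII-only, so the UNICODE_CHARACTER_CLASS flag is used here as a stand-in for the Unicode-aware \d that the quoted documentation describes; that substitution and the class and method names are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class DigitClassDemo {
    // Explicit range: only the ASCII digits 0-9.
    static final Pattern ASCII_DIGITS = Pattern.compile("[0-9]+");

    // Unicode-aware \d: any decimal digit, in any script.
    static final Pattern UNICODE_DIGITS =
            Pattern.compile("\\d+", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean isAsciiNumber(String s) {
        return ASCII_DIGITS.matcher(s).matches();
    }

    static boolean isUnicodeNumber(String s) {
        return UNICODE_DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String gurmukhi = "\u0A67\u0A68"; // Gurmukhi digits one and two
        System.out.println(isAsciiNumber(gurmukhi));   // false
        System.out.println(isUnicodeNumber(gurmukhi)); // true
    }
}
```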

edited May 12, 2015 at 11:10

answered May 12, 2015 at 8:14


4 Comments

nhahtdh Over a year ago

If you want to match Unicode case-insensitively in Java SE, UNICODE_CASE must be used with CASE_INSENSITIVE flag; otherwise, only ASCII range characters are matched case insensitively. This has nothing to do with how character class \w behaves.

Wiktor Stribiżew Over a year ago

@nhahtdh: You downvote my answer because of explanation and provide the same solution? I would just comment without answering in such a case.

nhahtdh Over a year ago

The solution is the same (it's extremely basic regex), but the explanation is completely different (before your edit), and I will downvote for providing a wrong explanation.

Wiktor Stribiżew Over a year ago

@nhahtdh: I really like your consistency that some of my downvoters usually lack. Thank you for healthy criticism.
