Split returns PatternSyntaxException: Illegal/unsupported escape sequence

Question

When a user input contains Unicode characters (e.g. ‘ or ” ), the following action fails:

String[] unicodeStrings = answerText.split("((?<=\\R)|(?=\\R))");

I've tried debugging the split method, but I haven't found the root cause. I have a hunch it has something to do with the question mark (?) in the expression.

I've also tried an online Java regex tool and applied the expression to some text containing the characters ‘”. It didn't show any error.

I've also tried writing a simple test method in an online Java compiler, where I passed a test string with the ‘” characters and performed the above-mentioned split. No error either.

Code:

String answerText = uiq.getAnswerText();
            if (answerText.matches("[\\x00-\\x7F]*")) //if the answerString consists only of ascii characters we encode it
                sb.append("<String name=\"answerText\">")
                        .append(wrapCdata(uiq.isDate() ? formatDate(uiq.getAnswerText(), sourceFormat, targetFormat) : answerText)).append("</String>");
            else { //if the answerString consists of unicode characters we encode only the Linebreakers (the \R)
                String answerNonEscapedText = "";
                String[] unicodeStrings = answerText.split("((?<=\\R)|(?=\\R))");//This regex splits the string to its linebreak-delimiters, including them. i.e. ("$$$\r\n" ---> [$,$,$,\r\n])
                for (String str : unicodeStrings) {
                    if (str.matches("\\R"))
                        str = StringEscapeUtils.escapeJava(str);

                    answerNonEscapedText += str;
                }

Error:

java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence near index 6 
((?<=\R)|(?=\R)) 
 ^ 
 at java.util.regex.Pattern.error(Pattern.java:1924) 
 at java.util.regex.Pattern.escape(Pattern.java:2416) 
 at java.util.regex.Pattern.atom(Pattern.java:2164) 
 at java.util.regex.Pattern.sequence(Pattern.java:2046) 
 at java.util.regex.Pattern.expr(Pattern.java:1964) 
 at java.util.regex.Pattern.group0(Pattern.java:2807) 
 at java.util.regex.Pattern.sequence(Pattern.java:2018) 
 at java.util.regex.Pattern.expr(Pattern.java:1964) 
 at java.util.regex.Pattern.group0(Pattern.java:2854) 
 at java.util.regex.Pattern.sequence(Pattern.java:2018) 
 at java.util.regex.Pattern.expr(Pattern.java:1964) 
 at java.util.regex.Pattern.compile(Pattern.java:1665) 
 at java.util.regex.Pattern.<init>(Pattern.java:1337) 
 at java.util.regex.Pattern.compile(Pattern.java:1022) 
 at java.lang.String.split(String.java:2313) 
 at java.lang.String.split(String.java:2355)

Could you please help me find the root cause of the failure?

"((?<=\\R)|(?=\\R))" combines a look-behind and a look-ahead, both zero-width matches, inside a group. Splitting on plain "\\R" should suffice, although you then lose the line-separator characters themselves ("\n", "\r\n", "\r" or "\u0085"). The look-behind and look-ahead were probably there to keep the line separators, and maybe for the last line.
– Joop Eggen, Commented Jul 25, 2019 at 13:44

Joop Eggen · Accepted Answer · 2019-07-25 14:10:41Z

1

        String answerText = uiq.getAnswerText();
        if (answerText.matches("[\\x00-\\x7F]*")) {
            sb.append("<String name=\"answerText\">")
              .append(wrapCdata(uiq.isDate()
                      ? formatDate(uiq.getAnswerText(), sourceFormat, targetFormat)
                      : answerText))
              .append("</String>");
        } else {
            String[] unicodeStrings = answerText.split("\\R"); // Splits on linebreaks.
            // This loses the exact line delimiter.
            String answerNonEscapedText = ""; // A StringBuilder would be better.
            for (String str : unicodeStrings) {
                // Every line gets a literal \r\n appended, whatever delimiter it had.
                answerNonEscapedText += str + "\\r\\n";
            }
        }

In some cases this loss of the original line delimiters matters: there are CSV files in which a field value may contain \n line separators while each record ends in \r\n, and similar formats.
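
The trade-off between the two split patterns can be seen in a small comparison. This is only a sketch, assuming Java 8 or later (where \R is supported); the class and method names are made up for illustration.

```java
import java.util.Arrays;

final class SplitComparison {
    // Splits on the linebreaks themselves, so they are consumed and lost.
    static String[] withoutDelimiters(String text) {
        return text.split("\\R");
    }

    // Splits at the zero-width positions next to a linebreak, so nothing is consumed.
    static String[] withDelimiters(String text) {
        return text.split("(?<=\\R)|(?=\\R)");
    }

    public static void main(String[] args) {
        String text = "a\r\nb\nc";
        System.out.println(Arrays.toString(withoutDelimiters(text)));
        // Joining the pieces of the look-around split gives back the input unchanged.
        System.out.println(String.join("", withDelimiters(text)).equals(text));
    }
}
```

Because the position between \r and \n is itself adjacent to a linebreak, the look-around form may return \r and \n as two separate pieces rather than one.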

A simpler solution:

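        // Java >= 9
        // quoteReplacement keeps the backslashes produced by escapeJava from
        // being interpreted as escapes in the replacement string.
        String answerText = Pattern.compile("\\R").matcher(uiq.getAnswerText())
            .replaceAll(mr -> Matcher.quoteReplacement(
                    StringEscapeUtils.escapeJava(mr.group())));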


        // Java < 9 (only for \r and \n)
        String answerText = uiq.getAnswerText()
            .replace("\r", "\\r").replace("\n", "\\n");
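
The two variants above leave a gap at Java 8, where \R is available but Matcher.replaceAll with a function (added in Java 9) is not. A possible sketch for that case uses the older appendReplacement/appendTail loop. Here escapeChar is a hypothetical stand-in for StringEscapeUtils.escapeJava, limited to linebreak characters so that the example needs no external library, and the class name is likewise made up. Matcher.quoteReplacement is used because replacement strings treat backslashes and dollar signs specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinebreakEscaper {
    private static final Pattern LINEBREAK = Pattern.compile("\\R"); // needs Java 8+

    // Hypothetical stand-in for StringEscapeUtils.escapeJava (linebreak characters only).
    static String escapeChar(char c) {
        switch (c) {
            case '\r': return "\\r";
            case '\n': return "\\n";
            case '\f': return "\\f";
            default:   return String.format("\\u%04X", (int) c);
        }
    }

    // Replaces every linebreak sequence with its escaped, printable form.
    static String escapeLinebreaks(String text) {
        Matcher m = LINEBREAK.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder escaped = new StringBuilder();
            for (char c : m.group().toCharArray()) {
                escaped.append(escapeChar(c));
            }
            // Quote the result so its backslashes are inserted literally.
            m.appendReplacement(out, Matcher.quoteReplacement(escaped.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeLinebreaks("a\r\nb\nc"));
    }
}
```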


2 Comments

T.J. Crowder Over a year ago

(Hanging my head at somehow having missed \R despite searching the pattern page...)

Jr. Over a year ago

Thanks for reviewing and upgrading the code, Joop. The expression was, however, correct, and I'm probably not going to refactor it. The root cause was dumber than I anticipated.

Community · Accepted Answer · 2020-06-20 09:12:55Z

0

In this case, the regular expression was not incorrect. However, \R is supported only by Java 8+, and I had Java 7 in my environment. Upgrading Java solved the issue.

Pattern (Java Platform SE 7)

Perl constructs not supported by this class:

Predefined character classes (Unicode character)

\h A horizontal whitespace

\H A non horizontal whitespace

\v A vertical whitespace

\V A non vertical whitespace

\R Any Unicode linebreak sequence \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

\X Match Unicode extended grapheme cluster
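
If upgrading is not an option, the equivalent expression quoted above for \R can be written out explicitly. The following is only a sketch with made-up class and method names: it tries the shorthand first and falls back to the explicit alternation when the runtime rejects it with a PatternSyntaxException, as Java 7 does.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class LinebreakPattern {
    // Explicit spelling of the linebreak alternatives listed in the Pattern documentation.
    static final String EXPLICIT_LINEBREAK =
            "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // Prefer the shorthand; fall back when the runtime does not know it.
    static Pattern linebreak() {
        try {
            return Pattern.compile("\\R");
        } catch (PatternSyntaxException e) {
            return Pattern.compile(EXPLICIT_LINEBREAK);
        }
    }

    public static void main(String[] args) {
        String[] lines = linebreak().split("first\r\nsecond\nthird\rfourth");
        System.out.println(lines.length);
    }
}
```

The same explicit alternation could also be substituted into the look-behind and look-ahead of the original split expression, since each alternative has a bounded length.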


answered Jul 26, 2019 at 12:37

Jr.
