When user input contains Unicode characters (e.g. ‘ or ”), the following call fails:

String[] unicodeStrings = answerText.split("((?<=\\R)|(?=\\R))");

I've tried debugging the split method, but I haven't found the root cause. I have a hunch it has something to do with the question mark (?) in the expression.

I've also tried an online Java regex tool and applied the expression to some text containing the characters ‘ and ”. It didn't show any error.

I've also tried writing a simple test method in an online Java compiler, where I passed a test string with the ‘ and ” characters and performed the split shown above. No error either.

Code:

String answerText = uiq.getAnswerText();
            if (answerText.matches("[\\x00-\\x7F]*")) //if the answerString consists only of ascii characters we encode it
                sb.append("<String name=\"answerText\">")
                        .append(wrapCdata(uiq.isDate() ? formatDate(uiq.getAnswerText(), sourceFormat, targetFormat) : answerText)).append("</String>");
            else { //if the answerString consists of unicode characters we encode only the Linebreakers (the \R)
                String answerNonEscapedText = "";
                String[] unicodeStrings = answerText.split("((?<=\\R)|(?=\\R))");//This regex splits the string to its linebreak-delimiters, including them. i.e. ("$$$\r\n" ---> [$,$,$,\r\n])
                for (String str : unicodeStrings) {
                    if (str.matches("\\R"))
                        str = StringEscapeUtils.escapeJava(str);

                    answerNonEscapedText += str;
                }

Error:

java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence near index 6 
((?<=\R)|(?=\R)) 
 ^ 
 at java.util.regex.Pattern.error(Pattern.java:1924) 
 at java.util.regex.Pattern.escape(Pattern.java:2416) 
 at java.util.regex.Pattern.atom(Pattern.java:2164) 
 at java.util.regex.Pattern.sequence(Pattern.java:2046) 
 at java.util.regex.Pattern.expr(Pattern.java:1964) 
 at java.util.regex.Pattern.group0(Pattern.java:2807) 
 at java.util.regex.Pattern.sequence(Pattern.java:2018) 
 at java.util.regex.Pattern.expr(Pattern.java:1964) 
 at java.util.regex.Pattern.group0(Pattern.java:2854) 
 at java.util.regex.Pattern.sequence(Pattern.java:2018) 
 at java.util.regex.Pattern.expr(Pattern.java:1964) 
 at java.util.regex.Pattern.compile(Pattern.java:1665) 
 at java.util.regex.Pattern.<init>(Pattern.java:1337) 
 at java.util.regex.Pattern.compile(Pattern.java:1022) 
 at java.lang.String.split(String.java:2313) 
 at java.lang.String.split(String.java:2355)

Could you please help me find the root cause of the failure?

1
  • "((?<=\\R)|(?=\\R))" combines a zero-width look-behind and a zero-width look-ahead inside a group. Plain "\\R" should suffice for splitting on line breaks, although you then lose the line-separator characters themselves ("\n", "\r\n", "\r" or "\u0085"). The look-behind and look-ahead were probably added to keep the line separators in the result, and perhaps to handle the last line. Commented Jul 25, 2019 at 13:44

2 Answers

1
        String answerText = uiq.getAnswerText();
        if (answerText.matches("[\\x00-\\x7F]*")) {
            sb.append("<String name=\"answerText\">")
              .append(wrapCdata(uiq.isDate()
                      ? formatDate(uiq.getAnswerText(), sourceFormat, targetFormat)
                      : answerText))
              .append("</String>");
        } else {
            String[] unicodeStrings = answerText.split("\\R"); // Splits on line breaks.
            // This loses the exact line delimiter.
            String answerNonEscapedText = ""; // A StringBuilder would be better here.
            for (String str : unicodeStrings) {
                answerNonEscapedText += str + "\\r\\n";
            }
        }

In some cases this loss of the original line delimiters matters: for example, a CSV file may have field values that contain \n line separators while the records themselves end in \r\n.

A simpler solution:

        // Java >= 9
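        // The function's result is treated as a replacement string, so the
        // backslashes produced by escapeJava must be quoted to stay literal.
        String answerText = Pattern.compile("\\R").matcher(uiq.getAnswerText())
            .replaceAll(mr -> Matcher.quoteReplacement(StringEscapeUtils.escapeJava(mr.group())));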


        // Java < 9 (only for \r and \n)
        String answerText = uiq.getAnswerText()
            .replace("\r", "\\r").replace("\n", "\\n");
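
On Java 8, where \R is available but the lambda overload of Matcher.replaceAll is not, the same idea can be sketched with appendReplacement. The class name LineBreakEscaper and the helper escapeLineBreak are hypothetical; the helper is only a stand-in for StringEscapeUtils.escapeJava so that the example has no external dependency, and it handles line-break characters only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineBreakEscaper {
    // \R (any Unicode line-break sequence) requires Java 8 or later.
    private static final Pattern LINE_BREAK = Pattern.compile("\\R");

    // Hypothetical stand-in for StringEscapeUtils.escapeJava,
    // limited to the characters that can appear in a line break.
    static String escapeLineBreak(String lineBreak) {
        StringBuilder out = new StringBuilder();
        for (char c : lineBreak.toCharArray()) {
            switch (c) {
                case '\r': out.append("\\r"); break;
                case '\n': out.append("\\n"); break;
                case '\f': out.append("\\f"); break;
                default:   out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    // Replaces every line break with its escaped form and leaves all other
    // characters, including non-ASCII ones, untouched.
    static String escapeLineBreaks(String text) {
        Matcher m = LINE_BREAK.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps the backslashes in the escaped text literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(escapeLineBreak(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeLineBreaks("‘first’\r\n“second”\nthird"));
    }
}
```

Because \R matches "\r\n" as a single sequence, the original delimiter is preserved in escaped form rather than normalized.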

2 Comments

(Hanging my head at somehow having missed \R despite searching the pattern page...)
Thanks for reviewing and upgrading the code, Joop. The expression was, however, correct, and I'm probably not going to refactor it. The root cause was dumber than I anticipated.
0

In this case, the regular expression was not incorrect. It is, however, supported only by Java 8+, and I had Java 7 in my environment. Upgrading Java solved the issue.
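
A quick way to confirm this kind of mismatch is to ask the runtime itself. The following is a minimal sketch; the class name RegexSupportCheck and the method supports are made up for illustration. It prints the Java version the code actually runs on and reports whether a given expression compiles there.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexSupportCheck {
    // Returns true if the running JVM can compile the given expression.
    static boolean supports(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The version of the runtime, which may differ from the one
        // used by an online tool or a local compiler.
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("\\R supported: " + supports("\\R"));
    }
}
```

Running this in the failing environment and in the online tool should show different versions if the runtime is the culprit.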

Pattern (Java Platform SE 7)

Perl constructs not supported by this class:

Predefined character classes (Unicode character)

\h A horizontal whitespace

\H A non horizontal whitespace

\v A vertical whitespace

\V A non vertical whitespace

\R Any Unicode linebreak sequence \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

\X Match Unicode extended grapheme cluster
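
If upgrading is not an option, the explicit alternation listed for \R above can be used in its place. The sketch below assumes that substitution is acceptable; the class name LineBreakCompat and the method splitKeepingLineBreaks are hypothetical. It uses a find loop instead of look-arounds, so the result contains the text segments and the original line breaks in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineBreakCompat {
    // Spelled-out alternation that the Pattern documentation gives for \R.
    // The CR LF pair comes first so it is matched as one unit.
    static final Pattern LINE_BREAK = Pattern.compile(
            "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]");

    // Splits text into segments and line breaks, keeping the line breaks.
    static List<String> splitKeepingLineBreaks(String text) {
        List<String> parts = new ArrayList<String>();
        Matcher m = LINE_BREAK.matcher(text);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                parts.add(text.substring(last, m.start()));
            }
            parts.add(m.group());
            last = m.end();
        }
        if (last < text.length()) {
            parts.add(text.substring(last));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Print each part with its length so line breaks are visible.
        for (String part : splitKeepingLineBreaks("‘one’\r\n“two”\nthree")) {
            System.out.println(part.length() + " char(s)");
        }
    }
}
```

Each line-break element in the returned list can then be escaped on its own, as in the original loop.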
