When user input contains Unicode characters (e.g. ‘ or ”), the following call fails:
String[] unicodeStrings = answerText.split("((?<=\\R)|(?=\\R))");
I've tried debugging the split method, but I haven't found the root cause. I have a hunch it has something to do with the question mark (?) in the expression.
I've also tried an online Java regex tool and applied the expression to some text containing the characters ‘”. It didn't show any error.
I've also tried writing a simple test method in an online Java compiler, where I passed a test string with the ‘” characters and performed the above-mentioned split. No error there either.
Code:
String answerText = uiq.getAnswerText();
if (answerText.matches("[\\x00-\\x7F]*")) { // the answer consists only of ASCII characters, so we encode it as a whole
    sb.append("<String name=\"answerText\">")
      .append(wrapCdata(uiq.isDate() ? formatDate(uiq.getAnswerText(), sourceFormat, targetFormat) : answerText))
      .append("</String>");
} else { // the answer contains non-ASCII characters, so we escape only the line breaks (the \R)
    String answerNonEscapedText = "";
    // Split at every line break while keeping the line breaks as separate elements,
    // e.g. "a\nb" ---> ["a", "\n", "b"]
    String[] unicodeStrings = answerText.split("((?<=\\R)|(?=\\R))");
    for (String str : unicodeStrings) {
        if (str.matches("\\R")) {
            str = StringEscapeUtils.escapeJava(str);
        }
        answerNonEscapedText += str;
    }
}
Error:
java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence near index 6
((?<=\R)|(?=\R))
      ^
at java.util.regex.Pattern.error(Pattern.java:1924)
at java.util.regex.Pattern.escape(Pattern.java:2416)
at java.util.regex.Pattern.atom(Pattern.java:2164)
at java.util.regex.Pattern.sequence(Pattern.java:2046)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.group0(Pattern.java:2807)
at java.util.regex.Pattern.sequence(Pattern.java:2018)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.group0(Pattern.java:2854)
at java.util.regex.Pattern.sequence(Pattern.java:2018)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.compile(Pattern.java:1665)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at java.lang.String.split(String.java:2313)
at java.lang.String.split(String.java:2355)
Could you please help me find the root cause of the failure?
"((?<=\\R)|(?=\\R))" contains both a zero-width look-behind and a zero-width look-ahead, wrapped in a group. "\\R" on its own should suffice for splitting on newlines. You will not get the line-separator characters ("\n", "\r\n", "\r" or "\u0085") in the result, though. The look-behind and look-ahead were probably added to keep the line separators, and maybe for the last line.
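A further observation on the exception itself: index 6 of the pattern is the R of the first \R, and the \R linebreak matcher only exists from Java 8 onward. So one possible explanation, assuming the failing environment runs an older Java than the online tools, is that the runtime rejects \R. The ASCII-only branch never reaches split, which would explain why only Unicode input triggers the error. Below is a minimal sketch that escapes line breaks without \R by spelling out the line terminators. The class name LineBreakEscaper and the helper escapeChar are hypothetical stand-ins (escapeChar only imitates what StringEscapeUtils.escapeJava does for these characters), not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LineBreakEscaper {

    // Line terminators written out explicitly, so the pattern does not
    // depend on the linebreak matcher that older runtimes reject.
    // "\r\n" comes first so that it is matched as a single unit.
    private static final Pattern LINE_BREAK =
            Pattern.compile("\\r\\n|[\\n\\u000B\\f\\r\\u0085\\u2028\\u2029]");

    // Replaces every line break in the text with its escaped form and
    // leaves all other characters (including non-ASCII ones) untouched.
    static String escapeLineBreaks(String text) {
        Matcher m = LINE_BREAK.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder escaped = new StringBuilder();
            for (char c : m.group().toCharArray()) {
                escaped.append(escapeChar(c));
            }
            // quoteReplacement keeps the backslashes literal in the output.
            m.appendReplacement(out, Matcher.quoteReplacement(escaped.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Stand-in for StringEscapeUtils.escapeJava (assumption: similar output
    // for line-break characters only).
    private static String escapeChar(char c) {
        switch (c) {
            case '\r': return "\\r";
            case '\n': return "\\n";
            case '\f': return "\\f";
            default:   return String.format("\\u%04X", (int) c);
        }
    }

    public static void main(String[] args) {
        System.out.println(escapeLineBreaks("‘quoted”\r\nnext line"));
    }
}
```

This avoids split entirely, so there is no need for the look-behind and look-ahead just to keep the separators. If the runtime turns out to be Java 8 or newer, the original pattern should compile and the cause lies elsewhere.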