I need to strip the following substrings from my string:

\n

\uXXXX (each X being a digit or a letter)

e.g. "OR\n\nThe Central Site Engineering\u2019s \u201cfrontend\u201d, where developers turn to"

-> "OR The Central Site Engineering frontend , where developers turn to"
I tried using the String method replaceAll, but I don't know how to handle the \uXXXX case, and it didn't work for \n either:

String s = "\\n";  
data=data.replaceAll(s," ");

What does this regex look like in Java?

Thanks for the help.

  • Can you describe what you have tried and how it did not work? Also, your text doesn't look like it should be stripped of these characters; rather, they should be replaced with the characters they represent, like \n -> line separator, \u2019 -> ’, \u201c -> “, and so on. Commented Aug 2, 2015 at 17:24
  • So maybe you are asking how you can unescape these characters? Commented Aug 2, 2015 at 17:33
  • I need to replace them with whitespace. I don't need them since the text is going to be indexed with Apache Lucene; I only need the words to show. Commented Aug 2, 2015 at 17:36
  • "I need to replace them with whitespace": based on your example, you want to remove them (replace them with nothing), not replace them with whitespace. But anyway, this is not a hard task, so you must have tried something. Can we see your attempts? Commented Aug 2, 2015 at 17:40
  • Dealing with \n: string.replaceAll("\\n", " "); I also tried putting \n in a String variable instead of writing it inline. Commented Aug 2, 2015 at 17:43

2 Answers

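The problem with string.replaceAll("\\n", " "); is that replaceAll expects a regular expression, and \ is a special character in regex: it is used, for instance, to create character classes like \d (which represents digits) or to escape regex special characters like +. The Java string "\\n" is therefore the regex \n, which matches a line feed character rather than the two characters \ and n.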

So if you want to match a literal \ in a Java regex, you need to escape it twice:

  • once in regex \\
  • and once in String "\\\\".

like replaceAll("\\\\n"," ").

You can also avoid the regex-level escaping altogether by using the replace method, which treats its argument as literal text:

replace("\\n"," ")
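To make the difference between the three calls concrete, here is a minimal sketch. The class and method names are made up for illustration, and it assumes the input contains the two characters \ and n (as if read from a file) rather than a real line feed:

```java
public class BackslashNDemo {
    // Regex \n: matches a real line feed, not a backslash followed by 'n'.
    public static String replaceLineFeed(String s) {
        return s.replaceAll("\\n", " ");
    }

    // Regex \\n: matches a literal backslash followed by 'n'.
    public static String replaceEscapedWithRegex(String s) {
        return s.replaceAll("\\\\n", " ");
    }

    // String.replace takes literal text, so only Java-level escaping is needed.
    public static String replaceEscapedLiterally(String s) {
        return s.replace("\\n", " ");
    }

    public static void main(String[] args) {
        String escaped = "OR\\n\\nThe"; // backslash + 'n', twice
        String real = "OR\n\nThe";      // two real line feeds

        System.out.println(replaceLineFeed(escaped));         // input comes back unchanged
        System.out.println(replaceLineFeed(real));            // OR  The
        System.out.println(replaceEscapedWithRegex(escaped)); // OR  The
        System.out.println(replaceEscapedLiterally(escaped)); // OR  The
    }
}
```

The first call leaves the escaped text untouched, which matches the behaviour described in the question.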

Now to remove \uXXXX we can use

replaceAll("\\\\u[0-9a-fA-F]{4}","")
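As a rough sketch of how that pattern behaves, the snippet below precompiles it and applies it to part of the sample text. The class and method names are hypothetical, and precompiling with Pattern is just one option; the one-line replaceAll call does the same thing:

```java
import java.util.regex.Pattern;

public class UnicodeEscapeStripper {
    // Regex: a literal backslash, the letter 'u', then exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u[0-9a-fA-F]{4}");

    public static String strip(String s) {
        return UNICODE_ESCAPE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: Engineerings frontend
        System.out.println(strip("Engineering\\u2019s \\u201cfrontend\\u201d"));

        // Left unchanged, because 'z' is not a hex digit.
        System.out.println(strip("\\u20zz stays"));
    }
}
```

Note that the character class stops at f/F: hexadecimal digits only run from 0 to 9 and a to f.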


Also remember that Strings are immutable, so a str.replace(..) call doesn't change the value of str; it creates a new String. So if you want to store that new string in str, you need to use

str = str.replace(..)
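A small sketch of that pitfall, with hypothetical method names, assuming the same escaped \n text as above:

```java
public class ImmutableStringDemo {
    // The result of replace is discarded, so the caller gets the original text back.
    public static String ignoringResult(String str) {
        str.replace("\\n", " ");
        return str;
    }

    // The result is assigned back, so the change is kept.
    public static String keepingResult(String str) {
        str = str.replace("\\n", " ");
        return str;
    }

    public static void main(String[] args) {
        System.out.println(ignoringResult("a\\nb")); // still contains the backslash and 'n'
        System.out.println(keepingResult("a\\nb"));  // a b
    }
}
```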

So your solution can look like

String text = "\"OR\\n\\nThe Central Site Engineering\\u2019s \\u201cfrontend\\u201d, where developers turn to\"";

text = text.replaceAll("(\\\\n)+"," ")
           .replaceAll("\\\\u[0-9A-Fa-f]{4}", "");
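If the goal is text for indexing, as mentioned in the comments on the question, one possible variant is to replace every escape with a space and then collapse runs of whitespace. This is only a sketch under that assumption: the class and method names are invented, and replacing \uXXXX with a space (rather than removing it) is a design choice that splits "Engineering\u2019s" into two tokens.

```java
public class EscapeCleaner {
    public static String clean(String text) {
        return text
                // One or more consecutive escaped line feeds or unicode escapes become one space.
                .replaceAll("(?:\\\\n|\\\\u[0-9a-fA-F]{4})+", " ")
                // Collapse any run of whitespace into a single space.
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        String text = "OR\\n\\nThe Central Site Engineering\\u2019s "
                + "\\u201cfrontend\\u201d, where developers turn to";
        // prints: OR The Central Site Engineering s frontend , where developers turn to
        System.out.println(clean(text));
    }
}
```

Removing the escapes instead of replacing them with a space would keep "Engineerings" as one word; which is preferable depends on how the text is tokenized afterwards.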

2 Comments

Many thanks! I needed the explanation regarding the replaceAll parameter!
@D.Shefer You are welcome. But I was able to give you this explanation only because you posted your code attempts. Without them I could only have posted a solution without a proper explanation, which would not have helped you as much. So in future, always post your code attempts so people can see what you are struggling with and give you the best answers.

Best to do this in 2 parts I guess:

String ex = "OR\n\nThe Central Site Engineering\u2019s \u201cfrontend\u201d, where developers turn to";
String part1 = ex.replaceAll("\\\\n"," "); // "\\\\n" is the regex \\n: a literal backslash followed by n.
String part2 = part1.replaceAll("u\\d\\d\\d\\d","");
System.out.println(part2);

Try it =)

2 Comments

OK, I was not precise. It seems that the example we see in the question is not a string literal but text which could, for instance, have been read from a file. So \n is not a line separator but a string of two characters, \ and n. Your solution works, but only because you let the Java compiler turn \n into a line separator, which can then be matched by "\n" or "\\n".
The title of this question implies that a regex is required.
