21

How do I remove strange and unwanted Unicode characters (such as a black diamond with question mark) from a String?

Updated:

Please tell me the Unicode character or regex that corresponds to "a black diamond with a question mark in it".

4

11 Answers

21

The black diamond with a question mark is U+FFFD (�), the Unicode replacement character. It is a placeholder rather than real content: it is typically substituted when bytes cannot be decoded into a valid character (for example, when text is read with the wrong charset), and a renderer may also show it for a character it cannot display. Its appearance varies depending on the font you're using.
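As a rough illustration of where U+FFFD can come from, here is a minimal sketch. The class and method names (ReplacementCharDemo, decodeAsUtf8, stripReplacementChars) are made up for this example, and it assumes the bad input is ISO-8859-1 bytes being decoded as UTF-8, which may not be your situation.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Decodes bytes as UTF-8; malformed sequences are replaced with U+FFFD.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Removes every U+FFFD (replacement character) from the input.
    static String stripReplacementChars(String s) {
        return s.replace("\uFFFD", "");
    }

    public static void main(String[] args) {
        // 0xE9 is an accented e in ISO-8859-1, but on its own it is not valid UTF-8.
        byte[] latin1Bytes = {'c', 'a', 'f', (byte) 0xE9};

        String decoded = decodeAsUtf8(latin1Bytes);
        System.out.println(decoded.contains("\uFFFD"));     // true
        System.out.println(stripReplacementChars(decoded)); // caf
    }
}
```

Stripping the character only hides the symptom. If the original text matters, decoding with the correct charset is usually the better fix.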

You can use java.text.Normalizer to decompose accented characters and then strip everything that is not in the plain ASCII character set.
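A minimal sketch of the Normalizer approach might look like this. The class name AsciiFold and the method toAscii are placeholders for illustration, and the choice of NFD plus an ASCII-only filter is one reasonable reading of the suggestion, not the only one.

```java
import java.text.Normalizer;

public class AsciiFold {
    // Decomposes accented letters (NFD), then drops everything outside ASCII.
    static String toAscii(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // The accents and the U+FFFD are removed; the base letters survive.
        System.out.println(toAscii("caf\u00E9 \uFFFD na\u00EFve")); // prints "cafe  naive"
    }
}
```

Note that this also deletes any character that has no ASCII base letter, such as Chinese or Cyrillic text.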

4 Comments

This sounds interesting and could lead to the ideal solution I need. Please, when you get a chance, could you give me an example of this?
I will expand on this answer this afternoon.
Tried @Chris's help, didn't help either :(
The strange character is still not removed from the string.
20

You can use a String.replaceAll("[my-list-of-strange-and-unwanted-chars]","")

There is no Character.isStrangeAndUnWanted(), you have to define what you want.

If you want to remove control characters you can do

String str = "\u0000\u001f hi \n";
str = str.replaceAll("[\u0000-\u001f]", "");

The result is " hi ": the control characters, including the newline, are removed and the spaces are kept.
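If a character class gets awkward to maintain, the definition of "unwanted" can also be written as a predicate over code points. The sketch below is only one possible definition: isUnwanted and clean are hypothetical names, and the rule (drop U+FFFD, unassigned code points, and control characters except tab, line feed and carriage return) is an assumption you would adapt.

```java
public class UnwantedFilter {
    // Hypothetical definition of "unwanted": U+FFFD, unassigned code points,
    // and control characters other than tab, line feed and carriage return.
    static boolean isUnwanted(int cp) {
        if (cp == '\t' || cp == '\n' || cp == '\r') {
            return false;
        }
        return cp == 0xFFFD || !Character.isDefined(cp) || Character.isISOControl(cp);
    }

    // Rebuilds the string from the code points that pass the filter.
    static String clean(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> !isUnwanted(cp))
         .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("\u0000a\uFFFDb\u001f")); // prints "ab"
    }
}
```

Working on code points rather than chars keeps characters outside the Basic Multilingual Plane intact.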

EDIT: If you want to know the numeric value of any 16-bit character (UTF-16 code unit) in the string, you can do

int num = string.charAt(n);
System.out.println(num);
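To identify a mystery character it often helps to print every code point in the usual U+XXXX notation. This is a small sketch along those lines; the class CodePointDump and the method describe are names invented for the example.

```java
public class CodePointDump {
    // Formats each code point of s as U+XXXX, separated by single spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("a\uFFFD")); // prints "U+0061 U+FFFD"
    }
}
```

If the output contains U+FFFD, the string really holds the replacement character; any other value tells you exactly which character to target.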

9 Comments

There is no Character.isStrangeAndUnWanted() - Ruby has it :p
I want to know what's the Unicode character that correspond to "a black diamond with question mark in it".
It does if you are trying to remove UUUUFFFD ;) perhaps try replaceAll("\uFFFD", ""); instead.
7

To delete non-ASCII characters from the string I use the following code:

String s = "小米体验版 latin string 01234567890";
s = s.replaceAll("[^\\x00-\\x7F]", "");

The output string will be: " latin string 01234567890"

4

Justin Thomas's was close, but this is probably closer to what you're looking for:

String nonStrange = strangeString.replaceAll("\\p{Cntrl}", ""); 

The selector \p{Cntrl} selects "A control character: [\x00-\x1F\x7F]."

3

I did it the other way around: I remove every character that is not in an explicit whitelist (the ^ negates the character class):

str = str.replaceAll("[^a-zA-Z0-9:;.?! ]", "");

So for a string like "小米体验版 latin string 01234567890" we get " latin string 01234567890" (the leading space is kept).

2

Use String.replaceAll():

String clean = "♠clean".replaceAll("♠", "");

0

Put the characters that you want to get rid of in a list, then iterate through the list and remove each one in turn:

import java.util.Arrays;
import java.util.List;

String str = "Some text with unicode !@#$";
List<String> badChars = Arrays.asList("@", "~", "!"); // modify this to contain the unwanted characters

String resultStr = str;
for (String s : badChars) {
    // replace() treats s literally, so regex metacharacters need no escaping
    resultStr = resultStr.replace(s, "");
}

You will end up with the cleaned string in resultStr. I haven't tested this, but it should be along these lines.

3 Comments

I can't replace anything if I don't know the corresponding word or letter for "a black diamond with question mark in it". If you know what it is, please tell me.
@user224270 you really should read @Syrion and @Stu's "answer" (Joel's Unicode rant link) about unicode characters. You're "seeing" a placeholder representation of a character your application doesn't have a font representation for. It is not actually a unique character itself. Get educated on it, you'll figure it out.
0

The same thing happened to me when I was converting a CLOB to a String using getAsciiStream.

I solved it by reading the CLOB through getCharacterStream:

import java.io.Reader;
import java.io.StringWriter;
import java.sql.Clob;

public String getstringfromclob(Clob cl)
{
    StringWriter write = new StringWriter();
    // Read characters, not bytes, so non-ASCII text is not mangled
    try (Reader read = cl.getCharacterStream())
    {
        int c;
        while ((c = read.read()) != -1)
        {
            write.write(c);
        }
    }
    catch (Exception ec)
    {
        ec.printStackTrace();
    }
    return write.toString();
}

1 Comment

You can use regex [^\w\s<='">\.()/,#%&-:;!@\$*] to identify such symbol and then use string.replaceAll()
0

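To keep only English letters, digits, punctuation and Chinese characters (characters outside printable ASCII and the listed ranges, including spaces, are removed):

str = str.replaceAll("[^!-~\\u2000-\\uFE1F\\uFF00-\\uFFEF]", "");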

0

Most probably the text that you got was encoded in something other than UTF-8. What you could do is to not allow text with other encodings (for example Latin-1) to be uploaded:

try {

  CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
  charsetDecoder.onMalformedInput(CodingErrorAction.REPORT);

  return IOUtils.toString(new InputStreamReader(new FileInputStream(filePath), charsetDecoder));
}
catch (MalformedInputException e) {
  // throw an exception saying the file was not saved with UTF-8 encoding.
}
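The same strict-decoding idea can be sketched with only the standard library, working on a byte array instead of a file. StrictUtf8, decodeStrict and isValidUtf8 are names chosen for this example, and treating any decoding error as "not UTF-8" is an assumption about what you want to reject.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Decodes bytes as UTF-8 and throws instead of inserting U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience check: true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8("h\u00E9llo".getBytes(StandardCharsets.UTF_8))); // true
        System.out.println(isValidUtf8(new byte[] {'h', (byte) 0xE9}));                 // false
    }
}
```

Rejecting bad input at the boundary means no U+FFFD ever enters your strings, so nothing has to be stripped later.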

-3

You can't because strings are immutable.

It is possible, though, to make a new string that has the unwanted characters removed. Look up String#replaceAll().
