Java - removing strange characters from a String

Question

How do I remove strange and unwanted Unicode characters (such as a black diamond with question mark) from a String?

Updated:

Please tell me the Unicode character or regex that corresponds to "a black diamond with a question mark in it".

Maybe you want to use the right encoding of the text instead?
– Alp, Commented Mar 28, 2011 at 17:29
Possible duplicate of Need help removing strange characters from string
– Saurabh Gokhale, Commented Mar 28, 2011 at 17:30

asthasr · Accepted Answer · 2011-03-28 18:44:54Z

21

A black diamond with a question mark is not a character from your original text; it's a placeholder. It is defined as U+FFFD (�), the Unicode replacement character, and it is shown when a character cannot be decoded or displayed, for example when a glyph in the string is missing from the font you're using or when the bytes were read with the wrong encoding. Its exact appearance varies with the font.
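One common way U+FFFD ends up inside a Java String is decoding bytes with a charset other than the one they were written in. The sketch below reproduces that and then strips the placeholder; the class and method names are made up for illustration, and the sample text is an assumption.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementCharDemo {
    // Removes every U+FFFD (REPLACEMENT CHARACTER) from the input.
    static String stripReplacementChar(String s) {
        return s.replace("\uFFFD", "");
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the é becomes the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        // Decoding those bytes as UTF-8 is wrong; the invalid byte is replaced by U+FFFD.
        String garbled = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(garbled.contains("\uFFFD"));
        System.out.println(stripReplacementChar(garbled));
    }
}
```

Stripping the placeholder only hides the symptom: the original character is already lost, so reading the data with the correct charset is usually the better fix.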

You can use java.text.Normalizer to decompose accented characters into a base letter plus combining marks, and then strip everything that is not in the "normal" ASCII character set.
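A minimal sketch of that idea, assuming you want accented letters folded to their ASCII base letter and everything else outside ASCII dropped. The class and method names are hypothetical.

```java
import java.text.Normalizer;

final class AsciiFold {
    // Decomposes characters (NFD) so that, e.g., é becomes e + a combining accent,
    // then deletes every character outside the ASCII range.
    static String toAscii(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("caf\u00e9 na\u00efve"));
        System.out.println(toAscii("a\uFFFDb"));
    }
}
```

Characters that have no ASCII base letter, such as CJK text or U+FFFD itself, are simply removed by this approach.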

edited Mar 28, 2011 at 18:44

answered Mar 28, 2011 at 17:31

asthasr


4 Comments

user224270 Over a year ago

This sounds interesting and could lead to the ideal solution I need. Please, when you get a chance, could you give me an example of this?

asthasr Over a year ago

I will expand on this answer this afternoon.

user224270 Over a year ago

Tried @Chris's help, didn't help either :(

user224270 Over a year ago

The strange character is still not removed from the string.

Peter Lawrey · Accepted Answer · 2011-03-28 17:56:16Z

20

You can use String.replaceAll("[my-list-of-strange-and-unwanted-chars]","")

There is no Character.isStrangeAndUnWanted(); you have to define what you want to remove.

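If you want to remove control characters, you can do

String str = "\u0000\u001f hi \n";
str = str.replaceAll("[\u0000-\u001f]", "");
System.out.println("[" + str + "]");

This prints [ hi ]: both spaces are kept, while the trailing newline is removed because \n (U+000A) falls inside the range.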

EDIT: If you want to know the Unicode value of any 16-bit char, you can do

int num = string.charAt(n);
System.out.println(num);
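To identify every mystery character in a string at once, the same idea can be extended to print each code point in the usual U+XXXX notation. This is only a sketch; the class and method names are invented for illustration. It uses String.codePoints() (Java 8+), so characters outside the 16-bit range are reported as one code point rather than two surrogates.

```java
final class CodePointDump {
    // Returns the code points of s in U+XXXX form, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // A letter followed by the replacement character.
        System.out.println(describe("a\uFFFD"));
    }
}
```

Once the output shows which code point is the unwanted one, it can be passed to replace() or used in a character class.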

edited Mar 28, 2011 at 17:56

answered Mar 28, 2011 at 17:29

Peter Lawrey


9 Comments

Armand Over a year ago

There is no Character.isStrangeAndUnWanted() - Ruby has it :p

Trufa Over a year ago

download.oracle.com/javase/1.4.2/docs/api/java/lang/… bummer!

user224270 Over a year ago

I want to know which Unicode character corresponds to "a black diamond with a question mark in it".

Peter Lawrey Over a year ago

googling your text finds en.wikipedia.org/wiki/Unicode_Specials#Replacement_character

Peter Lawrey Over a year ago

It does if you are trying to remove U+FFFD ;) perhaps try replaceAll("\uFFFD", ""); instead.


BurgerZ · Accepted Answer · 2014-04-18 07:44:42Z

7

To delete non-ASCII symbols (anything outside the range \x00-\x7F) from the string, I use the following code:

String s = "小米体验版 latin string 01234567890";
s = s.replaceAll("[^\\x00-\\x7F]", "");

The output string will be: " latin string 01234567890"

answered Apr 18, 2014 at 7:44

BurgerZ


codekiln · Accepted Answer · 2011-06-14 19:13:32Z

4

Justin Thomas's was close, but this is probably closer to what you're looking for:

String nonStrange = strangeString.replaceAll("\\p{Cntrl}", "");

The selector \p{Cntrl} selects "A control character: [\x00-\x1F\x7F]."
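Note that \p{Cntrl} also matches tab, carriage return, and line feed. If those should survive, one option is a character-class intersection, sketched below under the assumption that only \t, \r, and \n are worth keeping. The class and method names are hypothetical.

```java
final class ControlCharFilter {
    // Removes control characters but keeps tab, carriage return and line feed.
    // The && operator intersects the control class with "anything except those three".
    static String stripControlsKeepWhitespace(String s) {
        return s.replaceAll("[\\p{Cntrl}&&[^\\r\\n\\t]]", "");
    }

    public static void main(String[] args) {
        String cleaned = stripControlsKeepWhitespace("a\u0000b\tc\u007Fd");
        System.out.println(cleaned);
    }
}
```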

answered Jun 14, 2011 at 19:13

codekiln


John Tribe · Accepted Answer · 2020-09-21 19:39:17Z

3

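I did it the other way around: I remove every character that is not on an explicit whitelist (the ^ negates the character class):

str.replaceAll("[^a-zA-Z0-9:;.?! ]","")

So for a string like "小米体验版 latin string 01234567890" we get " latin string 01234567890" (the leading space is kept because the space is on the whitelist).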

answered Sep 21, 2020 at 19:39

John Tribe


Mike Atlas · Accepted Answer · 2011-03-28 17:32:33Z

2

Use String.replaceAll():

String clean = "♠clean".replaceAll("♠", "");

edited Mar 28, 2011 at 17:32

answered Mar 28, 2011 at 17:31

Mike Atlas


z.eljayyo · Accepted Answer · 2011-03-28 17:42:47Z

0

Put the characters that you want to get rid of in a list, then iterate through the list and remove each one:

import java.util.Arrays;
import java.util.List;

String str = "Some text with unicode !@#$";
List<String> badChars = Arrays.asList("@", "~", "!"); // modify this to contain the unwanted characters

String resultStr = str;
for (String s : badChars) {
    // replace() treats s literally, so regex metacharacters need no escaping
    resultStr = resultStr.replace(s, "");
}

You will end up with the cleaned string in resultStr. I haven't tested this, but it should be along these lines.

answered Mar 28, 2011 at 17:42

z.eljayyo


3 Comments

user224270 Over a year ago

I can't replace anything if I don't know the corresponding word or letter for "a black diamond with question mark in it". If you know what it is, please tell me.

Mike Atlas Over a year ago

@user224270 you really should read @Syrion and @Stu's "answer" (Joel's Unicode rant link) about unicode characters. You're "seeing" a placeholder representation of a character your application doesn't have a font representation for. It is not actually a unique character itself. Get educated on it, you'll figure it out.

z.eljayyo Over a year ago

stackoverflow.com/questions/1611979/…

vinay · Accepted Answer · 2014-04-08 11:31:37Z

0

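The same happened to me when I was converting a CLOB to a String using getAsciiStream.

I solved it by reading the CLOB through getCharacterStream() instead:

import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;
import java.sql.Clob;
import java.sql.SQLException;

public String getstringfromclob(Clob cl) {
    StringWriter write = new StringWriter();
    // getCharacterStream() returns decoded characters; the reader is closed automatically
    try (Reader read = cl.getCharacterStream()) {
        int c;
        while ((c = read.read()) != -1) {
            write.write(c);
        }
    } catch (SQLException | IOException ec) {
        ec.printStackTrace();
    }
    return write.toString();
}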

answered Apr 8, 2014 at 11:31

vinay


1 Comment

vinay Over a year ago

You can use regex [^\w\s<='">\.()/,#%&-:;!@\$*] to identify such symbol and then use string.replaceAll()

Jiajun Shen · Accepted Answer · 2017-09-13 10:15:12Z

0

Keep only English letters, Chinese characters, digits, and punctuation, and remove everything else (including spaces):

str = str.replaceAll("[^!-~\\u2000-\\uFE1F\\uFF00-\\uFFEF]", "");

answered Sep 13, 2017 at 10:15

Jiajun Shen


mihai_f87 · Accepted Answer · 2020-05-25 08:11:48Z

0

Most probably the text that you got was encoded in something other than UTF-8. What you could do is to not allow text with other encodings (for example Latin-1) to be uploaded:

try {

  CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
  charsetDecoder.onMalformedInput(CodingErrorAction.REPORT);

  return IOUtils.toString(new InputStreamReader(new FileInputStream(filePath), charsetDecoder));
}
catch (MalformedInputException e) {
  // throw an exception saying the file was not saved with UTF-8 encoding.
}
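The same strict check can be done with the JDK alone, without IOUtils, when the bytes are already in memory. The following is a sketch under that assumption; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Decodes bytes as UTF-8 and throws instead of silently inserting U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience wrapper: true if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] bad = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

Rejecting malformed input up front means the replacement character never reaches the String, so there is nothing to strip afterwards.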

answered May 25, 2020 at 8:11

mihai_f87


Ingo · Accepted Answer · 2011-03-28 17:30:20Z

-3

You can't because strings are immutable.

It is possible, though, to make a new string that has the unwanted characters removed. Look up String#replaceAll().

answered Mar 28, 2011 at 17:30

Ingo
