I want to detect and remove high-ASCII characters like ®, ©, ™ from a String in Java. Is there any open-source library that can do this?
4 Answers
If you need to remove all non-US-ASCII (i.e. outside 0x0-0x7F) characters, you can do something like this:
s = s.replaceAll("[^\\x00-\\x7f]", "");
If you need to filter many strings, it would be better to use a precompiled pattern:
private static final Pattern nonASCII = Pattern.compile("[^\\x00-\\x7f]");
...
s = nonASCII.matcher(s).replaceAll("");
And if it's really performance-critical, perhaps Alex Nikolaenkov's suggestion would be better.
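Since the question asks about detecting as well as removing, here is a minimal sketch that puts the precompiled pattern to both uses. The class name AsciiRegexFilter and its method names are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Pattern;

final class AsciiRegexFilter {
    // Matches any char outside the US-ASCII range 0x00-0x7F.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7f]");

    // Returns true if the string contains at least one non-ASCII char.
    static boolean containsNonAscii(String s) {
        return NON_ASCII.matcher(s).find();
    }

    // Returns a copy of the string with every non-ASCII char removed.
    static String stripNonAscii(String s) {
        return NON_ASCII.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // "Acme(R) (C) 2010 Foo(TM)" written with Unicode escapes.
        String s = "Acme\u00ae \u00a9 2010 Foo\u2122";
        System.out.println(containsNonAscii(s)); // true
        System.out.println(stripNonAscii(s));    // Acme  2010 Foo (two spaces remain)
    }
}
```

Only the matched characters are deleted, so whitespace around them stays; collapse it separately if that matters.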
9 Comments
s.replaceAll("[^\\x00-\\x7f£]", ""); should work. Thanks all!
I think you can easily filter your string by hand, checking the code of each character. If a character fits your requirements, append it to a StringBuilder, and call toString() on the builder at the end.
public static String filter(String str) {
    StringBuilder filtered = new StringBuilder(str.length());
    for (int i = 0; i < str.length(); i++) {
        char current = str.charAt(i);
        // Keep only printable ASCII (space through tilde); control characters are dropped too.
        if (current >= 0x20 && current <= 0x7e) {
            filtered.append(current);
        }
    }
    return filtered.toString();
}
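Note that the 0x20 to 0x7e range check keeps printable ASCII only, so tabs and line breaks are removed along with the non-ASCII characters. If you want to keep those, a variant along these lines might do. This is a sketch; the class and method names are hypothetical, and the set of whitespace to keep is an assumption you should adjust.

```java
final class PrintableAsciiFilter {
    // Keeps printable ASCII plus tab, line feed and carriage return.
    static String filterKeepingWhitespace(String str) {
        StringBuilder filtered = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char current = str.charAt(i);
            boolean printable = current >= 0x20 && current <= 0x7e;
            boolean whitespace = current == '\t' || current == '\n' || current == '\r';
            if (printable || whitespace) {
                filtered.append(current);
            }
        }
        return filtered.toString();
    }

    public static void main(String[] args) {
        // The trademark sign is removed, the tab and newline survive.
        String result = filterKeepingWhitespace("Foo\u2122\tbar\n");
        System.out.println(result.equals("Foo\tbar\n")); // true
    }
}
```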
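A nice way to do this is to use Google Guava's CharMatcher. Recent Guava versions expose the matcher through the CharMatcher.ascii() method; older versions used the CharMatcher.ASCII constant instead:
import com.google.common.base.CharMatcher;
String newString = CharMatcher.ascii().retainFrom(string);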
newString will contain only the ASCII characters (code point < 128) from the original string.
This reads more naturally than a regular expression. Regular expressions can take more effort to understand for subsequent readers of your code.
I understand that you need to delete characters such as ç, ã, Ã, but for anyone who needs to convert ç, ã, Ã to c, a, A instead, have a look at this piece of code:
Example Code:
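import java.text.Normalizer;
...
final String input = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";
System.out.println(
    Normalizer
        // NFD splits each accented letter into a base letter plus combining marks
        .normalize(input, Normalizer.Form.NFD)
        // the combining marks are not ASCII, so this strips them and leaves the base letters
        .replaceAll("[^\\p{ASCII}]", "")
);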
Output:
This is a funky String
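One caveat with this approach: it only helps for letters that Unicode decomposes into a base letter plus combining marks. Letters such as Ł or ß have no such decomposition, so they are deleted rather than converted. A small sketch that shows both cases (the class name AccentStripper and the method toAscii are made up for illustration):

```java
import java.text.Normalizer;

final class AccentStripper {
    // Decomposes accented letters (NFD), then drops everything that is not ASCII.
    static String toAscii(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // c-cedilla, a-tilde, A-tilde: all decompose, so the base letters survive.
        System.out.println(toAscii("\u00e7\u00e3\u00c3"));     // caA
        // L-stroke has no decomposition and is dropped; o-acute and z-acute decompose.
        System.out.println(toAscii("\u0141\u00f3d\u017a"));    // odz
        // Sharp s has no canonical decomposition and is dropped.
        System.out.println(toAscii("Stra\u00dfe"));            // Strae
    }
}
```

If you need such letters mapped to something (for example ß to ss), you have to replace them yourself before or after normalizing.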