12

I want to detect and remove high-ASCII characters like ®, ©, ™ from a String in Java. Is there any open-source library that can do this?


4 Answers

35

If you need to remove all non-US-ASCII characters (i.e. anything outside the range 0x00-0x7F), you can do something like this:

s = s.replaceAll("[^\\x00-\\x7f]", "");

If you need to filter many strings, it would be better to use a precompiled pattern:

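// requires: import java.util.regex.Pattern;
private static final Pattern nonASCII = Pattern.compile("[^\\x00-\\x7f]");
...
// Matcher.replaceAll needs the replacement string; "" deletes every match
s = nonASCII.matcher(s).replaceAll("");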

And if it's really performance-critical, perhaps Alex Nikolaenkov's suggestion would be better.
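To make the precompiled-pattern version concrete, here is a minimal self-contained sketch. The class name AsciiFilter and the method name stripNonAscii are made up for illustration; only Pattern and Matcher come from the JDK. The sample string uses Unicode escapes so the source file itself stays plain ASCII.

```java
import java.util.regex.Pattern;

class AsciiFilter {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7f]");

    // Hypothetical helper: deletes every character outside 0x00-0x7F.
    static String stripNonAscii(String s) {
        return NON_ASCII.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // \u00AE is the registered sign, \u2122 the trademark sign.
        System.out.println(stripNonAscii("Acme\u00AE Widget\u2122 $5")); // prints "Acme Widget $5"
    }
}
```

Note that $ survives because it is plain ASCII, while the pound sign (U+00A3) would be removed.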


9 Comments

Are Type 1 High ASCII characters the same as High ASCII characters? Would the above regex also remove symbols like $ and the pound sign?
Be careful if you want to filter a lot of strings this way. s.replaceAll compiles the pattern on each call and creates a new String object behind the scenes.
@Jitendra: It removes all characters that are not in the ASCII table.
@axtavt Is it possible to modify the above regex to retain certain characters? For example, I want to keep the £ sign in the string.
I am really new to regexes, but I found it after a little experimenting: s.replaceAll("[^\\x00-\\x7f£]", ""); should work. Thanks, all!
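The whitelist idea from the comments above can be sketched as follows. This is only an illustration: the class name AsciiPlusPound and the method keepAsciiAndPound are hypothetical, and the pound sign is written as \\xA3 inside the regex so that the result does not depend on the encoding of the source file.

```java
import java.util.regex.Pattern;

class AsciiPlusPound {
    // Negated class: anything that is neither ASCII nor U+00A3 (pound sign) matches.
    private static final Pattern DISALLOWED = Pattern.compile("[^\\x00-\\x7f\\xA3]");

    // Hypothetical helper: strips non-ASCII characters but keeps the pound sign.
    static String keepAsciiAndPound(String s) {
        return DISALLOWED.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // \u00A3 = pound sign, \u20AC = euro sign, \u00A9 = copyright sign
        String cleaned = keepAsciiAndPound("\u00A35 \u20AC5 \u00A9");
        System.out.println(cleaned.equals("\u00A35 5 ")); // prints "true"
    }
}
```

More characters can be retained by adding them to the same character class.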
16

I think you can easily filter the string by hand: check the code of each character and, if it meets your requirements, append it to a StringBuilder, then call toString() at the end.

public static String filter(String str) {
    StringBuilder filtered = new StringBuilder(str.length());
    for (int i = 0; i < str.length(); i++) {
        char current = str.charAt(i);
        // Keep printable ASCII only (space through '~'); control characters
        // such as '\n' and '\t' fall below 0x20 and are dropped as well.
        if (current >= 0x20 && current <= 0x7e) {
            filtered.append(current);
        }
    }

    return filtered.toString();
}

2 Comments

Could you please explain in a little more detail what you mean by filtering the string by hand and checking the code of a particular character? Did you mean the way of filtering shown above?
This seems to work great, except that it removes newlines for me, and neither of these works: if (current >= 0x00 && current <= 0x7e) or if (current == '\n' || (...) ), which is super weird!
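Regarding the newline problem: the range 0x20-0x7e deliberately excludes control characters, so '\n' (0x0A) is dropped by the original condition. A minimal sketch of a variant that keeps common whitespace controls is below; the class name PrintableAsciiFilter and the choice of which controls to keep ('\n', '\r', '\t') are assumptions for illustration. If newlines still vanish with a condition like this, it may be worth checking whether they are lost somewhere else, for example when the text is read or displayed.

```java
class PrintableAsciiFilter {
    // Keeps printable ASCII plus newline, carriage return and tab.
    static String filter(String str) {
        StringBuilder filtered = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char current = str.charAt(i);
            boolean printable = current >= 0x20 && current <= 0x7e;
            boolean lineWhitespace = current == '\n' || current == '\r' || current == '\t';
            if (printable || lineWhitespace) {
                filtered.append(current);
            }
        }
        return filtered.toString();
    }

    public static void main(String[] args) {
        // \u00E9 is e with acute accent; the trailing newline must survive.
        System.out.println(filter("caf\u00E9\n").equals("caf\n")); // prints "true"
    }
}
```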
5

A nice way to do this is to use Google Guava CharMatcher:

// Guava 19+; earlier releases exposed this as the constant CharMatcher.ASCII
String newString = CharMatcher.ascii().retainFrom(string);

newString will contain only the ASCII characters (code point < 128) from the original string.

This reads more naturally than a regular expression. Regular expressions can take more effort to understand for subsequent readers of your code.

4 Comments

They can, but axtavt's answer above is simple and can be made readable with a short comment explaining what's happening. The regex in his answer isn't hard to decode at all. Your answer relies on a library that has to be downloaded and set up as a dependency, which is much more work than axtavt's answer.
Any Java project should include this library anyway. It will save you a lot of work in the long run. Sometimes you have to do a bit of work up front to save more effort later. :)
You may be right about this Java library being useful (it looks pretty good), but alas it does not answer the question as well as the Pattern answer.
That depends on your definition of "best". Anyway, I can't convince you; you should use Google Guava wherever you can and let it convince you.
5

I understand that you need to delete characters such as ç, ã, Ã, but for anybody who needs to convert ç, ã, Ã ---> c, a, A, please have a look at this piece of code:

Example Code:

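// requires: import java.text.Normalizer;
final String input = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";
System.out.println(
    Normalizer
        // NFD splits each accented letter into a base letter plus combining marks
        .normalize(input, Normalizer.Form.NFD)
        // the combining marks are non-ASCII, so this strips them and keeps the base letters
        .replaceAll("[^\\p{ASCII}]", "")
);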

Output:

This is a funky String
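One caveat worth illustrating: this trick only converts characters that have a canonical decomposition into a base letter plus combining marks. Letters without one, such as Ł (U+0141), are simply deleted rather than turned into an ASCII look-alike. The sketch below shows both cases; the class name AccentFolding and the method foldToAscii are made up for the example, and the inputs use Unicode escapes to keep the source file ASCII-only.

```java
import java.text.Normalizer;

class AccentFolding {
    // Hypothetical helper: decompose, then drop everything that is not ASCII.
    static String foldToAscii(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // c with cedilla, a with tilde, A with tilde: all decompose cleanly.
        System.out.println(foldToAscii("\u00E7\u00E3\u00C3")); // prints "caA"
        // L with stroke has no decomposition and is removed;
        // o with acute and z with acute lose only their accents.
        System.out.println(foldToAscii("\u0141\u00F3d\u017A")); // prints "odz"
    }
}
```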
