38

Basically, I am wondering if there is a handy class or method to filter a String for unwanted characters. The output of the method should be the 'cleaned' String. For example:

String dirtyString = "This contains spaces which are not allowed";

String result = cleaner.getCleanedString(dirtyString);

The expected result would be:

"Thiscontainsspaceswhicharenotallowed"

A better example:

String reallyDirty = " this*is#a*&very_dirty&String";

String result = cleaner.getCleanedString(reallyDirty);

I expect the result to be:

"thisisaverydirtyString"

Because I let the cleaner know that ' ', '*', '#', '&' and '_' are dirty characters. I can solve it by using a white/black list array of chars. But I don't want to re-invent the wheel.

I was wondering if there is already such a thing that can 'clean' strings using a regex instead of writing this myself.

Addition:

  • If you think cleaning a String could be done differently or better, I am all ears, of course.

Another addition:

  • It is not only for spaces, but for any kind of 'dirty' characters.
1
  • So really, an unwanted character means anything that is not a-z or 0-9? I updated my answer, but it is still unclear what is a dirty character and what is a clean one. Commented Feb 9, 2011 at 14:32

7 Answers 7

56

Edited based on your update:

dirtyString.replaceAll("[^a-zA-Z0-9]","")
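
Note that replaceAll compiles the regex on every call, so if many strings need cleaning it may be worth compiling the pattern once. A minimal sketch, assuming the clean set really is ASCII letters and digits; the class name AlnumCleaner and the method clean are made up for illustration:

```java
import java.util.regex.Pattern;

final class AlnumCleaner {
    // Matches every character that is NOT an ASCII letter or digit.
    private static final Pattern DIRTY = Pattern.compile("[^a-zA-Z0-9]");

    // Returns a new String; the input itself is never modified.
    static String clean(String input) {
        return DIRTY.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "thisisaverydirtyString"
        System.out.println(clean(" this*is#a*&very_dirty&String"));
    }
}
```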

3 Comments

@Tim, the definition of a clean vs dirty character from Stefan has been a little unclear. Originally it was just spaces.
Clean letters are just a-z and A-Z, no special chars. Looks like replaceAll will be what I need.
--SOLVED, silly fail-- the method doesn't modify the String, it returns the modified one: String temp2 = tempString.replaceAll(",",".");
14

If you're using Guava on your project (and if you're not, I believe you should consider it), the CharMatcher class handles this very nicely:

Your first example might be:

result = CharMatcher.whitespace().removeFrom(dirtyString); // the WHITESPACE constant was removed in later Guava releases

while your second might be:

result = CharMatcher.anyOf(" *#&_").removeFrom(dirtyString);
// or alternatively
result = CharMatcher.noneOf(" *#&_").retainFrom(dirtyString);

or if you want to be more flexible with whitespace (tabs etc), you can combine them rather than writing your own:

CharMatcher illegal = CharMatcher.whitespace().or(CharMatcher.anyOf("*#&_"));
result = illegal.removeFrom(dirtyString);

or you might instead specify legal characters, which depending on your requirements might be:

// alternatives, pick one
CharMatcher legal = CharMatcher.javaLetter(); // based on Unicode char class
CharMatcher legal = CharMatcher.ascii().and(CharMatcher.javaLetter()); // only letters which are also ASCII, as your examples
CharMatcher legal = CharMatcher.inRange('a', 'z'); // lowercase only
CharMatcher legal = CharMatcher.inRange('a', 'z').or(CharMatcher.inRange('A', 'Z')); // either case

followed by retainFrom(dirtyString) as above.

Very nice, powerful API.

1 Comment

Links broken :(
9

Use replaceAll.
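
To make the set of dirty characters configurable, as the question asks, one option is to build the regex from the list of characters. This is only a sketch: the class name BlacklistCleaner is hypothetical, and getCleanedString simply mirrors the method name used in the question. Each character is passed through Pattern.quote so that metacharacters such as '*' are matched literally.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class BlacklistCleaner {
    private final Pattern dirty;

    BlacklistCleaner(String dirtyChars) {
        // Quote every character and join them into an alternation,
        // e.g. " *" becomes \Q \E|\Q*\E
        String regex = dirtyChars.codePoints()
                .mapToObj(cp -> Pattern.quote(new String(Character.toChars(cp))))
                .collect(Collectors.joining("|"));
        this.dirty = Pattern.compile(regex);
    }

    // Returns a copy of the input with all dirty characters removed.
    String getCleanedString(String input) {
        return dirty.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        BlacklistCleaner cleaner = new BlacklistCleaner(" *#&_");
        // prints "thisisaverydirtyString"
        System.out.println(cleaner.getCleanedString(" this*is#a*&very_dirty&String"));
    }
}
```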


7

This will do it:

String dirtyString = "This contains spaces which are not allowed";
String result = dirtyString.replaceAll("\\s", "");

and works by replacing all whitespace with 'nothing'.


6
String resultString = subjectString.replaceAll("\\P{L}+", "");

will replace any non-letter characters with nothing.
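
To see how this differs from an ASCII-only class such as [^a-zA-Z], here is a small comparison. It is just a sketch; the class name LetterFilterDemo and its method names are invented for the example, and the accented letters are written as Unicode escapes so the source encoding does not matter.

```java
final class LetterFilterDemo {
    // Keeps every Unicode letter, drops digits, punctuation and whitespace.
    static String unicodeLetters(String s) {
        return s.replaceAll("\\P{L}+", "");
    }

    // Keeps only ASCII letters, so accented letters are dropped too.
    static String asciiLetters(String s) {
        return s.replaceAll("[^a-zA-Z]", "");
    }

    // Keeps Unicode letters and numbers.
    static String unicodeLettersAndNumbers(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}]", "");
    }

    public static void main(String[] args) {
        String input = "M\u00fcller_42 caf\u00e9";
        System.out.println(unicodeLetters(input));           // Müllercafé
        System.out.println(asciiLetters(input));             // Mllercaf
        System.out.println(unicodeLettersAndNumbers(input)); // Müller42café
    }
}
```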

1 Comment

Clever answer using the regex pattern for the whole category of Unicode letters.
0

I also prefer the whitelisting approach. You never know what comes around. There seem to be more encodings than characters. This way you can control it all:

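public String convert(String s) {
  // StringUtils comes from Apache Commons Lang; the quote characters ' and ` belong inside the character class
  s = StringUtils.removePattern(s, "[^A-Za-zäöüÄÖÜß?!$,. 0-9\\-\\+\\*\\?=&%\\$§\"\\!\\^#:;,_²³°\\[\\]\\{\\}<>\\|~'`]");
  return s.trim();
}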

This contains all German umlauts and French accents and ... you know, just look at your keyboard. I think I picked them all. Feel free to omit special chars like < > to prevent code injection...


0

Filter code points

Regex is not the only avenue to your goal. You can get the code point integer number for each character in your string, then filter out those not considered a letter in Unicode.

The String#codePoints method returns an IntStream, a stream of int primitive values, one per character.

The Character class can tell us if the character assigned to each of those code point numbers in Unicode is considered a letter, as opposed to whitespace, digits, punctuation, and so on.

Those code points passing our test are converted back to a String by way of the StringBuilder class.

String input = " this*is#a*&very_dirty&String" ; 
String onlyLetters = 
        input 
        .codePoints()
        .filter(
            codePoint -> Character.isLetter( codePoint ) 
        )
        .collect(               
            StringBuilder :: new ,        
            StringBuilder :: appendCodePoint , 
            StringBuilder :: append    
        )        
        .toString() 
;

See this code run live at Ideone.com.

thisisaverydirtyString
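
The same pipeline can be generalized so the caller decides which code points count as clean. A rough sketch, where the class name CodePointFilter and the method retain are hypothetical names chosen for this example:

```java
import java.util.function.IntPredicate;

final class CodePointFilter {
    // Keeps only the code points for which the predicate returns true.
    static String retain(String input, IntPredicate keep) {
        return input.codePoints()
                .filter(keep)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        String input = " this*is#a*&very_dirty&String";
        // Whitelist: keep letters and digits.
        System.out.println(retain(input, cp -> Character.isLetterOrDigit(cp)));
        // Blacklist: drop the characters the question calls dirty.
        System.out.println(retain(input, cp -> " *#&_".indexOf(cp) < 0));
    }
}
```

Both calls print thisisaverydirtyString for this input.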
