How to filter string for unwanted characters using regex?

Question

Basically, I am wondering if there is a handy class or method to filter a String for unwanted characters. The output of the method should be the 'cleaned' String. For example:

String dirtyString = "This contains spaces which are not allowed";

String result = cleaner.getCleanedString(dirtyString);

The expected result would be:

"Thiscontainsspaceswhicharenotallowed"

A better example:

String reallyDirty = " this*is#a*&very_dirty&String";

String result = cleaner.getCleanedString(reallyDirty);

I expect the result to be:

"thisisaverydirtyString"

Because I let the cleaner know that ' ', '*', '#', '&' and '_' are dirty characters. I can solve it by using a white/black list array of chars. But I don't want to re-invent the wheel.

I was wondering if there is already such a thing that can 'clean' strings using a regex instead of writing this myself.

Addition:

If you think cleaning a String could be done differently or better than what I describe, I am all ears as well, of course.

Another addition:

It is not only for spaces, but for any kind of 'dirty' characters.

So "unwanted characters" really means anything that is not a-z or 0-9? I updated my answer, but it is still unclear what counts as a dirty character and what counts as a clean one.
– jzd, Commented Feb 9, 2011 at 14:32

jzd · Accepted Answer · 2011-02-09 14:31:03Z

56

Edited based on your update:

String result = dirtyString.replaceAll("[^a-zA-Z0-9]", "");
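As a rough sketch of how that one-liner could be wrapped for reuse: the class name AlphanumericCleaner and the method clean are hypothetical, chosen only for illustration. The pattern is compiled once, and the cleaned text is the return value because Java strings are immutable.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class AlphanumericCleaner {
    // Matches every character that is NOT an ASCII letter or digit.
    private static final Pattern NOT_ALPHANUMERIC = Pattern.compile("[^a-zA-Z0-9]");

    static String clean(String dirty) {
        // replaceAll returns a new String; the argument itself is never modified.
        return NOT_ALPHANUMERIC.matcher(dirty).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean(" this*is#a*&very_dirty&String")); // thisisaverydirtyString
        System.out.println(clean("This contains spaces which are not allowed"));
    }
}
```

Note that this whitelist keeps digits as well as letters; drop 0-9 from the class if only letters count as clean.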

edited Feb 9, 2011 at 14:31

answered Feb 9, 2011 at 13:50

jzd


3 Comments

jzd Over a year ago

@Tim, the definition of a clean vs dirty character from Stefan has been a little unclear. Originally it was just spaces.

Stefan Hendriks Over a year ago

Clean letters are just a-z and A-Z, no special chars. Looks like replaceAll will be what I need.

ieselisra Over a year ago

--SOLVED, silly fail-- The method doesn't modify the String; it returns the modified one: String temp2 = tempString.replaceAll(",", ".");

Franklin Yu · Accepted Answer · 2018-10-10 03:25:21Z

14

If you're using Guava in your project (and if you're not, I believe you should consider it), the CharMatcher class handles this very nicely. The snippets below use the factory methods (CharMatcher.whitespace(), CharMatcher.ascii(), CharMatcher.javaLetter()) that replaced the older constants such as CharMatcher.WHITESPACE in later Guava releases; check the Javadoc for your Guava version, since some of these matchers have been deprecated over time.

Your first example might be:

result = CharMatcher.whitespace().removeFrom(dirtyString);

while your second might be:

result = CharMatcher.anyOf(" *#&_").removeFrom(dirtyString);
// or alternatively
result = CharMatcher.noneOf(" *#&_").retainFrom(dirtyString);

or if you want to be more flexible with whitespace (tabs etc.), you can combine them rather than writing your own:

CharMatcher illegal = CharMatcher.whitespace().or(CharMatcher.anyOf("*#&_"));
result = illegal.removeFrom(dirtyString);

or you might instead specify legal characters, which depending on your requirements might be one of:

CharMatcher legal = CharMatcher.javaLetter(); // based on Unicode char class
CharMatcher legal = CharMatcher.ascii().and(CharMatcher.javaLetter()); // only letters which are also ASCII, as in your examples
CharMatcher legal = CharMatcher.inRange('a', 'z'); // lowercase only
CharMatcher legal = CharMatcher.inRange('a', 'z').or(CharMatcher.inRange('A', 'Z')); // either case

followed by legal.retainFrom(dirtyString) as above.

Very nice, powerful API.

edited Oct 10, 2018 at 3:25

Franklin Yu

answered Feb 9, 2011 at 23:46

Cowan

1 Comment

Bernhard Döbler Over a year ago

Links broken :(

Nicolas · Accepted Answer · 2011-02-09 13:51:29Z

9

Use replaceAll.
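For the blacklist case from the question, a minimal sketch might build a regex character class from the characters you consider dirty. The class name BlacklistCleaner is an assumption for illustration; getCleanedString mirrors the method name used in the question. Every character that is not a letter or digit is backslash-escaped so that regex metacharacters such as ^, - and ] are treated literally.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class BlacklistCleaner {
    private final Pattern dirtyPattern;

    BlacklistCleaner(String dirtyChars) {
        if (dirtyChars.isEmpty()) {
            throw new IllegalArgumentException("at least one dirty character is required");
        }
        StringBuilder charClass = new StringBuilder("[");
        dirtyChars.codePoints().forEach(cp -> {
            // Escape anything that is not a letter or digit, so metacharacters are literal.
            if (!Character.isLetterOrDigit(cp)) {
                charClass.append('\\');
            }
            charClass.appendCodePoint(cp);
        });
        dirtyPattern = Pattern.compile(charClass.append(']').toString());
    }

    String getCleanedString(String dirty) {
        // Returns a new String with every blacklisted character removed.
        return dirtyPattern.matcher(dirty).replaceAll("");
    }

    public static void main(String[] args) {
        BlacklistCleaner cleaner = new BlacklistCleaner(" *#&_");
        System.out.println(cleaner.getCleanedString(" this*is#a*&very_dirty&String")); // thisisaverydirtyString
    }
}
```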

answered Feb 9, 2011 at 13:51

Nicolas


trojanfoe · Accepted Answer · 2011-02-09 14:03:43Z

7

This will do it:

String dirtyString = "This contains spaces which are not allowed";
String result = dirtyString.replaceAll("\\s", "");

and works by replacing all whitespace with 'nothing'.
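One caveat worth illustrating: by default, Java's \s only covers the ASCII whitespace characters (space, tab, newline, vertical tab, form feed, carriage return). The sketch below, with the hypothetical class name WhitespaceDemo, contrasts that with the embedded (?U) flag, which switches the predefined classes to their Unicode definitions so that a no-break space is removed too.

```java
// Hypothetical demo class, not part of any library.
final class WhitespaceDemo {
    // Default \s: only space, tab, newline, vertical tab, form feed, carriage return.
    static String stripAsciiWhitespace(String s) {
        return s.replaceAll("\\s", "");
    }

    // (?U) turns on UNICODE_CHARACTER_CLASS, so \s follows the Unicode White_Space property.
    static String stripUnicodeWhitespace(String s) {
        return s.replaceAll("(?U)\\s", "");
    }

    public static void main(String[] args) {
        String input = "a b\tc\u00A0d"; // contains a no-break space (U+00A0) before the d
        System.out.println(stripAsciiWhitespace(input).length());   // 5, the no-break space survives
        System.out.println(stripUnicodeWhitespace(input).length()); // 4
    }
}
```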

edited Feb 9, 2011 at 14:03

answered Feb 9, 2011 at 13:50

trojanfoe


Tim Pietzcker · Accepted Answer · 2011-02-09 14:40:14Z

6

String resultString = subjectString.replaceAll("\\P{L}+", "");

will replace any non-letter characters with nothing.
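To make the difference from an ASCII-only class concrete, here is a small sketch (the class and method names are hypothetical). \P{L} is the negation of the Unicode letter category, so accented letters survive, whereas [^a-zA-Z] strips them. The non-ASCII letters are written as Unicode escapes only to keep the source encoding-independent.

```java
// Hypothetical demo class, not part of any library.
final class LetterFilterDemo {
    // Keeps every character in the Unicode "Letter" category.
    static String keepUnicodeLetters(String s) {
        return s.replaceAll("\\P{L}+", "");
    }

    // Keeps only the 52 ASCII letters.
    static String keepAsciiLetters(String s) {
        return s.replaceAll("[^a-zA-Z]+", "");
    }

    public static void main(String[] args) {
        // The German word for "greetings" with u-umlaut and sharp s, then "Zoe" with e-diaeresis.
        String input = "Gr\u00fc\u00dfe, 123 Zo\u00eb!";
        System.out.println(keepUnicodeLetters(input)); // umlaut, sharp s and diaeresis are kept
        System.out.println(keepAsciiLetters(input));   // GreZo
    }
}
```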

answered Feb 9, 2011 at 14:40

Tim Pietzcker


1 Comment

dbaltor Over a year ago

Clever answer using the regex pattern for the whole category of Unicode letters.

Robert Fornesdale · Accepted Answer · 2018-05-08 08:44:06Z

0

I also prefer the whitelisting approach. You never know what will come along; there seem to be more encodings than characters. This way you control everything:

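// StringUtils comes from Apache Commons Lang (org.apache.commons.lang3.StringUtils).
public String convert(String s) {
  // Whitelist: remove every character that is not listed inside the negated class [^...].
  // The quote characters ' and ` belong inside the class, before the closing bracket.
  s = StringUtils.removePattern(s, "[^A-Za-zäöüÄÖÜß?!$,. 0-9\\-\\+\\*\\?=&%\\$§\"\\!\\^#:;,_²³°\\[\\]\\{\\}<>\\|~'`]");
  return s.trim();
}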

This covers the German umlauts and the usual keyboard punctuation (just look at your keyboard; I think I picked them all). Accented letters used in French, such as é or è, would still have to be added to the character class. Feel free to omit special chars like < > to prevent code injection.

answered May 8, 2018 at 8:44

Robert Fornesdale


Basil Bourque · Accepted Answer · 2022-08-22 21:21:17Z

Filter code points

Regex is not the only avenue to your goal. You can get the code point integer number for each character in your string, then filter out those not considered a letter in Unicode.

The String#codePoints method returns an IntStream, a stream of int primitive values, one per character.

The Character class can tell us if the character assigned to each of those code point numbers in Unicode is considered a letter, as opposed to whitespace, digits, punctuation, and so on.

Those code points passing our test are converted back to a String by way of the StringBuilder class.

String input = " this*is#a*&very_dirty&String" ; 
String onlyLetters = 
        input 
        .codePoints()
        .filter(
            codePoint -> Character.isLetter( codePoint ) 
        )
        .collect(               
            StringBuilder :: new ,        
            StringBuilder :: appendCodePoint , 
            StringBuilder :: append    
        )        
        .toString() 
;

See this code run live at Ideone.com.

thisisaverydirtyString
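If the definition of "clean" may change, the same stream pipeline could be generalized to accept any test on code points. This is only a sketch; the class name CodePointFilter and the method retain are hypothetical.

```java
import java.util.function.IntPredicate;

// Hypothetical helper, not part of any library.
final class CodePointFilter {
    // Keeps only the code points for which the predicate returns true.
    static String retain(String input, IntPredicate keep) {
        return input.codePoints()
                .filter(keep)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        String input = " this*is#a*&very_dirty&String";
        System.out.println(retain(input, Character::isLetter));               // thisisaverydirtyString
        System.out.println(retain("room 101, floor 2", Character::isLetterOrDigit)); // room101floor2
    }
}
```

Passing the rule as an IntPredicate keeps the filtering code in one place while letting callers choose between letters only, letters and digits, or a custom whitelist.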
