49

How do I strip non-alphanumeric characters from a string, and lose the spaces too, in C# with Replace?

I want to keep a-z, A-Z, 0-9 and nothing more (not even " " spaces).

"Hello there(hello#)".Replace(regex-i-want, "");

should give

"Hellotherehello"

I have tried "Hello there(hello#)".Replace(@"[^A-Za-z0-9 ]", ""); but the spaces remain.

5
  • 1
    How about first defining what exactly you mean by alpha numeric? Do you just want A-Z,a-z,0-9? Unicode has plenty more letters and numbers. Commented Jan 8, 2012 at 16:36
  • 2
    With that edit, it looks much better - taking back my minus vote. Commented Jan 8, 2012 at 16:46
  • 1
    Why do you have a space in your bracket? And string.Replace doesn't take a regex in the first place. Commented Jan 8, 2012 at 17:04
  • 1
    Just to be absolutely clear: You don't want a letter like ä either? Commented Jan 8, 2012 at 17:12
  • I answered my question taking your tips into account (see below). Commented Jan 8, 2012 at 17:29

8 Answers

70

In your regex, you have excluded the spaces from being matched (and you haven't used Regex.Replace() which I had overlooked completely...):

// requires: using System.Text.RegularExpressions;
string result = Regex.Replace("Hello there(hello#)", @"[^A-Za-z0-9]+", "");

should work. The + makes the regex a bit more efficient by matching more than one consecutive non-alphanumeric character at once instead of one by one.

If you want to keep non-ASCII letters/digits, too, use the following regex:

@"[^\p{L}\p{N}]+"

which leaves

BonjourmesélèvesGutenMorgenliebeSchüler

instead of

BonjourmeslvesGutenMorgenliebeSchler
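
To put the two patterns side by side, here is a minimal sketch. The class and method names (AlnumRegexDemo, StripAscii, StripUnicode) and the French sample string are made up for illustration; only Regex.Replace and the two patterns come from the answer above. It also shows why string.Replace leaves the input untouched: it looks for the pattern as literal text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class AlnumRegexDemo
{
    // ASCII-only: removes everything except A-Z, a-z and 0-9.
    public static string StripAscii(string input) =>
        Regex.Replace(input, @"[^A-Za-z0-9]+", "");

    // Unicode-aware: keeps any letter (\p{L}) or number (\p{N}).
    public static string StripUnicode(string input) =>
        Regex.Replace(input, @"[^\p{L}\p{N}]+", "");

    public static void Main()
    {
        string sample = "Hello there(hello#)";

        // string.Replace searches for the literal text "[^A-Za-z0-9]+",
        // which does not occur in the sample, so nothing changes.
        Console.WriteLine(sample.Replace(@"[^A-Za-z0-9]+", "")); // Hello there(hello#)

        Console.WriteLine(StripAscii(sample)); // Hellotherehello

        // Assumed sample text with accented letters (written as escapes).
        string french = "Bonjour mes \u00E9l\u00E8ves!";
        Console.WriteLine(StripAscii(french));   // Bonjourmeslves
        Console.WriteLine(StripUnicode(french)); // Bonjourmesélèves
    }
}
```

If the same pattern is applied many times, constructing one Regex instance and reusing it avoids re-parsing the pattern on every call.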

6 Comments

I tried this...it's very close but it seems to leave spaces in - I want them stripped too! Thanks.
No, it doesn't. Unless you have special spaces in there, like the non-breaking space (U+00A0, character code 160), and both versions remove those, too.
Hmmm I tried the following: string t = "hello there - ( efrwef )"; string a = "New: " + t.Replace(@"[^\p{L}\p{N}]+", ""); and a ends up being "hello there - ( efrwef )" - completely unchanged - I know I'm doing something wrong here.
string.Replace doesn't take a regex.
AHHH that would explain all. So, how could I do what is described above with regex bits and pieces in C#?
23

You can use Linq to filter out required characters:

  String source = "Hello there(hello#)";

  // "Hellotherehello"
  String result = new String(source
    .Where(ch => Char.IsLetterOrDigit(ch))
    .ToArray());

Or

  String result = String.Concat(source
    .Where(ch => Char.IsLetterOrDigit(ch)));  

So you don't need regular expressions at all.
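
One caveat: Char.IsLetterOrDigit also accepts non-ASCII letters and digits, while the question asks for a-z, A-Z and 0-9 only. The sketch below contrasts the two filters. The class name, method names and sample input are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical demo class; the names are illustrative only.
public static class AlnumLinqDemo
{
    // True only for the ASCII ranges a-z, A-Z and 0-9.
    private static bool IsAsciiAlnum(char c) =>
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');

    // Keeps ASCII letters and digits only.
    public static string KeepAsciiAlnum(string source) =>
        string.Concat(source.Where(c => IsAsciiAlnum(c)));

    // Keeps every character that Unicode classifies as a letter or decimal digit.
    public static string KeepUnicodeAlnum(string source) =>
        string.Concat(source.Where(c => char.IsLetterOrDigit(c)));

    public static void Main()
    {
        // Assumed sample: 'a', U+00E1 (a with acute), '1', U+0663 (Arabic-Indic digit three), '#'.
        string sample = "a \u00E1 1 \u0663 #";

        Console.WriteLine(KeepAsciiAlnum(sample).Length);   // 2 ("a1")
        Console.WriteLine(KeepUnicodeAlnum(sample).Length); // 4 (the accented letter and the Arabic-Indic digit are kept too)
    }
}
```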

3 Comments

Great addition! Would be interesting to know the relative performance of this to the Regex solution. Out of the gate, it reads a lot better.
A quick test in LinqPad suggests there's negligible difference between this and even a compiled Regex solution. Readability wins for me.
Looks really neat and readable. If the performance is the same, I'm using it, thanks. NB for new programmers like me: you need to add the line using System.Linq; at the top of the file for the C# compiler to recognise the method Where.
3

Or you can do this too:

    // requires: using System.Text; (for StringBuilder)
    public static string RemoveNonAlphanumeric(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= '0' && c <= '9')
                sb.Append(text[i]);
        }

        return sb.ToString();
    }

Usage:

string text = SomeClass.RemoveNonAlphanumeric("text LaLa (lol) á ñ $ 123 ٠١٢٣٤");

//text: textLaLalol123

4 Comments

While I like the general approach, it doesn't fit the requirement of only allowing A-Z,a-z,0-9. It allows other letters and digits too.
There are more than 10 digits in unicode too. ٠١٢٣٤ are some examples.
Sorry, but it's still wrong. ToLower uses the current locale. So when you run it in Turkey, it won't allow I, but allows İ instead. en.wikipedia.org/wiki/Dotted_and_dotless_I
@CodeInChaos wow... guess my laziness took me to do that. Fixed :)
2

The mistake made above was using Replace incorrectly (it doesn't take regex, thanks CodeInChaos).

The following code should do what was specified:

Regex reg = new Regex(@"[^\p{L}\p{N}]+");//Thanks to Tim Pietzcker for regex
string regexed = reg.Replace("Hello there(hello#)", "");

This gives:

regexed = "Hellotherehello"

Comments

2

Here is the same idea as a replace operation, written as an extension method:

public static class StringExtensions
{
    public static string ReplaceNonAlphanumeric(this string text, char replaceChar)
    {
        StringBuilder result = new StringBuilder(text.Length);

        foreach(char c in text)
        {
            if(c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= '0' && c <= '9')
                result.Append(c);
            else
                result.Append(replaceChar);
        }

        return result.ToString();
    } 
}

And test:

[TestFixture]
public sealed class StringExtensionsTests
{
    [Test]
    public void Test()
    {
        Assert.AreEqual("text_LaLa__lol________123______", "text LaLa (lol) á ñ $ 123 ٠١٢٣٤".ReplaceNonAlphanumeric('_'));
    }
}

Comments

1
var text = "Hello there(hello#)";

var rgx = new Regex("[^a-zA-Z0-9]");

text = rgx.Replace(text, string.Empty);

1 Comment

Welcome to SO. A little explanation always makes your answer more valuable. On SO, people tend to like to know why, instead of just how. ;)
-2

Use the following regex with Regex.Replace to strip all those characters from the string:

([^A-Za-z0-9\s])

3 Comments

'string.Replace()' does not take regex as an argument
@PostureOfLearning Thank you for your remark, but you should look at the question. The question is not about the Replace method, it is about the regex. The method usage is copied from the question itself, supplied with a helpful regex. Kindly take back your vote :)
I understand the question and I realize that the question also has invalid code. However, I accept invalid code in a question since they are trying to learn, but I find incorrect code in an answer not acceptable. It is an answer and should work. Your answer led me in the wrong direction when looking to solve my own problem. Having said this, if you want to change it I'll be happy to take back the vote ;)
-6

In .NET 4.0 you can use the IsNullOrWhiteSpace method of the String class to remove the so-called white space characters. Please take a look here: http://msdn.microsoft.com/en-us/library/system.string.isnullorwhitespace.aspx However, as @CodeInChaos pointed out, there are plenty of characters which could be considered letters and numbers. You can use a regular expression if you only want to find A-Za-z0-9.

1 Comment

Do yourself and SO a favor and remove this.
