2

I need to replace every run of multiple whitespace characters in a document with a single space. It doesn't matter whether they are spaces, tabs or newlines: any combination of any kind of whitespace needs to be collapsed to a single space.

Let's say we have the string: "Hello,\t \t\n  \t    \n world", (where \t and \n represent tabs and newlines respectively) then I'd need it to become "Hello, world".

I'm so completely bewildered by regex more generally that I ended up just asking.

Considerations:

  • I have no control over the document, since it could be any document on the internet.

  • I'm using C#, so if anyone knows how to do this in C# specifically, that would be even more awesome.

  • I don't really have to use regex (before someone asks), but I figured it's probably the optimal way, since regex is designed for this sort of stuff, and my own strpos/str_replace/substr soup would probably not perform as well. Performance is important on this one so what I'm essentially looking for is an efficient way to do this to any random text file on the internet (remember, I can't predict the size!).

Thanks in advance!

5 Answers

11
newString = Regex.Replace(oldString, @"\s+", " ");

The "\s" is a regex shorthand character class that matches any whitespace character, and the "+" means "one or more". The call replaces each run of whitespace with a single space character.
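
If this runs many times, it may be worth building the Regex once and reusing it. The following is only a minimal sketch of that idea: the class name RegexDemo and the method name CollapseWhitespace are made up for illustration, and whether RegexOptions.Compiled actually helps depends on the workload, so measure before relying on it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexDemo
{
    // Hypothetical helper: the Regex is constructed once and reused across calls.
    // RegexOptions.Compiled generally costs more at construction time in exchange
    // for faster matching; treat it as an assumption to benchmark, not a given.
    private static readonly Regex WhitespaceRun =
        new Regex(@"\s+", RegexOptions.Compiled);

    // Replaces every run of one or more whitespace characters with a single space.
    public static string CollapseWhitespace(string input)
    {
        if (input == null) return null;
        return WhitespaceRun.Replace(input, " ");
    }
}
```

Called with the string from the question, RegexDemo.CollapseWhitespace("Hello,\t \t\n  \t    \n world") returns "Hello, world". Leading and trailing runs are also reduced to one space rather than removed.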

4

You may find this SO answer useful:

How do I replace multiple spaces with a single space in C#?

Adapting the answer to also replace tabs and newlines is relatively straightforward:

// Requires: using System.Text.RegularExpressions;
RegexOptions options = RegexOptions.None;
Regex regex = new Regex(@"\s+", options);
tempo = regex.Replace(tempo, " ");

2 Comments

Check out the answer by Matt in the above link as the accepted solution looks like it only replaces the space character, not newlines and tabs. The '\s' in the pattern is what tells it to match on any whitespace character.
I looked before I asked, I swear, I looked! Thanks a bunch, that helped me out. :)
1

As someone who sympathizes with Jamie Zawinski's position on Regex, I'll offer an alternative for what it's worth.

Not wanting to be religious about it, but I'd say it's faster than Regex, though whether you'll ever be processing strings long enough to see the difference is another matter.

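    // Requires: using System.Text;
    // Collapses each run of whitespace to a single space. Note that a leading
    // run becomes one space, while a trailing run is dropped entirely.
    public static string CompressWhiteSpace(string value)
    {
        if (value == null) return null;

        bool inWhiteSpace = false;
        StringBuilder builder = new StringBuilder(value.Length);

        foreach (char c in value)
        {
            if (Char.IsWhiteSpace(c))
            {
                // Remember the run; nothing is written until it ends.
                inWhiteSpace = true;
            }
            else
            {
                // The run just ended: emit one space, then the character.
                if (inWhiteSpace) builder.Append(' ');
                inWhiteSpace = false;
                builder.Append(c);
            }
        }
        return builder.ToString();
    }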

0
I would suggest you replace your chomp with
 $line =~ s/\s+$//;

which will strip off all trailing whitespace: tabs, spaces, newlines and carriage returns.

Taken from: http://www.wellho.net/forum/Perl-Programming/New-line-characters-beware.html

I'm aware it's Perl, but it should be helpful enough for you.

0

Actually I think an extension method would probably be more efficient as you don't have the state machine overhead of the regex. Essentially, it becomes a very specialized pattern matcher.

public static string Collapse( this string source )
{
    if (string.IsNullOrEmpty( source ))
    {
        return source;
    }

    StringBuilder builder = new StringBuilder();
    bool inWhiteSpace = false;
    bool sawFirst = false;
    foreach (var c in source)
    {
        if (char.IsWhiteSpace(c))
        {
            inWhiteSpace = true;
        }
        else
        {
            // only output a whitespace if followed by non-whitespace
            // except at the beginning of the string
            if (inWhiteSpace && sawFirst)
            {
                builder.Append(" ");
            }
            inWhiteSpace = false;
            sawFirst = true;
            builder.Append(c);
        }
    }
    return builder.ToString();
}
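
Since the question mentions documents of unpredictable size, the same state-tracking loop could be applied to a stream instead of an in-memory string. This is a rough sketch under that assumption: the WhitespaceCollapser class and its Collapse(TextReader, TextWriter) method are hypothetical names, and it keeps the behavior of the extension method above (leading and trailing whitespace is dropped).

```csharp
using System;
using System.IO;

public static class WhitespaceCollapser
{
    // Hypothetical helper: copies characters from reader to writer, emitting a
    // single space for each whitespace run that sits between two non-whitespace
    // characters. Only one character is held in memory at a time.
    public static void Collapse(TextReader reader, TextWriter writer)
    {
        bool inWhiteSpace = false;
        bool sawFirst = false;
        int next;

        // TextReader.Read() returns -1 once the input is exhausted.
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (char.IsWhiteSpace(c))
            {
                inWhiteSpace = true;
            }
            else
            {
                // Skip the separator before the very first visible character.
                if (inWhiteSpace && sawFirst)
                {
                    writer.Write(' ');
                }
                inWhiteSpace = false;
                sawFirst = true;
                writer.Write(c);
            }
        }
    }
}
```

For a file, the reader and writer could be a StreamReader and a StreamWriter; for a quick check, a StringReader over the question's example and a StringWriter yield "Hello, world".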
