7

I have a list of words:

string[] BAD_WORDS = { "xxx", "o2o" }; // My actual list is much bigger, about 100 words

I also have some text (usually short, at most 250 words) from which I need to remove all the BAD_WORDS.

I have tried this:

    foreach (var word in BAD_WORDS)
    {
        string w = string.Format(" {0} ", word);
        if (input.Contains(w))
        {
            while (input.Contains(w))
            {
                input = input.Replace(w, " ");
            }
        }
    }

However, if the text starts or ends with a bad word, that word is not removed. I added the surrounding spaces so that partial words are not matched. For example, "oxxx" should not be removed, since it is not an exact match for any of the BAD_WORDS.

Can anyone give me advice on this?

5 Comments

  • Looks like a job for regular expressions. Commented Sep 1, 2012 at 9:50
  • Why have you included the line string w = string.Format(" {0} ", word);? Commented Sep 1, 2012 at 9:54
  • What is your question? Your code looks fine. Just remove the if and add StartsWith and EndsWith checks. Commented Sep 1, 2012 at 9:54
  • @Nikhil Agrawal: To put spaces before and after. If you keep just the word, it will also match oxxx, for example. Commented Sep 1, 2012 at 9:54
  • Your if is unnecessary. It's better to start with the while, to avoid checking twice the first time. Commented Sep 1, 2012 at 12:39

7 Answers

18
string cleaned = Regex.Replace(input, "\\b" + string.Join("\\b|\\b", BAD_WORDS) + "\\b", "");

4 Comments

Hold a moment, I missed something... working... There, fixed. :)
Hee... :) Thanks Dementic. Do as I say, not as I do. I was only trying to say that all the nesting and LINQing and looping had a simple older/tried-and-true method.
+1 for catching words at start or other boundary conditions. As a bonus, if the replace needs to be done multiple times, the regex produced can be cached for repeated use. I'd use Regex.Escape though just in case BAD_WORDS contained something significant to the regex syntax.
Maybe not perfect code as others have pointed out improvements, but +1 for using regex word boundaries instead of splitting.
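
The suggestions in these comments (escaping the words and caching the regex) can be sketched roughly as follows. This is only an illustration, not the answerer's code: the class name BadWordFilter, the method name Clean, and the final whitespace cleanup are assumptions added here, and the two-word list stands in for the real one.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BadWordFilter
{
    // Placeholder list (assumption); the real list has about 100 entries.
    private static readonly string[] BadWords = { "xxx", "o2o" };

    // Built once and reused for every call. Regex.Escape protects against
    // entries that contain regex metacharacters such as '.' or '+'.
    private static readonly Regex BadWordPattern = new Regex(
        @"\b(?:" + string.Join("|", BadWords.Select(w => Regex.Escape(w))) + @")\b",
        RegexOptions.Compiled);

    public static string Clean(string input)
    {
        // Remove whole-word matches only; "oxxx" has no word boundary
        // before "xxx", so it is left alone.
        string stripped = BadWordPattern.Replace(input, "");

        // Collapse the runs of spaces left behind and trim the ends.
        return Regex.Replace(stripped, " {2,}", " ").Trim();
    }
}
```

Note that \b only matches between a word character and a non-word character, so an entry that begins or ends with punctuation would need a different boundary test.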
6

This is a great task for LINQ and the Split method. Try this:

return string.Join(" ", input.Split(' ').Where(w => !BAD_WORDS.Contains(w)));

3 Comments

As long as spaces suffice. This won't catch the words at the start or end, if followed by a newline, if followed by punctuation etc. If that case needs to be dealt with, the regex-based answers will do a better job.
This is adding extra spaces between words and I don't know why
The empty string was being joined with a space on both sides to the other items. I've edited the answer (and it's now neater!)
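
For reference, a variant of the Split approach that avoids the extra-space problem mentioned above might look like this. It is a sketch under assumptions: the names SplitFilter and Clean are made up for illustration, the word list is a placeholder, and a HashSet is swapped in for the array lookup. It still shares the limitation from the first comment, since a word followed by punctuation is not matched.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitFilter
{
    // Placeholder list (assumption); a HashSet keeps each lookup cheap
    // even when the real list has about 100 entries.
    private static readonly HashSet<string> BadWords =
        new HashSet<string> { "xxx", "o2o" };

    public static string Clean(string input)
    {
        // RemoveEmptyEntries drops the empty strings produced by
        // consecutive spaces, so no doubled spaces are joined back in.
        IEnumerable<string> kept = input
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => !BadWords.Contains(w));

        return string.Join(" ", kept);
    }
}
```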
1

You could use the StartsWith and EndsWith methods, like this:

while (input.Contains(w) || input.StartsWith(w) || input.EndsWith(w) || input.IndexOf(w) > 0)
{
   input = input.Replace(w, " ");
}

Hope this will fix your problem.

2 Comments

Don't you mean OR not AND? With your test it must simultaneously start, end and contain the word.
This will still catch partial words: with bad word 'aoooo' and actual word 'aoooome', it will remove the 'aoooo'.
1

Put a fake space before and after the string variable input. That way the first and last words will be detected.

input = " " + input + " ";

foreach (var word in BAD_WORDS)
{
    string w = string.Format(" {0} ", word);
    if (input.Contains(w))
    {
        while (input.Contains(w))
        {
            input = input.Replace(w, " ");
        }
    }
}

Then trim the string:

input = input.Trim();

1 Comment

That is a good idea and it will fix my code, but isn't there a nicer solution to this? The code seems a little weird to me; I wrote it because I had no other idea.
1

You can split the text into a list of words, then remove every word that is in the bad list, something like this:

List<string> myWords = input.Split(' ').ToList();
List<string> badWords = GetBadWords();

myWords.RemoveAll(word => badWords.Contains(word));
string Result = string.Join(" ", myWords);

0

Just wanted to point out that you could have done this with just a while inside your foreach, like so:

   foreach (var word in BAD_WORDS)
{
    while (input.Contains(String.Format(" {0} ", word);))
    {
        input = input.Replace(w, " ");
    }
}

There is no need for that if or the 'w' variable. In any case, I would have used the answer above mine, by Antonio Bakula; that approach was the first thing that came to my mind.

1 Comment

You are trying to replace w, which you have removed from the code. Without the w, it will also replace partial word matches.
0

According to the following post, the fastest way is to use Regex with a MatchEvaluator: "Replacing multiple characters in a string, the fastest way?"

        Regex reg = new Regex(@"(o2o|xxx)");
        MatchEvaluator eval = match =>
        {
            switch (match.Value)
            {
                case "o2o": return " ";
                case "xxx": return " ";
                default: throw new Exception("Unexpected match!");
            }
        };
        input = reg.Replace(input, eval);
