I need to remove words from a string based on a set of words:

Words I want to remove:

DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND

If I receive a string like:

EDIT: This string has already been "cleaned" of any symbols.

THIS IS AN AMAZING WEBSITE AND LAYOUT

The result should be:

THIS IS AMAZING WEBSITE LAYOUT

So far I have:

public static string StringWordsRemove(string stringToClean, string wordsToRemove)
{
    string[] splitWords = wordsToRemove.Split(new Char[] { ' ' });

    string pattern = "";

    foreach (string word in splitWords)
    {
        pattern = @"\b" + word + "\b";
        stringToClean = Regex.Replace(stringToClean, pattern, "");
    }

    return stringToClean;
}

But it's not removing the words. Any idea why?

I don't know if I'm doing this in the most efficient way. Maybe I should put the words in an array, just to avoid splitting them every time?

Thanks

5
  • What output are you getting from your code? Commented Jul 16, 2013 at 14:09
  • 10
    I don't know C# that well, but should the second "\b" have an @ in front? Commented Jul 16, 2013 at 14:09
  • 2
    What if the sentence starts with A? Commented Jul 16, 2013 at 14:13
  • To all the answerers whose solutions support just this example, you could just do return "THIS IS AMAZING WEBSITE LAYOUT"; Commented Jul 16, 2013 at 14:31
  • @Jodrell, but you always have some special preconditions, such as no special characters. Commented Jul 16, 2013 at 14:56

7 Answers 7

9
private static List<string> wordsToRemove =
    "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' ').ToList();

public static string StringWordsRemove(string stringToClean)
{
    return string.Join(" ", stringToClean.Split(' ').Except(wordsToRemove));
}

Modification to handle punctuation:

public static string StringWordsRemove(string stringToClean)
{
    // Define how to tokenize the input string, i.e. space only or punctuations also
    return string.Join(" ", stringToClean
        .Split(new[] { ' ', ',', '.', '?', '!' }, StringSplitOptions.RemoveEmptyEntries)
        .Except(wordsToRemove));
}
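
One thing worth keeping in mind with the approach above: Enumerable.Except is a set operation, so besides dropping the unwanted words it also collapses repeated words in the input. If duplicates have to survive, a filter with Where is one option. The sketch below is only an illustration; the class name WordFilter and the method name RemoveWords are made up here, and it assumes a case-sensitive match on space-separated tokens, as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordFilter
{
    // Hypothetical helper (not part of the answer above).
    // A HashSet gives constant-time lookups for each token.
    private static readonly HashSet<string> WordsToRemove = new HashSet<string>(
        "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' '));

    public static string RemoveWords(string stringToClean)
    {
        // Where keeps every token that is not in the set, repeats included;
        // Except would also remove repeated tokens because it returns a set difference.
        var kept = stringToClean
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !WordsToRemove.Contains(word));

        return string.Join(" ", kept);
    }
}
```

With this version, WordFilter.RemoveWords("BIG BIG AND BOLD") keeps both occurrences of BIG, whereas the Except-based version would keep only one.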

6 Comments

but, what if stringToClean has punctuation?
Hi, thanks for your help. I have chosen your answer for being the fastest, with a solution that needs no iteration. Regards.
what about all the punctuation like ", £, $, %, ^, &, (, ), -, _, +, =, [, ], {, }, :; ;, @, #, ~ etc. etc.
@Jodrell, if you have a very limited set, you can plug them all into the modified version's Split() call, though the OP said he has already removed them from the input. For the sake of discussion, I'd suggest solving the problem in 2 steps: 1) preprocess the string to remove any punctuation, 2) tokenize and remove the unwanted words. For 1), you can check the answer here.
@Patrick, I did a performance test on my system with your test data; this Linq method is about 4x faster than the Regex approach in my answer. +1 from me. Test code available if anybody is interested. I suspect there might be some variation as stringToClean grows, but that wasn't the question.
1

I just changed this line

pattern = @"\b" + word + "\b";

to this

pattern = @"\b" + word + @"\b"; //added '@' 

and I got the result

THIS IS AMAZING WEBSITE LAYOUT

and it would be better if you use String.Empty instead of "" like:

stringToClean = Regex.Replace(stringToClean, pattern, String.Empty);
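
For anyone wondering why the missing '@' matters: in a regular C# string literal the compiler turns "\b" into the backspace character (U+0008), so the regex engine never sees a word-boundary token at the end of the pattern. A verbatim literal (@"\b") keeps the backslash. The snippet below is just a small demonstration; the class name EscapeDemo and its methods are made up for illustration, and the Regex.Escape call is an extra precaution that is not in the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: removes one word using verbatim literals on both sides.
    public static string Clean(string input, string word)
    {
        // @"\b" is two characters (backslash + b), which the regex engine
        // reads as a word boundary.
        string pattern = @"\b" + Regex.Escape(word) + @"\b";
        return Regex.Replace(input, pattern, String.Empty);
    }

    // Shows what the compiler does with the two kinds of literal.
    public static bool RegularLiteralIsBackspace()
    {
        // "\b" is a single backspace character; @"\b" has length 2.
        return "\b" == "\u0008" && @"\b".Length == 2;
    }
}
```

Note that only the word itself is removed, so the space on each side of it stays in the string.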

4 Comments

I agree with your points, but you could reduce iteration by creating a unified expression. stackoverflow.com/a/17679108/659190
Hi, thanks for your help. I have chosen @Fung's answer for being the fastest, and with no iteration. Regards.
@Patrick Fung's answer performs the iteration when you evaluate the Except.
@Jodrell, Sorry I didn't know.
1

I used LINQ

string exceptions = "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND";
string[] exceptionsList = exceptions.Split(' ');

string test = "THIS IS AN AMAZING WEBSITE AND LAYOUT";
string[] wordList = test.Split(' ');

string final = null;
var result = wordList.Except(exceptionsList).ToArray();
final = String.Join(" ",result);

Console.WriteLine(final);

4 Comments

That's beautifully done! Just as explicit and accurate as functional programming should be!
However, if stringToClean contains word boundaries that are not spaces, like ',', '.', '?', '"', ... you are in a world of pain. Note that this set of word boundaries is large and growing.
more feedback then: Just do return String.Join(" ",result);
Hi, thanks for your help. I have chosen @Fung's answer for being the fastest, with a solution that needs no iteration. Regards.
0
public static string StringWordsRemove(string stringToClean, string wordsToRemove)
{
    string[] splitWords = wordsToRemove.Split(new Char[] { ' ' });
    string pattern = " (" + string.Join("|", splitWords) + ") ";
    string cleaned = Regex.Replace(stringToClean, pattern, " ");
    return cleaned;
}

2 Comments

like my answer but later.
Hi, thanks for your help. I have chosen Fung's answer for being the fastest, with a functional solution. Regards.
0

The output you get is "THIS IS AMAZING WEBSITE LAYOUT".

I ran into an issue whereby a stray "D" was left in the result (so it was THIS IS AN AMAZING WEBSITE D LAYOUT), because a plain replace removes only part of a word. This removes the entire word when it matches one of the words you defined (I imagine this is what you want?).

        string[] tabooWords = "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' ');
        string text = "THIS IS AN AMAZING WEBSITE AND LAYOUT";
        string result = text;

        foreach (string word in text.Split(' '))
        {
            if (tabooWords.Contains(word.ToUpper()))
            {
                int start = result.IndexOf(word);
                result = result.Remove(start, word.Length);
            }
        }

2 Comments

won't this strip all the As, Es and Os etc?
Hi, thanks for your help. I have chosen your answer for being the fastest, with a solution that needs no iteration and that I can use with any WordsToRemoveString. Regards.
0

Or...

stringToClean = Regex.Replace(stringToClean, @"\bDE\b|\bDA\b|\bDAS\b|\bDO\b|\bDOS\b|\bAN\b|\bNAS\b|\bNO\b|\bNOS\b|\bEM\b|\bE\b|\bA\b|\bAS\b|\bO\b|\bOS\b|\bAO\b|\bAOS\b|\bP\b|\bLDA\b|\bAND\b", String.Empty);
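// Collapse the runs of spaces left behind into a single space
// (replacing them with String.Empty would glue the neighbouring words together).
stringToClean = Regex.Replace(stringToClean, @"\s+", " ");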

3 Comments

erm, why not type @"\b(DE|DA|DAS|DO|DOS|AN|NAS|NO|NOS|EM|E|A|AS|O|OS|AO|AOS|P|LDA|AND)\b"
@Jodrell - Because, that would be too easy. :) Thanks.
Hi, thanks for your help. I have chosen Fung's answer for being the fastest, with a solution that needs no iteration and that I can use with any WordsToRemoveString. Regards.
0

how about,

// make a pattern to match all words 
var pattern = string.Format(
    @"\b({0})\b",
    string.Join("|", wordsToRemove.Split(new[] { ' ' })));

// pattern will be of the form "\b(badword1|badword2|...)\b"

// remove all the bad words from the string in one go.    
var cleanString = Regex.Replace(stringToClean, pattern, string.Empty);

// normalise the white space in the string (one space at a time)
var normalisedString = Regex.Replace(cleanString, @"\s+", " ");

The first line makes a pattern that matches any of the words to remove. The second line replaces them all at once which saves needless iteration. The third line normalises the white space in the string.
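
Put together as a method, the three steps could look roughly like the sketch below. It keeps the question's signature, but the class name SinglePassCleaner is made up for illustration, and two details are additions that are not in the snippet above: each word is passed through Regex.Escape in case the word list ever contains regex metacharacters, and the result is trimmed so that removing the first or last word does not leave a stray space.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SinglePassCleaner
{
    public static string StringWordsRemove(string stringToClean, string wordsToRemove)
    {
        // Escape each word so it is matched literally (an added precaution).
        var alternatives = wordsToRemove
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(word => Regex.Escape(word));

        // One pattern of the form \b(word1|word2|...)\b
        var pattern = @"\b(" + string.Join("|", alternatives) + @")\b";

        // Remove all listed words in a single pass.
        var cleaned = Regex.Replace(stringToClean, pattern, string.Empty);

        // Normalise the white space, then trim the ends (also an addition).
        return Regex.Replace(cleaned, @"\s+", " ").Trim();
    }
}
```

Called with the string and word list from the question, this should give THIS IS AMAZING WEBSITE LAYOUT.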

5 Comments

Functionality is important but so is readability. You should consider your formatting. Less isn't always more.
@Jodrell Hi, thanks! But the blank spaces between the words are still there. Any ideas? Regards.
@Patrick, that's because only the word is being replaced, not the spaces. Like in your example.
@Patrick, I've added a third line to normalise the whitespace.
Hi, thanks for your help. I have chosen Fung's answer for being the fastest, with a functional solution. Regards.
