Remove words in string from words in array with c#

Question

I need to remove words from a string based on a set of words:

Words I want to remove:

DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND

If I receive a string like:

EDIT: This string has already been "cleaned" of any symbols.

THIS IS AN AMAZING WEBSITE AND LAYOUT

The result should be:

THIS IS AMAZING WEBSITE LAYOUT

So far I have:

public static string StringWordsRemove(string stringToClean, string wordsToRemove)
{
    string[] splitWords = wordsToRemove.Split(new Char[] { ' ' });

    string pattern = "";

    foreach (string word in splitWords)
    {
        pattern = @"\b" + word + "\b";
        stringToClean = Regex.Replace(stringToClean, pattern, "");
    }

    return stringToClean;
}

But it's not removing the words. Any idea why?

I don't know if this is the most efficient way to do it. Maybe I should put the words in an array to avoid splitting them every time?

Thanks

I don't know C# that well, but should the second "\b" have an @ in front?
– user21926, Commented Jul 16, 2013 at 14:09
To all the answerers whose solutions support just this example: you could just do return "THIS IS AMAZING WEBSITE LAYOUT";
– Jodrell, Commented Jul 16, 2013 at 14:31
@Jodrell, but you always have some special preconditions, such as no special characters.
– Viktor Mellgren, Commented Jul 16, 2013 at 14:56

Fung · Accepted Answer · 2013-07-16 14:36:28Z

9

private static List<string> wordsToRemove =
    "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' ').ToList();

public static string StringWordsRemove(string stringToClean)
{
    return string.Join(" ", stringToClean.Split(' ').Except(wordsToRemove));
}

Modification to handle punctuation:

public static string StringWordsRemove(string stringToClean)
{
    // Define how to tokenize the input string, i.e. space only or punctuations also
    return string.Join(" ", stringToClean
        .Split(new[] { ' ', ',', '.', '?', '!' }, StringSplitOptions.RemoveEmptyEntries)
        .Except(wordsToRemove));
}
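
One caveat worth noting: Enumerable.Except produces a set difference, so it also drops repeated words from the input (for example, a second "NEW" in "NEW YORK AND NEW JERSEY"). If repeats should survive, a filter with Where is one option. The following is a minimal sketch, not part of the original answer; the class name WordFilter and the case-insensitive comparer are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordFilter
{
    // Same word list as in the question, held in a set for fast lookups.
    // Assumption: matching is case-insensitive (the original compares exact case).
    private static readonly HashSet<string> WordsToRemove = new HashSet<string>(
        "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' '),
        StringComparer.OrdinalIgnoreCase);

    public static string StringWordsRemove(string stringToClean)
    {
        // Where keeps every token that is not in the set, including repeats.
        var kept = stringToClean
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !WordsToRemove.Contains(word));

        return string.Join(" ", kept);
    }
}
```

With this version, WordFilter.StringWordsRemove("NEW YORK AND NEW JERSEY") keeps both occurrences of NEW, whereas the Except version would return only the first.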

edited Jul 16, 2013 at 14:36

answered Jul 16, 2013 at 14:23

Fung


6 Comments

Jodrell Over a year ago

but, what if stringToClean has punctuation?

Patrick Over a year ago

Hi, thanks for your help. I have chosen your answer for being the fastest, with a no-iteration solution. Regards.

Jodrell Over a year ago

what about all the punctuation like ", £, $, %, ^, &, (, ), -, _, +, =, [, ], {, }, :; ;, @, #, ~ etc. etc.

Fung Over a year ago

@Jodrell, if you have a very limited set, you can plug them all into the modified version's Split() call, though the OP said he has already removed them from the input. For the sake of discussion, I'd suggest solving the problem in 2 steps: 1) preprocess the string to remove any punctuation, 2) tokenize and remove the unwanted words. For 1), you can check the answer in here.

Jodrell Over a year ago

@Patrick, I did a performance test on my system with your test data; this Linq method is about 4x faster than the Regex approach in my answer. +1 from me. Test code available if anybody is interested. I suspect there might be some variation as stringToClean grows, but that wasn't the question.

Shaharyar · Answer · 2013-07-16 14:18:43Z

1

I just changed this line

pattern = @"\b" + word + "\b";

to this

pattern = @"\b" + word + @"\b"; //added '@'

and I got the result

THIS IS AMAZING WEBSITE LAYOUT

and it would be better if you use String.Empty instead of "" like:

stringToClean = Regex.Replace(stringToClean, pattern, String.Empty);
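
For context on why the one-character change matters: in a regular C# string literal, "\b" is the escape sequence for the backspace character (U+0008), while the verbatim literal @"\b" is a backslash followed by b, which the regex engine reads as a word boundary. A minimal sketch showing the difference (the class and method names here are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Pattern as built in the question: the trailing "\b" is a backspace character.
    public static string BrokenPattern(string word) => @"\b" + word + "\b";

    // Pattern with both boundaries written as verbatim literals.
    public static string FixedPattern(string word) => @"\b" + word + @"\b";

    // Removes every match of the given pattern from the input.
    public static string Remove(string input, string pattern) =>
        Regex.Replace(input, pattern, string.Empty);
}
```

BrokenPattern("AN") is five characters long and only matches text that contains a literal backspace after AN, so the question's input comes back unchanged. FixedPattern("AN") is six characters long and matches AN as a whole word.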

edited Jul 16, 2013 at 14:18

answered Jul 16, 2013 at 14:11

Shaharyar

4 Comments

Jodrell Over a year ago

I agree with your points, but you could reduce iteration by creating a unified expression. stackoverflow.com/a/17679108/659190

Patrick Over a year ago

Hi, thanks for your help. I have chosen @Fung's answer for being the fastest, and with no iteration. Regards.

Jodrell Over a year ago

@Patrick Fung's answer performs the iteration when you evaluate the Except.

Patrick Over a year ago

@Jodrell, Sorry I didn't know.

Lotok · Answer · 2013-07-16 14:18:54Z

1

I used LINQ

string exceptions = "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND";
string[] exceptionsList = exceptions.Split(' ');

string test = "THIS IS AN AMAZING WEBSITE AND LAYOUT";
string[] wordList = test.Split(' ');

string final = null;
var result = wordList.Except(exceptionsList).ToArray();
final = String.Join(" ",result);

Console.WriteLine(final);

answered Jul 16, 2013 at 14:18

Lotok

4 Comments

Viktor Mellgren Over a year ago

That's beautifully done! Just as explicit and accurate as functional programming should be!

Jodrell Over a year ago

However, if stringToClean contains word boundaries that are not spaces, like ',', '.', '?', '"', ... you are in a world of pain. Note, this set of word boundaries is large and growing.

Viktor Mellgren Over a year ago

more feedback then: Just do return String.Join(" ",result);

Patrick Over a year ago

Hi, thanks for your help. I have chosen @Fung's answer for being the fastest, with a no-iteration solution. Regards.

Anderung · Answer · 2013-07-16 14:18:40Z

0

public static string StringWordsRemove(string stringToClean, string wordsToRemove)
{
    string[] splitWords = wordsToRemove.Split(new Char[] { ' ' });
    string pattern = " (" + string.Join("|", splitWords) + ") ";
    string cleaned=Regex.Replace(stringToClean, pattern, " ");
    return cleaned;
}

answered Jul 16, 2013 at 14:18

Anderung

2 Comments

Jodrell Over a year ago

like my answer but later.

Patrick Over a year ago

Hi, thanks for your help. I have chosen Fung's answer for being the fastest, with a functional solution. Regards.

Dr Schizo · Answer · 2013-07-16 14:23:40Z

0

Output you get "THIS IS AMAZING WEBSITE LAYOUT".

I was getting an issue whereby it was leaving the word "D" (so it was THIS IS AN AMAZING WEBSITE D LAYOUT) in the result, because if you use replace it replaces only a certain part of the word. This removes the entire word if it matches one of the words you defined (I imagine this is what you want?).

        string[] tabooWords = "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' ');
        string text = "THIS IS AN AMAZING WEBSITE AND LAYOUT";
        string result = text;

        foreach (string word in text.Split(' '))
        {
            if (tabooWords.Contains(word.ToUpper()))
            {
                int start = result.IndexOf(word);
                result = result.Remove(start, word.Length);
            }
        }

edited Jul 16, 2013 at 14:23

answered Jul 16, 2013 at 14:17

Dr Schizo

2 Comments

Jodrell Over a year ago

won't this strip all the As, Es and Os etc?

Patrick Over a year ago

Hi, thanks for your help. I have chosen your answer for being the fastest, with a no-iteration solution that I can use with any WordsToRemoveString. Regards.

James R. · Answer · 2013-07-16 14:27:28Z

0

Or...

stringToClean = Regex.Replace(stringToClean, @"\bDE\b|\bDA\b|\bDAS\b|\bDO\b|\bDOS\b|\bAN\b|\bNAS\b|\bNO\b|\bNOS\b|\bEM\b|\bE\b|\bA\b|\bAS\b|\bO\b|\bOS\b|\bAO\b|\bAOS\b|\bP\b|\bLDA\b|\bAND\b", String.Empty);
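stringToClean = Regex.Replace(stringToClean, " {2,}", " "); // collapse runs of spaces left behind into one (replacing with String.Empty would join the neighbouring words)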

answered Jul 16, 2013 at 14:27

James R.

3 Comments

Jodrell Over a year ago

erm, why not type @"\b(DE|DA|DAS|DO|DOS|AN|NAS|NO|NOS|EM|E|A|AS|O|OS|AO|AOS|P|LDA|AND)\b"

James R. Over a year ago

@Jodrell - Because, that would be too easy. :) Thanks.

Patrick Over a year ago

Hi, thanks for your help. I have chosen Fung's answer for being the fastest, with a no-iteration solution that I can use with any WordsToRemoveString. Regards.

Jodrell · Answer · 2013-07-16 15:08:33Z

0

how about,

// make a pattern to match all words 
var pattern = string.Format(
    @"\b({0})\b",
    string.Join("|", wordsToRemove.Split(new[] { ' ' })));

// pattern will be of the form "\b(badword1|badword2|...)\b"

// remove all the bad words from the string in one go.    
var cleanString = Regex.Replace(stringToClean, pattern, string.Empty);

// normalise the white space in the string (one space at a time)
var normalisedString = Regex.Replace(cleanString, @"\s+", " ");

The first line makes a pattern that matches any of the words to remove. The second line replaces them all at once which saves needless iteration. The third line normalises the white space in the string.
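
The three steps can be wrapped into a method with the same signature as in the question. The following is a sketch under a few assumptions that are not in the original answer: each word is passed through Regex.Escape in case the list ever contains regex metacharacters, the group is non-capturing, and the result is trimmed so that a removed first or last word leaves no stray space. The class name RegexWordRemover is made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWordRemover
{
    public static string StringWordsRemove(string stringToClean, string wordsToRemove)
    {
        // Escape each word so any regex metacharacters are matched literally.
        var alternatives = wordsToRemove
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(Regex.Escape);

        // Pattern of the form \b(?:word1|word2|...)\b
        var pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";

        // Remove all listed words in one pass.
        var cleaned = Regex.Replace(stringToClean, pattern, string.Empty);

        // Collapse runs of whitespace, then drop any left at either end.
        return Regex.Replace(cleaned, @"\s+", " ").Trim();
    }
}
```

Because the pattern is rebuilt on every call, callers that reuse the same word list many times may prefer to build the Regex once and keep it in a field.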

edited Jul 16, 2013 at 15:08

answered Jul 16, 2013 at 14:20

Jodrell

5 Comments

Lotok Over a year ago

Functionality is important but so is readability. You should consider your formatting. Less isn't always more.

Patrick Over a year ago

@Jodrell Hi, thanks! But the blank spaces between the words are remaining. Any ideas? Regards.

Jodrell Over a year ago

@Patrick, that's because only the word is being replaced, not the spaces. Like in your example.

Jodrell Over a year ago

@Patrick, I've added a third line to normalise the whitespace.

Patrick Over a year ago

Hi, thanks for your help. I have chosen Fung's answer for being the fastest, with a functional solution. Regards.
