Remove HTML tags from string including &nbsp in C#

Question

How can I remove all the HTML tags, including &nbsp;, using regex in C#? My string looks like this:

  "<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"

Don't use a regex; check out the HTML Agility Pack: stackoverflow.com/questions/846994/how-to-use-html-agility-pack
– Tim, Commented Oct 22, 2013 at 16:58
Thanks Tim, but the application is quite big and intact; adding or downloading the HTML Agility Pack won't work.
– rampuriyaaa, Commented Oct 22, 2013 at 17:00

Ravi K Thapliyal · Accepted Answer · 2013-10-22 17:08:21Z

214

If you can't use an HTML parser oriented solution to filter out the tags, here's a simple regex for it.

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, you should make a second pass with a regex that collapses runs of whitespace into a single space:

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");

answered Oct 22, 2013 at 17:08

Ravi K Thapliyal


12 Comments

Don Rolling Over a year ago

I haven't yet tested this as much as I will need to, but it worked better than I expected it to work. I'll post the method I wrote below.

iCollect.it Ltd Over a year ago

A lazy match (<[^>]+?> as per @David S.) might make this a tad faster, but just used this solution in a live project - very happy +1 :)

Mahesh Malpani Over a year ago

Regex.Replace(inputHTML, @"<[^>]+>|&nbsp|\n;", "").Trim(); \n is not getting removed
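A possible explanation for the comment above, sketched under the assumption that the intent was to strip both &nbsp; and line breaks: in `&nbsp|\n;` the semicolon has ended up in the newline alternative, so the pattern matches `&nbsp` (leaving the `;` behind) and a newline only when it is followed by `;`. The class and method names below are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class NewlinePatternDemo
{
    // Pattern from the comment: the alternatives are a tag, "&nbsp", or "\n;".
    public static string Misplaced(string html) =>
        Regex.Replace(html, @"<[^>]+>|&nbsp|\n;", "").Trim();

    // Semicolon moved back to the entity; \r?\n also covers Windows line endings.
    public static string Corrected(string html) =>
        Regex.Replace(html, @"<[^>]+>|&nbsp;|\r?\n", "").Trim();
}
```

For the input "a&nbsp;\nb", Misplaced keeps both the stray semicolon and the newline, while Corrected returns "ab".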

Tauseef Over a year ago

I would recommend adding a space rather than an empty string, since we are catching extra spaces anyway: Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", " ")

Ravi K Thapliyal Over a year ago

@Tauseef If you use a space in the first replace call, you may end up leaving spaces where there were none in the original input. Say you receive Sound<b>Cloud</b> as an input; you'll end up with Sound Cloud while it should've been stripped as SoundCloud because that's how it gets displayed in HTML.
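The trade-off discussed in the two comments above can be sketched as follows. This is only an illustration; the class and method names are hypothetical, and the second method assumes the whitespace-collapsing pass from the answer is applied afterwards.

```csharp
using System.Text.RegularExpressions;

public static class ReplacementDemo
{
    private const string Pattern = @"<[^>]+>|&nbsp;";

    // Replace tags and &nbsp; with nothing, as in the answer.
    public static string StripWithEmpty(string html) =>
        Regex.Replace(html, Pattern, "").Trim();

    // Replace tags and &nbsp; with a space, then collapse runs of whitespace.
    public static string StripWithSpace(string html) =>
        Regex.Replace(Regex.Replace(html, Pattern, " "), @"\s{2,}", " ").Trim();
}
```

With Sound<b>Cloud</b>, StripWithEmpty gives "SoundCloud" and StripWithSpace gives "Sound Cloud". With <div>a</div><div>b</div>, the results are "ab" and "a b" respectively, so neither replacement is right for every input: the empty string merges text from separate block elements, and the space splits text around inline elements.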


Don Rolling · Answer · 2014-07-31 14:50:46Z

34

I took @Ravi Thapliyal's code and made a method. It is simple and might not clean everything, but so far it is doing what I need it to do.

public static string ScrubHtml(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
    var step2 = Regex.Replace(step1, @"\s{2,}", " ");
    return step2;
}

answered Jul 31, 2014 at 14:50

Don Rolling


David S. · Answer · 2013-10-22 17:14:30Z

17

I've been using this function for a while. Removes pretty much any messy html you can throw at it and leaves the text intact.

        // Requires: using System; using System.Text; using System.Text.RegularExpressions; using System.Web;
        private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

        // Add characters that should not be removed to this regex.
        private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);

        public static String UnHtml(String html)
        {
            html = HttpUtility.UrlDecode(html);
            html = HttpUtility.HtmlDecode(html);

            html = RemoveTag(html, "<!--", "-->");
            html = RemoveTag(html, "<script", "</script>");
            html = RemoveTag(html, "<style", "</style>");

            //replace matches of these regexes with space
            html = _tags_.Replace(html, " ");
            html = _notOkCharacter_.Replace(html, " ");
            html = SingleSpacedTrim(html);

            return html;
        }

        private static String RemoveTag(String html, String startTag, String endTag)
        {
            Boolean bAgain;
            do
            {
                bAgain = false;
                Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
                if (startTagPos < 0)
                    continue;
                Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
                if (endTagPos <= startTagPos)
                    continue;
                html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
                bAgain = true;
            } while (bAgain);
            return html;
        }

        private static String SingleSpacedTrim(String inString)
        {
            StringBuilder sb = new StringBuilder();
            Boolean inBlanks = false;
            foreach (Char c in inString)
            {
                switch (c)
                {
                    case '\r':
                    case '\n':
                    case '\t':
                    case ' ':
                        if (!inBlanks)
                        {
                            inBlanks = true;
                            sb.Append(' ');
                        }   
                        continue;
                    default:
                        inBlanks = false;
                        sb.Append(c);
                        break;
                }
            }
            return sb.ToString().Trim();
        }

answered Oct 22, 2013 at 17:14

David S.


3 Comments

Jimmy Over a year ago

Just to confirm: the SingleSpacedTrim() function does the same thing as string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " "); from Ravi Thapliyal's answer?

David S. Over a year ago

@Jimmy as far as I can see, that regex doesn't catch single tabs or newlines like SingleSpacedTrim() does. That could be a desirable effect though, in that case just remove the cases as needed.
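To make the difference concrete, here is a small sketch comparing the two quantifiers. The names are hypothetical, and `\s+` is only an approximation of SingleSpacedTrim(): in .NET, `\s` matches more whitespace characters than the four cases handled in that method.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Only runs of two or more whitespace characters become a single space.
    public static string CollapseRuns(string s) =>
        Regex.Replace(s, @"\s{2,}", " ");

    // Every run, including a single tab or newline, becomes a single space.
    public static string CollapseAll(string s) =>
        Regex.Replace(s, @"\s+", " ");
}
```

For the input "a\tb\n\nc", CollapseRuns leaves the lone tab in place and returns "a\tb c", while CollapseAll returns "a b c".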

ArgisIsland Over a year ago

Nice, but it seems to replace single and double quotes with blank spaces as well, although they are not in the "notOkCharacter" list, or am I missing something there? Is this part of the decoding/encoding methods called at the beginning? What would be necessary to keep these characters intact?
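One reading of the code that would explain this: the character class starts with `^`, so it lists the characters that are kept, and anything not listed (quotes included) is replaced with a space. A minimal sketch of keeping quotes, assuming that is all that needs to change; the class and member names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class AllowedCharactersDemo
{
    // Character class from the answer: everything NOT listed is replaced.
    private static readonly Regex Original =
        new Regex(@"[^\w;&#@.:/\\?=|%!() -]");

    // Same class with ' and " added (a doubled "" is one quote in a verbatim string).
    private static readonly Regex KeepsQuotes =
        new Regex(@"[^\w;&#@.:/\\?=|%!()'"" -]");

    public static string WithOriginal(string text) => Original.Replace(text, " ");

    public static string WithQuotesKept(string text) => KeepsQuotes.Replace(text, " ");
}
```

With the original class, the text it's "ok" loses its apostrophe and both double quotes; with the extended class it comes back unchanged.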

MRP · Answer · 2014-06-11 06:27:50Z

4

var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
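Listing entities one by one in the pattern gets long quickly. As an alternative sketch (a different technique from the answer above, not a drop-in replacement), the tags can be stripped first and the remaining entities decoded with WebUtility.HtmlDecode from System.Net. Note that this decodes entities such as &raquo; into their characters rather than removing them. The class and method names are hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class EntityDecodeDemo
{
    public static string StripTagsAndDecode(string html)
    {
        // 1. Remove anything that looks like a tag.
        var withoutTags = Regex.Replace(html, @"<[^>]+>", "");
        // 2. Decode named and numeric entities (&nbsp; becomes U+00A0, &amp; becomes &).
        var decoded = WebUtility.HtmlDecode(withoutTags);
        // 3. Collapse whitespace runs; in .NET, \s also matches U+00A0.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

For example, "<div>hello</div><div>&nbsp; &nbsp;world &amp; co</div>" becomes "hello world & co".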

answered Jun 11, 2014 at 6:27

MRP


Sabique A Khan · Answer · 2019-04-09 05:02:45Z

2

I used @RaviThapliyal's and @Don Rolling's code with a small modification. Their code replaces &nbsp with an empty string, but &nbsp should be replaced with a space, so I added an additional step. It worked like a charm for me.

public static string FormatString(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>", "");        // strip tags
    var step2 = Regex.Replace(step1, @"&nbsp;", " ");        // entity becomes a space
    var step3 = Regex.Replace(step2, @"\s{2,}", " ").Trim(); // collapse runs, then trim the ends
    return step3;
}

I wrote &nbsp without the semicolon in the text above because Stack Overflow was rendering it as a space.

answered Apr 9, 2019 at 5:02

Sabique A Khan


Ehsan88 · Answer · 2016-01-04 19:54:16Z

1

Sanitizing an HTML document involves a lot of tricky things. This package may be of help: https://github.com/mganss/HtmlSanitizer

answered Jan 4, 2016 at 19:54

Ehsan88


2 Comments

Revious Over a year ago

I think it's more for protecting against XSS attacks than for normalizing HTML.

Ehsan88 Over a year ago

@Revious I think you are right. Maybe my answer is not related much to the OP's question as they did not mention the purpose of removing html tags. But if the purpose is to prevent attacks, as it is in many cases, then using an already developed sanitizer may be a better approach. BTW I have no knowledge about what the meaning of normalizing html is.

Jonesopolis · Answer · 2013-10-22 17:08:10Z

0

This:

(<.+?>|&nbsp;)

will match any tag or &nbsp;

string regex = @"(<.+?>|&nbsp;)";
var x = Regex.Replace(originalString, regex, "").Trim();

then x = hello

answered Oct 22, 2013 at 17:08

Jonesopolis


nivs1978 · Answer · 2018-05-16 06:54:00Z

0

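HTML is, in its basic form, close to XML. If your markup is well-formed, you could parse the text into an XmlDocument object and read InnerText on the root element to extract the text. This strips all HTML tags in any form and also decodes special characters like &lt; in one go. Note that the parser throws on markup that is not well-formed XML, such as an unclosed <br>, and on entities that XML does not predefine, such as &nbsp;.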
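A minimal sketch of the XmlDocument idea, under two assumptions that are not part of the answer: the fragment is wrapped in a hypothetical <root> element so that several top-level elements still form one document, and XmlException is caught because XmlDocument.LoadXml rejects input that is not well-formed XML (an unclosed <br>, or an entity XML does not predefine, such as &nbsp;).

```csharp
using System.Xml;

public static class XmlInnerTextDemo
{
    // Parses the fragment as XML and returns the concatenated text content.
    public static string InnerTextOf(string xhtmlFragment)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<root>" + xhtmlFragment + "</root>");
        return doc.DocumentElement.InnerText;
    }

    // Returns false instead of throwing when the parser rejects the input.
    public static bool TryInnerTextOf(string fragment, out string text)
    {
        try
        {
            text = InnerTextOf(fragment);
            return true;
        }
        catch (XmlException)
        {
            text = string.Empty;
            return false;
        }
    }
}
```

InnerTextOf("<div>a &lt; b</div><div><br/></div>") returns "a < b", while TryInnerTextOf returns false for "<div><br></div>" and for "<div>&nbsp;</div>", so the string from the question would need cleaning up before this approach could be used on it.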

answered May 16, 2018 at 6:54

nivs1978


mymiracl · Answer · 2022-08-09 04:40:54Z

0

I'm using this syntax to remove HTML tags along with &nbsp; (note that this is JavaScript, not C#):

SessionTitle: result[i].sessionTitle.replace(/<[^>]+>|&nbsp;/g, '')

edited Aug 9, 2022 at 4:40

mymiracl


answered Aug 3, 2022 at 4:39

Rohit Ratna


FelixSFD · Answer · 2017-02-10 17:59:44Z

-1

(<([^>]+)>|&nbsp;)

You can test it here: https://regex101.com/r/kB0rQ4/1

edited Feb 10, 2017 at 17:59

FelixSFD


answered Feb 10, 2017 at 17:58

Ananth Ram
