How can I strip HTML tags from a string in ASP.NET?

Question

Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.

Example:

<ul><li>Hello</li></ul>

Output:

"Hello"

I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.

I would imagine that PHP strip_tags uses regex behind the scenes!
– stevehipwell, Commented Apr 24, 2009 at 13:02
@Daniel: because regex is very bad at that, especially if you have nesting.
– Joel Coehoorn, Commented Apr 24, 2009 at 13:03
Hmm, it doesn't look like PHP's strip_tags is particularly reliable either, going by the official notes and the comments: uk.php.net/strip_tags
– Zhaph - Ben Duguid, Commented May 14, 2009 at 20:53
possible duplicate of RegEx match open tags except XHTML self-contained tags
– Cole Tobin, Commented Oct 12, 2013 at 20:39
Does this answer your question? How do I remove all HTML tags from a string without knowing which tags are in it?
– Michael Freidgeim, Commented Jun 22, 2021 at 0:32

jpaugh · Accepted Answer · 2020-01-14 17:32:45Z

117

If it is just stripping all HTML tags from a string, this works ~~reliably~~ with regex as well. Replace:

<[^>]*(>|$)

with the empty string, globally. Don't forget to normalize the string afterwards, replacing:

[\s\r\n]+

with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.

Note:

There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.
The solution is technically safe, as in: The result will never contain anything that could be used to do cross site scripting or to break a page layout. It is just not very clean.
As with all things HTML and regex:
Use a proper parser if you must get it right under all circumstances.
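
The steps described above (remove tags, collapse whitespace, trim, optionally decode entities) could be sketched in C# roughly as follows. This is only a minimal sketch using `System.Text.RegularExpressions` and `System.Net.WebUtility`; the names `TagStripper` and `StripTags` are made up for illustration, and the limitation about `>` inside attribute values still applies. Note that `[\s\r\n]+` is equivalent to `\s+`, because `\s` already matches line breaks.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches a tag, or an unterminated tag that runs to the end of the input.
    private static readonly Regex Tag = new Regex(@"<[^>]*(>|$)");

    // Matches any run of whitespace, including line breaks.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Hypothetical helper: strips tags, normalizes whitespace and
    // optionally decodes HTML character entities.
    public static string StripTags(string html, bool decodeEntities = false)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        string text = Tag.Replace(html, string.Empty);
        text = Whitespace.Replace(text, " ").Trim();

        // Decoding is optional and can reintroduce '<' and '>', so the result
        // must be encoded again before it is written into an HTML page.
        return decodeEntities ? WebUtility.HtmlDecode(text) : text;
    }
}
```

With this sketch, `TagStripper.StripTags("<ul><li>Hello</li></ul>")` returns `Hello`.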

edited Jan 14, 2020 at 17:32

jpaugh

answered Apr 24, 2009 at 13:03

Tomalak

6 Comments

Yahoo Serious Over a year ago

Although not requested, I think a lot of readers will also want to strip HTML encoding, like &quot;. I combine it with WebUtility.HtmlDecode for that (which in turn will not remove tags). Use it after tag removal, since it may turn &gt; and &lt; into > and <. E.g. WebUtility.HtmlDecode(Regex.Replace(myTextVariable, "<[^>]*(>|$)", string.Empty))

SearchForKnowledge Over a year ago

@YahooSerious Thank you for providing an example. This works great. Thank you.

Bojangles Over a year ago

Html Agility Pack is the way to go, I used it way back in webforms to strip entire web pages to use content!

user70568 Over a year ago

@YahooSerious this will allow an XSS vector in, however: &lt;script&gt;alert("XSS");&lt;/script&gt; will not be sanitized by the regex, but HtmlDecode converts it to <script>alert("XSS");</script>

Tomalak Over a year ago

@Heather Very good point. HTML tag stripping would have to be done again after entity decoding.
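
A minimal sketch of the order discussed in these comments: strip, decode, then strip again. It uses only `Regex` and `WebUtility`; `DecodedTagStripper` and `StripDecodeStrip` are hypothetical names. The second pass is lossy by design: legitimate text such as `1 &lt; 2` decodes to `1 < 2`, and everything from the `<` onwards is then removed.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class DecodedTagStripper
{
    private static readonly Regex Tag = new Regex(@"<[^>]*(>|$)");

    // Hypothetical helper illustrating strip -> decode -> strip.
    public static string StripDecodeStrip(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // 1. Remove the literal tags.
        string stripped = Tag.Replace(html, string.Empty);

        // 2. Decode entities; this may turn &lt;script&gt; into <script>.
        string decoded = WebUtility.HtmlDecode(stripped);

        // 3. Remove any tags that decoding has just produced.
        return Tag.Replace(decoded, string.Empty);
    }
}
```

The result is plain text, not HTML, so it should still be HTML-encoded whenever it is written back into a page.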

James McCormack · Accepted Answer · 2011-09-16 15:25:56Z

78

Go download HtmlAgilityPack, now! ;) Download link

This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner text of all nodes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .NET libraries out there.

Here is a sample:

    // resultsStream: a Stream containing the HTML (defined elsewhere).
    string htmlContents = new System.IO.StreamReader(resultsStream, Encoding.UTF8, true).ReadToEnd();

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlContents);
    if (doc == null) return null;

    // Concatenate the text of every top-level node.
    string output = "";
    foreach (var node in doc.DocumentNode.ChildNodes)
    {
        output += node.InnerText;
    }

edited Sep 16, 2011 at 15:25

James McCormack

answered May 14, 2009 at 20:33

Serapth

3 Comments

jessehouwing Over a year ago

you can even query every text() node, trim the contents and string.Join those with space. IEnumerable<string> allText = doc.DocumentNode.SelectNodes("//text()").Select(n => n.InnerText.Trim())

jessehouwing Over a year ago

or simply use doc.DocumentNode.InnerText, though this has some issues with whitespace handling, it seems...

avesse Over a year ago

Why the if (doc == null) check? This is always false, not so?

Ben · Accepted Answer · 2010-10-13 22:51:46Z

70

Regex.Replace(htmlText, "<.*?>", string.Empty);

edited Oct 13, 2010 at 22:51

Ben

answered Apr 24, 2009 at 13:06

user95144

2 Comments

ChrisF Over a year ago

Has many issues: it doesn't deal with attributes having < or > in them, and it doesn't do well with tags that span more than one line unless run with RegexOptions.Singleline.

Paul Kienitz Over a year ago

Noooo, use "<[^>]*>".
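
To illustrate the difference these two comments point at, here is a small comparison sketch. The class and member names are made up for illustration. By default `.` does not match `\n`, so `<.*?>` misses a tag that spans lines, while `RegexOptions.Singleline` or a negated character class handles it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // A tag whose attributes continue on a second line.
    public const string Sample = "<a\nhref=\"#\">link</a>";

    // '.' does not match '\n' by default, so the opening tag is left in place.
    public static string LazyDot(string html)
    {
        return Regex.Replace(html, "<.*?>", string.Empty);
    }

    // RegexOptions.Singleline lets '.' match '\n' as well.
    public static string LazyDotSingleline(string html)
    {
        return Regex.Replace(html, "<.*?>", string.Empty, RegexOptions.Singleline);
    }

    // A negated character class matches line breaks without any option.
    public static string NegatedClass(string html)
    {
        return Regex.Replace(html, "<[^>]*>", string.Empty);
    }
}
```

For `Sample`, `LazyDot` leaves the opening tag untouched and only removes `</a>`, whereas the other two variants return `link`. None of them copes with `>` inside an attribute value.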

Fred · Accepted Answer · 2013-10-22 09:52:26Z

11

protected string StripHtml(string Txt)
{
    return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
}    

Protected Function StripHtml(Txt as String) as String
    Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function

edited Oct 22, 2013 at 9:52

Fred

answered Mar 2, 2012 at 18:22

meramez

1 Comment

ChrisF Over a year ago

Doesn't work for lots of cases including non-unix linebreaks.

Michael Tipton · Accepted Answer · 2009-11-05 17:16:51Z

6

I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable. In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:


System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;

answered Nov 5, 2009 at 17:16

Michael Tipton

2 Comments

Axarydax Over a year ago

this doesn't seem to work, I tested it with simple InnerHtml="<b>foo</b>"; and InnerText has value "<b>foo</b>" :(

saille Over a year ago

Don't do this. This solution injects un-encoded html directly into the output. This would leave you wide open to Cross Site Scripting attacks - you have just allowed anyone that can change the html string to inject any arbitrary html and javascript into your application!

Oleks · Accepted Answer · 2011-06-08 15:50:47Z

5

I have written a pretty fast method in C# which beats the hell out of the Regex. It is hosted in an article on CodeProject.

Besides better performance, its advantages include the ability to replace named and numbered HTML entities (such as &amp; and &#203;), replacement of comment blocks, and more.

Please read the related article on CodeProject.

Thank you.

edited Jun 8, 2011 at 15:50

Oleks

answered Apr 24, 2009 at 17:54

Andrei Rînea

Comments

Bucket · Accepted Answer · 2012-11-05 11:55:38Z

For those of you who can't use HtmlAgilityPack, .NET's XML reader is an option. It can fail even on valid HTML (anything that is not well-formed XML, such as an unclosed <br>), so always add a catch with a regex fallback. Note that this is NOT fast, but it does provide a nice opportunity for old-school step-through debugging.

public static string RemoveHTMLTags(string content)
{
    var cleaned = string.Empty;
    try
    {
        StringBuilder textOnly = new StringBuilder();
        // Wrap the fragment in a root element so that it parses as a single XML document.
        using (var reader = XmlReader.Create(new System.IO.StringReader("<xml>" + content + "</xml>")))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                    textOnly.Append(reader.ReadContentAsString());
            }
        }
        cleaned = textOnly.ToString();
    }
    catch
    {
        // A tag is probably not closed. Fall back to a regex-based clean.
        string textOnly = string.Empty;
        Regex tagRemove = new Regex(@"<[^>]*(>|$)");
        Regex compressSpaces = new Regex(@"[\s\r\n]+");
        textOnly = tagRemove.Replace(content, string.Empty);
        textOnly = compressSpaces.Replace(textOnly, " ");
        cleaned = textOnly;
    }

    return cleaned;
}

Oleks · Accepted Answer · 2011-06-08 15:50:21Z

3

string result = Regex.Replace(anytext, @"<(.|\n)*?>", string.Empty);

edited Jun 8, 2011 at 15:50

Oleks

answered May 14, 2009 at 20:26

Ahmet BUTUN

Comments

saille · Accepted Answer · 2015-05-27 23:49:20Z

I've looked at the regex-based solutions suggested here, and they don't fill me with any confidence except in the most trivial cases. An angle bracket in an attribute is all it would take to break them, let alone malformed HTML from the wild. And what about entities like &amp;? If you want to convert HTML into plain text, you need to decode entities too.

So I propose the method below.

Using HtmlAgilityPack, this extension method efficiently strips all HTML tags from an HTML fragment. It also decodes HTML entities like &amp;. It returns just the inner text items, with a new line between each text item.

public static string RemoveHtmlTags(this string html)
{
    if (String.IsNullOrEmpty(html))
        return html;

    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    if (doc.DocumentNode == null || doc.DocumentNode.ChildNodes == null)
    {
        return WebUtility.HtmlDecode(html);
    }

    var sb = new StringBuilder();

    var i = 0;

    foreach (var node in doc.DocumentNode.ChildNodes)
    {
        var text = node.InnerText.SafeTrim();

        if (!String.IsNullOrEmpty(text))
        {
            sb.Append(text);

            if (i < doc.DocumentNode.ChildNodes.Count - 1)
            {
                sb.Append(Environment.NewLine);
            }
        }

        i++;
    }

    var result = sb.ToString();

    return WebUtility.HtmlDecode(result);
}

public static string SafeTrim(this string str)
{
    if (str == null)
        return null;

    return str.Trim();
}

If you are really serious, you'd want to ignore the contents of certain HTML tags too (<script>, <style>, <svg>, <head>, <object> come to mind!) because they probably don't contain readable content in the sense we are after. What you do there will depend on your circumstances and how far you want to go, but using HtmlAgilityPack it would be pretty trivial to whitelist or blacklist selected tags.

If you are rendering the content back to an HTML page, make sure you understand XSS vulnerabilities and how to prevent them, i.e. always encode any user-entered text that gets rendered back onto an HTML page (> becomes &gt; etc.).
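
As a minimal sketch of that last point, plain text can be encoded with `System.Net.WebUtility.HtmlEncode` just before it is written into a page. The wrapper class and method name below are hypothetical; in a real ASP.NET application the view engine or controls may already encode output for you, so treat this as an illustration of the principle only.

```csharp
using System;
using System.Net;

public static class OutputEncoding
{
    // Encodes plain text so that it is rendered literally inside an HTML page.
    public static string ToHtmlText(string plainText)
    {
        return WebUtility.HtmlEncode(plainText ?? string.Empty);
    }
}
```

For example, `OutputEncoding.ToHtmlText("1 < 2 & 3 > 2")` yields `1 &lt; 2 &amp; 3 &gt; 2`, which a browser displays as the original text instead of interpreting it as markup.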

Annie · Accepted Answer · 2013-11-18 01:09:09Z

For those who are complaining about Michael Tipton's solution not working, here is the .NET 4+ way of doing it:

public static string StripTags(this string markup)
{
    try
    {
        StringReader sr = new StringReader(markup);
        XPathDocument doc;
        using (XmlReader xr = XmlReader.Create(sr,
                           new XmlReaderSettings()
                           {
                               ConformanceLevel = ConformanceLevel.Fragment
                               // for multiple roots
                           }))
        {
            doc = new XPathDocument(xr);
        }

        return doc.CreateNavigator().Value; // .Value is similar to .InnerText of  
                                           //  XmlDocument or JavaScript's innerText
    }
    catch
    {
        return string.Empty;
    }
}

Karan · Accepted Answer · 2017-03-17 06:58:38Z

1

using System.Text.RegularExpressions;

string str = Regex.Replace(HttpUtility.HtmlDecode(HTMLString), "<.*?>", string.Empty);

edited Mar 17, 2017 at 6:58

user3559349

answered Mar 17, 2017 at 6:33

Karan

Comments

Yepeekai · Accepted Answer · 2019-11-21 16:18:25Z

1

You can also do this with AngleSharp, which is an alternative to HtmlAgilityPack (not that HAP is bad). It is easier to use than HAP for getting the text out of an HTML source.

var parser = new HtmlParser();
var htmlDocument = parser.ParseDocument(source);
var text = htmlDocument.Body.Text();

You can take a look at the key features section where they make a case at being "better" than HAP. I think for the most part, it is probably overkill for the current question but still, it is an interesting alternative.

edited Nov 21, 2019 at 16:18

answered Nov 21, 2019 at 16:04

Yepeekai

Comments

Yuksel Daskin · Accepted Answer · 2016-04-07 09:00:55Z

For the second parameter of strip_tags, i.e. keeping some tags, you may need some code like this using HtmlAgilityPack:

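public string StripTags(HtmlNode documentNode, IList<string> keepTags)
{
    var result = new StringBuilder();
    foreach (var childNode in documentNode.ChildNodes)
    {
        if (childNode.Name.ToLower() == "#text")
        {
            result.Append(childNode.InnerText);
        }
        else
        {
            if (!keepTags.Contains(childNode.Name.ToLower()))
            {
                // Tag is not kept: emit only its recursively stripped content.
                result.Append(StripTags(childNode, keepTags));
            }
            else
            {
                // Tag is kept: keep its markup and strip only its content.
                // string.Replace throws on an empty search string, so empty elements are emitted as-is.
                var inner = childNode.InnerHtml;
                result.Append(inner.Length == 0
                    ? childNode.OuterHtml
                    : childNode.OuterHtml.Replace(inner, StripTags(childNode, keepTags)));
            }
        }
    }
    return result.ToString();
}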

More explanation on this page: http://nalgorithm.com/2015/11/20/strip-html-tags-of-an-html-in-c-strip_html-php-equivalent/

p.s.w.g · Accepted Answer · 2014-05-14 21:44:11Z

-5

Simply use string.StripHTML();

edited May 14, 2014 at 21:44

p.s.w.g

answered May 14, 2014 at 21:09

user3638478

1 Comment

Sven Grosen Over a year ago

As @Serpiton points out, there isn't such a method in the BCL. Could you point to an implementation of this method or provide your own?
