I've looked at the Regex based solutions suggested here, and they don't fill me with any confidence except in the most trivial cases. An angle bracket in an attribute is all it would take to break, let alone mal-formmed HTML from the wild. And what about entities like &? If you want to convert HTML into plain text, you need to decode entities too.
So I propose the method below.
Using HtmlAgilityPack, this extension method efficiently strips all HTML tags from an html fragment. Also decodes HTML entities like &. Returns just the inner text items, with a new line between each text item.
public static string RemoveHtmlTags(this string html)
{
if (String.IsNullOrEmpty(html))
return html;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
if (doc.DocumentNode == null || doc.DocumentNode.ChildNodes == null)
{
return WebUtility.HtmlDecode(html);
}
var sb = new StringBuilder();
var i = 0;
foreach (var node in doc.DocumentNode.ChildNodes)
{
var text = node.InnerText.SafeTrim();
if (!String.IsNullOrEmpty(text))
{
sb.Append(text);
if (i < doc.DocumentNode.ChildNodes.Count - 1)
{
sb.Append(Environment.NewLine);
}
}
i++;
}
var result = sb.ToString();
return WebUtility.HtmlDecode(result);
}
public static string SafeTrim(this string str)
{
if (str == null)
return null;
return str.Trim();
}
If you are really serious, you'd want to ignore the contents of certain HTML tags too (<script>, <style>, <svg>, <head>, <object> come to mind!) because they probably don't contain readable content in the sense we are after. What you do there will depend on your circumstances and how far you want to go, but using HtmlAgilityPack it would be pretty trivial to whitelist or blacklist selected tags.
If you are rendering the content back to an HTML page, make sure you understand XSS vulnerability & how to prevent it - i.e. always encode any user-entered text that gets rendered back onto an HTML page (> becomes > etc).