31

What is the best way to get a plain text string from an HTML string?

public string GetPlainText(string htmlString)
{
    // any .NET built in utility?
}

Thanks in advance

2
  • What do you mean by plain text? Commented May 3, 2011 at 13:48
  • @slandau: I want to output readable text from an HTML input. I'm not sure whether anything is needed beyond removing the tags... Commented May 3, 2011 at 13:52

4 Answers

46

You can use MSHTML, which is fairly forgiving of malformed markup:

// using mshtml;  (add a project reference to Microsoft.mshtml.dll)
HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
htmldoc2.write(new object[] { "<p>Plateau <i>of<i> <b>Leng</b><hr /><b erp=\"arp\">2 sugars please</b> <xxx>what? &amp; who?" });

string txt = htmldoc2.body.outerText;

Plateau of Leng 2 sugars please what? & who?


6 Comments

Works like a charm! Should be the accepted answer. Note that you need to add a reference to Microsoft.mshtml.dll first.
Are you sure this method is safe with HTML from untrusted sources? Does HTMLDocument.Write() execute passed scripts?
This answer is far more robust than the accepted answer (that uses just simple regex to remove tags) and is probably necessary for pages with any reasonable complexity.
@giladmayani You could use the accepted answer here stackoverflow.com/a/19414886/1911540 to strip out any <script> tags (and their inner contents) before using this method. This would also speed up downstream string operations, as the JavaScript contents can be quite lengthy.
@SpecialSauce That is true, but don't forget that, technically, JavaScript can exist outside <script> tags too, e.g. as attribute values: <button onclick="someJavascript();">
25

There's no built-in utility as far as I know, but depending on your requirements you could use regular expressions to strip out all of the tags:

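// using System.Text.RegularExpressions;
string htmlString = @"<p>I'm HTML!</p>";
// Regex.Replace returns a new string, so keep the result.
string plainText = Regex.Replace(htmlString, @"<(.|\n)*?>", "");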

3 Comments

@Andrey Haha that's a pretty awesome accepted answer. Luckily the OP didn't state exact requirements nor define the HTML string so this should catch most actual HTML cases, rather than XHTML.
A regex alone doesn't necessarily yield the final result. You need to convert at least &lt;, &gt; and &amp;. If your text contains other HTML character entities like &scaron; (š), you need to decode all of them as well.
1

Personally, I found a combination of regex and HttpUtility to be the best and shortest solution.

// using System.Web; using System.Text.RegularExpressions;
return HttpUtility.HtmlDecode(
    Regex.Replace(htmlString, @"<(.|\n)*?>", "")
);

This removes all the tags, and then decodes any leftover entities such as &lt; or &gt;.
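For anyone who wants to paste this into a project, here is a minimal, self-contained sketch of the same idea as a complete method. It assumes you are fine with System.Net.WebUtility.HtmlDecode in place of HttpUtility (so no System.Web reference is needed); the class name HtmlText and the sample input are only illustrative.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Removes anything that looks like a tag, then decodes character entities.
    public static string GetPlainText(string htmlString)
    {
        if (string.IsNullOrEmpty(htmlString))
        {
            return string.Empty;
        }

        // Non-greedy match from '<' up to the next '>', including across newlines.
        string withoutTags = Regex.Replace(htmlString, @"<(.|\n)*?>", "");

        // Decode entities such as &amp; and &lt; only after the tags are gone,
        // so that a decoded '<' is never mistaken for the start of a tag.
        return WebUtility.HtmlDecode(withoutTags);
    }
}

public static class Program
{
    public static void Main()
    {
        // Prints: Fish & chips <3
        Console.WriteLine(HtmlText.GetPlainText("<p>Fish &amp; <b>chips</b> &lt;3</p>"));
    }
}
```

Keep in mind that this leaves the contents of <script> and <style> elements in the output, since only the tags themselves are removed.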


0

There isn't a built-in .NET method for this. However, as @rudi_visser pointed out, it can be done with regular expressions.

If you need to do more than just remove the tags (e.g., turn &acirc; into â), you can use a more elaborate solution, like the one found here.
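As a rough idea of what a more elaborate version might look like (this is not the linked solution, just a sketch under my own assumptions), the steps could be: drop <script>/<style> blocks and comments, turn <br> and the end of common block elements into line breaks, strip the remaining tags, decode entities, and tidy the whitespace. The class name HtmlTextCleaner, the list of block elements, and the choice to treat non-breaking spaces as ordinary spaces are all hypothetical choices.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextCleaner
{
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Singleline lets '.' match newlines; IgnoreCase handles <SCRIPT>, <Br>, etc.
        const RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // 1. Drop <script> and <style> elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "", opts);

        // 2. Drop HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", "", opts);

        // 3. Turn <br> and the end of some block elements into line breaks.
        text = Regex.Replace(text, @"<br\s*/?>|</(p|div|li|tr|h[1-6])\s*>", "\n", opts);

        // 4. Remove every remaining tag (attributes go with it).
        text = Regex.Replace(text, @"<[^>]*>", "", opts);

        // 5. Decode character entities such as &amp; and &nbsp;.
        text = WebUtility.HtmlDecode(text);

        // 6. Collapse runs of spaces, tabs and non-breaking spaces,
        //    then remove spaces around line breaks.
        text = Regex.Replace(text, @"[ \t\u00A0]+", " ");
        text = Regex.Replace(text, @" ?\n ?", "\n");

        return text.Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        string html = "<style>p{color:red}</style><p>Hello&nbsp;<b>world</b></p>"
                    + "<script>alert('x');</script><p>a &lt; b<br/>done</p>";
        Console.WriteLine(HtmlTextCleaner.ToPlainText(html));
    }
}
```

This is still regex-based, so it can be fooled by things like a '>' inside an attribute value; for arbitrary real-world pages a proper HTML parser is likely the safer route.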

