Get plain text from HTML in .NET

Question

What is the best way to get a plain text string from an HTML string?

public string GetPlainText(string htmlString)
{
    // any .NET built in utility?
}

Thanks in advance

@slandau: I want to output readable text from an HTML input. I'm not sure if something additional to remove the tags...
– Daniel Peñalba, Commented May 3, 2011 at 13:52

Alex K. · Answer · 2011-05-03 14:59:58Z

46

You can use MSHTML, which can be pretty forgiving with malformed markup:

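// Requires a reference to Microsoft.mshtml.dll, then: using mshtml;
HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
// The input is deliberately malformed (unclosed <i>, unknown <xxx> tag)
htmldoc2.write(new object[] { "<p>Plateau <i>of<i> <b>Leng</b><hr /><b erp=\"arp\">2 sugars please</b> <xxx>what? &amp; who?" });

// outerText returns the text content with tags removed and entities decoded
string txt = htmldoc2.body.outerText;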

Plateau of Leng 2 sugars please what? & who?

answered May 3, 2011 at 14:59

Alex K.


6 Comments

Sinan ILYAS Over a year ago

Works like a charm! Should be the accepted answer. Note that you need to add reference to Microsoft.mshtml.dll first.

gilad905 Over a year ago

Are you sure this method is safe with HTML from untrusted sources? Does HTMLDocument.Write() execute passed scripts?

Special Sauce Over a year ago

This answer is far more robust than the accepted answer (that uses just simple regex to remove tags) and is probably necessary for pages with any reasonable complexity.

Special Sauce Over a year ago

@giladmayani You could use the accepted answer here stackoverflow.com/a/19414886/1911540 to strip out any <script> tags (and their inner contents) before using this method. This would also lead to downstream string operation performance increases as the Javascript contents can be quite lengthy.

gilad905 Over a year ago

@SpecialSauce that is true, but don't forget that technically, JavaScript can exist not only in <script> tags but also, e.g., as attribute values: <button onclick="someJavascript();">
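The pre-filtering step discussed in these comments could look roughly like the sketch below. It is not the code from the linked answer; the names `ScriptStripper` and `RemoveScriptBlocks` are made up for illustration, and the pattern is an assumption about what "strip out any <script> tags (and their inner contents)" means. As the last comment points out, it leaves inline handlers such as `onclick` untouched, so it should not be treated as a sanitizer for untrusted HTML.

```csharp
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Matches an opening <script ...> tag, its contents, and the closing tag.
    // Singleline lets '.' span line breaks; IgnoreCase also covers <SCRIPT>.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>.*?</script\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: removes whole <script> elements, nothing else.
    public static string RemoveScriptBlocks(string html)
    {
        return ScriptBlock.Replace(html ?? string.Empty, string.Empty);
    }
}
```

For example, `ScriptStripper.RemoveScriptBlocks("<p>Hi</p><script>alert('x');</script><p>there</p>")` gives `<p>Hi</p><p>there</p>`, while `<button onclick="someJavascript();">Go</button>` passes through unchanged.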


Rudi Visser · Accepted Answer · 2011-05-03 14:02:26Z

25

There's no built-in utility as far as I know, but depending on your requirements you could use regular expressions to strip out all of the tags:

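// using System.Text.RegularExpressions;
string htmlString = @"<p>I'm HTML!</p>";
// Remove everything between '<' and '>' (including newlines inside a tag)
string plainText = Regex.Replace(htmlString, @"<(.|\n)*?>", "");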

edited May 3, 2011 at 14:02

answered May 3, 2011 at 13:48

Rudi Visser


3 Comments

Andrey Over a year ago

check this epic question stackoverflow.com/questions/1732348/…

Rudi Visser Over a year ago

@Andrey Haha that's a pretty awesome accepted answer. Luckily the OP didn't state exact requirements nor define the HTML string so this should catch most actual HTML cases, rather than XHTML.

miroxlav Over a year ago

Regex alone doesn't necessarily yield the final result. You need to convert at least &lt;, &gt; and &amp;. If your text contains other HTML character entities like &scaron; (š), you need to decode all of them as well.

user1641172 · Answer · 2015-08-17 15:37:52Z

1

Personally, I found a combination of regex and HttpUtility to be the best and shortest solution.

' Requires System.Web (HttpUtility) and System.Text.RegularExpressions (Regex)
Return HttpUtility.HtmlDecode(Regex.Replace(HtmlString, "<(.|\n)*?>", ""))

This removes all the tags, and then decodes any remaining entities such as &lt; or &gt;.
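Since the question asks for C# and the snippet above appears to be VB.NET, here is a minimal C# sketch of the same two-step idea using the question's `GetPlainText` signature. The wrapper class `HtmlText` is a hypothetical name, and `System.Net.WebUtility.HtmlDecode` is swapped in for `HttpUtility.HtmlDecode` only so the example does not need a System.Web reference. It inherits every limitation of the regex approach noted in this thread.

```csharp
using System;
using System.Net;                      // WebUtility
using System.Text.RegularExpressions;  // Regex

public static class HtmlText
{
    // Same signature as in the question, made static for convenience.
    public static string GetPlainText(string htmlString)
    {
        if (string.IsNullOrEmpty(htmlString))
            return string.Empty;

        // Step 1: strip anything that looks like a tag (same pattern as above).
        string withoutTags = Regex.Replace(htmlString, @"<(.|\n)*?>", "");

        // Step 2: decode entities such as &amp; and &lt; into plain characters.
        // Decoding after stripping keeps a decoded '<' from being read as a tag.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

For example, `HtmlText.GetPlainText("<p>Fish &amp; <b>chips</b></p>")` returns `Fish & chips`.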

answered Aug 17, 2015 at 15:37

user1641172

Comments

Erick Petrucelli · Answer · 2011-05-03 13:53:36Z

0

There isn't a built-in .NET method to do it. But, as @rudi_visser pointed out, it can be done with regular expressions.

If you need to do more than just remove the tags (e.g., turn &acirc; into â), you can use a more elaborate solution, like the one found here.

answered May 3, 2011 at 13:53

Erick Petrucelli
