What is the best way to get a plain text string from an HTML string?
public string GetPlainText(string htmlString)
{
// any .NET built in utility?
}
Thanks in advance
You can use MSHTML, which can be pretty forgiving:
// requires a reference to Microsoft.mshtml; using mshtml;
HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
htmldoc2.write(new object[] { "<p>Plateau <i>of<i> <b>Leng</b><hr /><b erp=\"arp\">2 sugars please</b> <xxx>what? & who?" });
string txt = htmldoc2.body.outerText;
Plateau of Leng 2 sugars please what? & who?
Note that you need to add a reference to Microsoft.mshtml.dll first.
It is worth removing <script> tags (and their inner contents) before using this method. This would also speed up downstream string operations, as the JavaScript contents can be quite lengthy.
Keep in mind that JavaScript can also appear outside <script> tags, e.g. as attribute values: <button onclick="someJavascript();">
There's no built-in utility as far as I know, but depending on your requirements you could use regular expressions to strip out all of the tags:
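// using System.Text.RegularExpressions;
string htmlString = @"<p>I'm HTML!</p>";
// Regex.Replace returns a new string, so keep the result
string plainText = Regex.Replace(htmlString, @"<(.|\n)*?>", ""); // "I'm HTML!"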
You also need to decode &lt;, &gt; and &amp;. If your text contains other HTML character entities like &scaron; (š), you need to decode all of them as well.
There isn't a built-in .NET method to do it. But, as pointed out by @rudi_visser, it can be done with regular expressions.
If you need to remove more than just the tags (e.g., turn &acirc; into â), you can use a more elaborate solution, like the one found here.
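To tie the suggestions in this thread together, here is a minimal sketch that drops <script>/<style> blocks, strips the remaining tags with a regular expression, and then decodes character entities. The class name HtmlText and the two regex fields are made up for illustration; System.Net.WebUtility.HtmlDecode is the framework method assumed for entity decoding (available from .NET 4.0 on). It is not a real HTML parser, so an attribute value containing a literal > will trip it up.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of .NET: a regex-based approximation only.
public static class HtmlText
{
    // Matches <script>...</script> and <style>...</style> including their contents.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag; [^>] also covers newlines inside a tag.
    private static readonly Regex AnyTag = new Regex(@"<[^>]*>");

    public static string GetPlainText(string htmlString)
    {
        if (string.IsNullOrEmpty(htmlString)) return string.Empty;

        // 1. Remove script/style blocks so their contents do not leak into the text.
        string text = ScriptOrStyle.Replace(htmlString, string.Empty);

        // 2. Strip the remaining tags (inline handlers such as onclick go with them).
        text = AnyTag.Replace(text, string.Empty);

        // 3. Decode entities last, so that &lt;b&gt; is not mistaken for a tag.
        text = WebUtility.HtmlDecode(text);

        return text.Trim();
    }
}
```

With this sketch, HtmlText.GetPlainText("<p>I'm HTML!</p>") returns "I'm HTML!". Decoding after stripping is a deliberate choice: text that was escaped in the source stays as visible text instead of being removed as markup.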