41

How can I remove HTML tags from the following string?

<P style="MARGIN: 0cm 0cm 10pt" class=MsoNormal><SPAN style="LINE-HEIGHT: 115%; 
FONT-FAMILY: 'Verdana','sans-serif'; COLOR: #333333; FONT-SIZE: 9pt">In an 
email sent just three days before the Deepwater Horizon exploded, the onshore 
<SPAN style="mso-bidi-font-weight: bold"><b>BP</b></SPAN> manager in charge of 
the drilling rig warned his supervisor that last-minute procedural changes were 
creating "chaos". April emails were given to government investigators by <SPAN 
style="mso-bidi-font-weight: bold"><b>BP</b></SPAN> and reviewed by The Wall 
Street Journal and are the most direct evidence yet that workers on the rig 
were unhappy with the numerous changes, and had voiced their concerns to <SPAN 
style="mso-bidi-font-weight: bold"><b>BP</b></SPAN>’s operations managers in 
Houston. This raises further questions about whether <SPAN 
style="mso-bidi-font-weight: bold"><b>BP</b></SPAN> managers properly 
considered the consequences of changes they ordered on the rig, an issue 
investigators say contributed to the disaster.</SPAN></p><br/>  

I'm writing it to Aspose.PDF, but the HTML tags are shown in the PDF. How can I remove them?

  • I tried HTMLDecode; it didn't work. Commented Feb 2, 2011 at 18:43
  • You need to HTML encode to escape the tags. Commented Feb 2, 2011 at 18:44
  • Do you want to strip the tags or apply the formatting? Commented Feb 2, 2011 at 18:46
  • dotnetperls.com/remove-html-tags Commented Jun 18, 2012 at 15:47

2 Answers

105

Warning: This does not work for all cases and should not be used to process untrusted user input.

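using System.Text.RegularExpressions;
...
// Lazily matches everything from a '<' up to the next '>'.
const string HTML_TAG_PATTERN = "<.*?>";

static string StripHTML(string inputString)
{
    // RegexOptions.Singleline lets '.' match newlines, so tags that wrap
    // across lines (as in the question's sample) are removed as well.
    return Regex.Replace(
        inputString, HTML_TAG_PATTERN, string.Empty, RegexOptions.Singleline);
}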

11 Comments

-1 You shouldn't use a regular expression to parse a context-free grammar like HTML. If the HTML is being provided by some external entity, then it can be easily manipulated to evade your regular expression.
public static string StripTagsCharArray(string source)
{
    char[] array = new char[source.Length];
    int arrayIndex = 0;
    bool inside = false;
    for (int i = 0; i < source.Length; i++)
    {
        char let = source[i];
        if (let == '<') { inside = true; continue; }
        if (let == '>') { inside = false; continue; }
        if (!inside) { array[arrayIndex] = let; arrayIndex++; }
    }
    return new string(array, 0, arrayIndex);
}
It is about 8x faster than Regex.
@capdragon Furthermore, people extrapolate from examples they see on SO. Eventually somebody will read this and try to rewrite it to just remove <script> tags, and they will not realize that it's especially unsuitable for XSS prevention (since it can easily be tricked). On SO, I believe solutions should be written for the general audience reading them, not just for the single person who asked the question. (Otherwise, why post the question and answer publicly in the first place?)
If you want valid HTML5, how about <p data-foo=">">Bar</script>? But keep in mind that some people will use your code to process HTML of unknown provenance, and that HTML is not guaranteed to be valid! I would support your answer if you prefaced it with, "Warning: This does not work for all cases and should not be used to process untrusted user input." I suspect you have 58 upvotes because 58 people (living and dead) on the planet either aren't aware of or don't mind the test cases in which your solution is incorrect.
@mehaase fair enough. I made the changes, thanks.
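
To make the limitation discussed in these comments concrete, here is a minimal sketch that runs the answer's pattern against the attribute case mentioned above. The class name `StripTagsDemo`, the method name `StripHtml`, and the sample strings are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripTagsDemo
{
    // The pattern from the answer above.
    private const string HtmlTagPattern = "<.*?>";

    public static string StripHtml(string input)
    {
        // Singleline is added here so that '.' also matches line breaks.
        return Regex.Replace(input, HtmlTagPattern, string.Empty, RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Simple markup: every tag is removed.
        Console.WriteLine(StripHtml("<p>Hello <b>BP</b></p>"));    // prints: Hello BP

        // A '>' inside an attribute value ends the match early,
        // so the tail of the opening tag leaks into the output.
        Console.WriteLine(StripHtml("<p data-foo=\">\">Bar</p>")); // prints: ">Bar
    }
}
```

The second call shows why the warning matters: the regex has no notion of quoted attribute values, so it treats the first `>` it sees as the end of the tag.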
10

You should use the HTML Agility Pack:

HtmlDocument doc = ...
string text = doc.DocumentNode.InnerText;
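
For readers who have not used the library, a complete version of this idea might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced. The class name, the `ToPlainText` helper, and the sample input are hypothetical. Note that the Agility Pack exposes the root of the parsed tree as `DocumentNode`.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class AgilityPackDemo
{
    // Hypothetical helper: parse an HTML fragment and return its text content.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // the parser is tolerant of malformed markup

        // InnerText concatenates the text nodes but leaves entities such as
        // &amp; encoded, so decode them afterwards.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        Console.WriteLine(ToPlainText("<p>Fish &amp; <b>chips</b></p>"));
    }
}
```

Because the markup goes through a real parser, a `>` inside a quoted attribute value should not confuse it the way it confuses a regular expression.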

3 Comments

I really don't see why people give the answer to use the Agility Pack, since .InnerText of the body (as an example) doesn't render a markup-free string. There are plenty of people on SO who get the Agility Pack then wonder why they're still staring at markup, script tags.
Seemed to work for me pretty well. Certainly more elegant than any of the solutions above.
This solution just removes the wrapping HTML tags; it does not guarantee that all markup will be removed.
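
One way to address the concern raised in these comments is to drop `script` and `style` elements, along with HTML comments, before reading `InnerText`. The following is a sketch under that assumption, again requiring the HtmlAgilityPack NuGet package. The class name and the `VisibleText` helper are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class VisibleTextDemo
{
    // Hypothetical helper: return the text a reader would see, skipping
    // script bodies, style rules, and HTML comments.
    public static string VisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the unwanted nodes first (ToList), then remove them,
        // so the tree is not modified while it is being enumerated.
        var hidden = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script"
                     || n.Name == "style"
                     || n.NodeType == HtmlNodeType.Comment)
            .ToList();

        foreach (var node in hidden)
        {
            node.Remove();
        }

        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        Console.WriteLine(VisibleText(
            "<p>Hello</p><script>alert('x');</script><style>p{color:red}</style>"));
    }
}
```

This is still not a sanitizer: it only extracts text, and the result should be encoded again before it is written into any HTML context.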
