How to clean HTML tags using C#

Question

For example:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>title</title>
</head>
<body>
    <a href="aaa.asp?id=1"> I want to get this text </a>
    <div>
        <h1>this is my want!!</h1>
        <b>this is my want!!!</b>
    </div>
</body>
</html>

and the result is:

 I want to get this text 
this is my want!!
this is my want!!!

He basically wants to serialize the HTML it looks like... just strip all markup out and only be left with the data.
– DigitalZebra, Commented Jun 24, 2009 at 13:58
Not really relevant to the question, but something you should definitely know: closing tags have a "/" in them. For example, "<h1>this is my want!!</h1>" - note the "</h1>".
– Samir Talwar, Commented Jun 24, 2009 at 14:02
@Samir - I believe the point here is also to cater for malformed html - not just xhtml.
– Marc Gravell, Commented Jun 24, 2009 at 14:05
Use this link for your question stackoverflow.com/questions/19523913/…
– dang.khoa.1989.2010, Commented Dec 22, 2014 at 9:24

Marc Gravell · Accepted Answer · 2009-06-24 13:54:02Z

31

HTML Agility Pack:

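    // Requires the HtmlAgilityPack package: using HtmlAgilityPack;
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    // SelectSingleNode returns null when the markup has no <body> element.
    HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
    string s = (body ?? doc.DocumentNode).InnerText;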

answered Jun 24, 2009 at 13:54

Marc Gravell

6 Comments

Ahmy Over a year ago

HtmlDocument has no constructors and doesn't contain a LoadHtml() method or a DocumentNode property. I am trying this code in VC2010. Can you help me, please?

Marc Gravell Over a year ago

@Ahmy are you sure you are using the agility pack?

Ahmy Over a year ago

Oh! Excuse me, sir, I didn't include the Agility Pack reference. Thanks, Marc.

Ahmy Over a year ago

I have another problem when applying your code: the ampersand (&), &nbsp;, &gt;, and &lt; entities are still present in the result and give me incorrect text. How can I eliminate these characters?

Ahmy Over a year ago

Mr. Marc, I have recently been banned from asking questions. When I read about the error message, I found that I had not broken any of the rules that lead to a ban. How can I ask about my problem? Is there an acceptable way?
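
The earlier comment about leftover &, &nbsp;, &gt; and &lt; is not answered in this thread. InnerText returns the text with character references still encoded, so one possible follow-up step is to decode them with the framework's System.Net.WebUtility.HtmlDecode. The sketch below is only an illustration: the class and method names are made up for this example, and replacing the non-breaking space with a plain space is an assumption about what the asker wants. HTML Agility Pack also has an HtmlEntity.DeEntitize method that could be tried instead.

```csharp
using System.Net;

// Hypothetical helper, not part of HTML Agility Pack or the original answer.
public static class HtmlTextHelper
{
    // Decodes character references such as &amp;, &lt;, &gt; and &nbsp;
    // in text whose markup has already been removed.
    public static string DecodeEntities(string text)
    {
        string decoded = WebUtility.HtmlDecode(text);
        // &nbsp; decodes to U+00A0 (non-breaking space); map it to a plain space.
        return decoded.Replace('\u00A0', ' ');
    }
}
```

It would be applied to the extracted text, for example HtmlTextHelper.DecodeEntities(s), where s is the InnerText value from the answer above.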

Eric J. · Accepted Answer · 2013-03-05 01:09:26Z

16

Use this function...

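// Requires: using System.Text.RegularExpressions;
public string Strip(string text)
{
    // Lazily matches everything from '<' to the next '>', including newlines.
    return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}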

edited Mar 5, 2013 at 1:09

Eric J.

answered Dec 14, 2009 at 15:43

diegodsp

3 Comments

ChrisF Over a year ago

A better regex is <[^>]*> as the ? in that one makes it quite slow.

Mark E. Haase Over a year ago

Ick, this question is repeated a lot across SO, and this same bad answer is repeated a lot, too. As I already said in another identical post: "You shouldn't use a regular expression to parse a context-free grammar like HTML. If the HTML is being provided by some external entity, then it can be easily manipulated to evade your regular expression."

diegodsp Over a year ago

@mehaase, I used this approach in some code until then (Mar 2013), but now I use the "HtmlAgilityPack" library.
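
For reference, the pattern ChrisF suggests in the comments can be dropped into the same kind of function. This is a minimal sketch, assuming the input is simple, trusted markup; the class name and the choice to cache a compiled Regex are illustrative and not from the original answer.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the <[^>]*> pattern from the comments.
public static class TagStripper
{
    // Matches '<', then any run of characters other than '>', then '>'.
    // The negated class also matches newlines, so multi-line tags are covered.
    private static readonly Regex TagPattern =
        new Regex(@"<[^>]*>", RegexOptions.Compiled);

    public static string Strip(string text)
    {
        return TagPattern.Replace(text, string.Empty);
    }
}
```

The objection raised in the comments still applies: a '>' inside a quoted attribute value ends the match early, so this is not a substitute for a real HTML parser on untrusted input.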

Ólafur Waage · Accepted Answer · 2009-06-24 13:44:34Z

1

I would recommend using something like HTMLTidy.

Here's a tutorial on it to get you started.

answered Jun 24, 2009 at 13:44

Ólafur Waage

rahul · Accepted Answer · 2009-06-24 13:50:58Z

0

Why do you want to do it on the server side?

For that, you have to give the container element runat="server" and then read the innerText of the element.

You can do the same in JavaScript without making the element runat="server".

answered Jun 24, 2009 at 13:50

rahul

1 Comment

guaike Over a year ago

I am developing a news system, and I would like to extract part of the news content as a summary to display on the home page.

Andrew Marsh · Accepted Answer · 2009-06-24 15:04:51Z

0

If you just want to remove the html tags then use a regular expression that deletes anything between "<" and ">".

answered Jun 24, 2009 at 15:04

Andrew Marsh

1 Comment

guaike Over a year ago

I am a bit worried that the regex will be slow.

James Lawruk · Accepted Answer · 2013-11-18 16:16:32Z

0

You can start with the simple function below. Disclaimer: this code is suitable for basic HTML, but it will not handle all valid HTML situations and edge cases. A '<' or '>' inside a quoted attribute value is one example. The advantage of this code is that you can easily follow the execution in a debugger, and it can be easily modified to fit the edge cases specific to you.

// Requires: using System.Text;
public static string RemoveTags(string html)
{
    // StringBuilder avoids allocating a new string for every appended character.
    StringBuilder result = new StringBuilder(html.Length);
    bool insideTag = false;
    foreach (char c in html)
    {
        if (c == '<')
            insideTag = true;
        if (!insideTag)
            result.Append(c);
        if (c == '>')
            insideTag = false;
    }
    return result.ToString();
}

edited Nov 18, 2013 at 16:16

answered May 25, 2010 at 16:54

James Lawruk

3 Comments

Mark E. Haase Over a year ago

This is basically just an unrolled version of the regex answer above, and as such it's not any more robust. This would easily be thrown off, for example, by a quoted attribute that contains a ">", not to mention a pathological case like the one here: stackoverflow.com/questions/5175840/….

Annie Over a year ago

Will it pass <div title="x<4>" id="vectorizer"> text here <img class="foo"> text there</div>?

James Lawruk Over a year ago

@Annie Unfortunately it will not work with tags contained within quotes as is. You could modify it to catch those types of edge cases.
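
One way the modification mentioned above might look is to track whether the scanner is inside a quoted attribute value and to ignore '<' and '>' there. This is a sketch only, not the answer author's code; the class name is hypothetical. It still assumes every '<' starts a tag, does not treat comments, script or style blocks specially, and does not decode entities.

```csharp
using System.Text;

// Hypothetical quote-aware variant of the RemoveTags loop above.
public static class QuoteAwareTagRemover
{
    public static string RemoveTags(string html)
    {
        var result = new StringBuilder(html.Length);
        bool insideTag = false;
        char quote = '\0';   // open quote character inside a tag, or '\0' if none

        foreach (char c in html)
        {
            if (!insideTag)
            {
                if (c == '<') insideTag = true;   // a tag starts here
                else result.Append(c);            // ordinary text is kept
            }
            else if (quote != '\0')
            {
                // Inside a quoted attribute value: only the matching quote ends it.
                if (c == quote) quote = '\0';
            }
            else if (c == '"' || c == '\'')
            {
                quote = c;                        // attribute value opens
            }
            else if (c == '>')
            {
                insideTag = false;                // tag ends outside any quotes
            }
        }
        return result.ToString();
    }
}
```

With Annie's input, <div title="x<4>" id="vectorizer"> text here <img class="foo"> text there</div>, this version returns " text here  text there", because the "x<4>" attribute value no longer ends the tag early.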
