24

For example:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>title</title>
</head>
<body>
    <a href="aaa.asp?id=1"> I want to get this text </a>
    <div>
        <h1>this is my want!!</h1>
        <b>this is my want!!!</b>
    </div>
</body>
</html>

and the result is:

 I want to get this text 
this is my want!!
this is my want!!!
  • He basically wants to serialize the HTML, it looks like... just strip all markup out and only be left with the data. Commented Jun 24, 2009 at 13:58
  • Not really relevant to the question, but something you should definitely know: closing tags have a "/" in them. For example, "<h1>this is my want!!</h1>" - note the "</h1>". Commented Jun 24, 2009 at 14:02
  • @Samir - I believe the point here is also to cater for malformed html - not just xhtml. Commented Jun 24, 2009 at 14:05
  • @Marc Gravell - Yes, that's the point. Commented Jun 24, 2009 at 15:16
  • Use this link for your question stackoverflow.com/questions/19523913/… Commented Dec 22, 2014 at 9:24

6 Answers

31

HTML Agility Pack:

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    string s = doc.DocumentNode.SelectSingleNode("//body").InnerText;

6 Comments

HtmlDocument has no constructors and doesn't contain a LoadHtml() method or a DocumentNode property... I am trying this code in VC2010; can you help me please?
@Ahmy are you sure you are using the agility pack?
Ohhhhhhh! Excuse me sir, I didn't include the Agility Pack reference... thanks Marc
I have another problem when applying your code: the ampersand (&), &nbsp;, &gt;, and &lt; entities are still there and give me incorrect text. How can I eliminate characters like these?
Mr. Marc, I have recently been banned from asking questions, and when I read about this error message I found that I hadn't broken any of the ban rules. How can I ask about my problem? Is there an ethical way?
16

Use this function...

public string Strip(string text)
{
    return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}
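
For reference, here is a self-contained sketch of the same idea, not a definitive implementation. It swaps in the simpler pattern `<[^>]*>` and then decodes entities such as `&amp;` with `WebUtility.HtmlDecode` from `System.Net`. The names `HtmlText` and `TagPattern` are made up for this example. It is still a regex, so a ">" inside a quoted attribute value will cut a tag short.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is an assumption for this sketch.
public static class HtmlText
{
    // "<", then any run of characters other than ">" (newlines included), then ">".
    private static readonly Regex TagPattern = new Regex("<[^>]*>", RegexOptions.Compiled);

    public static string Strip(string html)
    {
        // Remove everything that looks like a tag.
        string withoutTags = TagPattern.Replace(html, string.Empty);

        // Turn entities such as &amp; and &lt; back into plain characters.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

With this sketch, `HtmlText.Strip("<b>Tom &amp; Jerry</b>")` returns `Tom & Jerry`.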

3 Comments

A better regex is <[^>]*> as the ? in that one makes it quite slow.
Ick, this question is repeated a lot across SO, and this same bad answer is repeated a lot, too. As I already said in another identical post: "You shouldn't use a regular expression to parse a context-free grammar like HTML. If the HTML is being provided by some external entity, then it can be easily manipulated to evade your regular expression."
@mehaase, I used this parsing approach in some code until now (Mar 2013). But currently I use the "HtmlAgilityPack" library.
1

I would recommend using something like HTMLTidy.

Here's a tutorial on it to get you started.


0

Why do you want to make it server side?

For that you have to make the container element runat="server" and then take the innerText of the element.

You can do the same in JavaScript without making the element runat="server".

1 Comment

I am developing a news system, and I would like to extract part of some news content to display as a summary on the home page.
0

If you just want to remove the html tags then use a regular expression that deletes anything between "<" and ">".

1 Comment

I am a bit worried that the regex is so slow.
0

You can start with the simple function below. Disclaimer: this code is suitable for basic HTML, but it will not handle all valid HTML situations and edge cases; tags within quotes are one example. The advantage of this code is that you can easily follow the execution in a debugger, and it can easily be modified to fit edge cases specific to you.

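// Requires: using System.Text;
public static string RemoveTags(string html)
{
    // StringBuilder avoids allocating a new string for every appended character.
    var result = new StringBuilder(html.Length);
    bool insideTag = false;
    for (int i = 0; i < html.Length; ++i)
    {
        char c = html[i];
        if (c == '<')
            insideTag = true;      // start of a tag: stop copying
        if (!insideTag)
            result.Append(c);      // plain text: keep it
        if (c == '>')
            insideTag = false;     // end of a tag: resume copying
    }
    return result.ToString();
}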
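
As one example of modifying the loop for the quoted-attribute edge case mentioned in the disclaimer, here is a minimal sketch. It assumes that a quote character inside a tag always opens an attribute value that is closed by the same quote character. The names `TagStripper` and `RemoveTagsQuoteAware` are made up for this example. It still does not understand comments, script blocks, or a stray "<" in ordinary text.

```csharp
using System;
using System.Text;

// Hypothetical class and method names, used only for this sketch.
public static class TagStripper
{
    public static string RemoveTagsQuoteAware(string html)
    {
        var result = new StringBuilder(html.Length);
        bool insideTag = false;
        char quote = '\0'; // '\0' means "not inside a quoted attribute value"

        foreach (char c in html)
        {
            if (!insideTag)
            {
                if (c == '<')
                    insideTag = true;   // start of a tag: stop copying
                else
                    result.Append(c);   // plain text: keep it
            }
            else if (quote != '\0')
            {
                // Inside a quoted value: only the matching quote ends it,
                // so "<" and ">" in here are ignored.
                if (c == quote)
                    quote = '\0';
            }
            else if (c == '"' || c == '\'')
            {
                quote = c;              // a quoted attribute value begins
            }
            else if (c == '>')
            {
                insideTag = false;      // end of the tag: resume copying
            }
        }

        return result.ToString();
    }
}
```

For the input `<div title="x<4>" id="vectorizer"> text here <img class="foo"> text there</div>`, this sketch returns ` text here  text there`, with two spaces where the `<img>` tag was removed.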

3 Comments

This is basically just an unrolled version of the regex answer above, and as such it's not any more robust. This would easily be thrown off, for example, by a quoted attribute that contains a ">", not to mention a pathological case like the one here: stackoverflow.com/questions/5175840/….
Will it pass <div title="x<4>" id="vectorizer"> text here <img class="foo"> text there</div>?
@Annie Unfortunately it will not work with tags contained within quotes as is. You could modify it to catch those types of edge cases.
