How to parse HTML using HTML Agility pack

Question

I am using regex to parse HTML, but some articles say that HtmlAgilityPack is much easier. The big question for me is how to parse the HTML in this sample (from Twitter):

This is the HTML code:

<p class="js-tweet-text tweet-text"> What an awesome day! Adventure nanaman kahapon <a href="http" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr"><s>#</s><b><strong>ondoy</strong></b></a> <a href="https://twitter.com/search?q=%23eurotel&src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr"><s>#</s><b>eurotel</b></a> <a href="https://twitter.com/search?q=%23retail&src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr"><s>#</s><b>retail</b></a> <a href="https://twitter.com/search?q=%23family&src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr"><s>#</s><b>family</b></a></p>

I want the output to look like this:

"What an awesome day! Adventure nanaman kahapon #ondoy #eurotel #retail #family"

How do I parse that HTML? I am using regex now, but its output still includes the inner tags and attributes such as href.

This is my regex code:

    WebClient web = new WebClient();
    string html = web.DownloadString(filename);

    MatchCollection m1 = Regex.Matches(html, "<p class=\"js-tweet-text tweet-text\">\\s*(.+?)\\s*</p>", RegexOptions.Singleline);
    foreach (Match m in m1)
    {
        MessageBox.Show(m.Groups[1].Value);
    }

You should use a combination of HTML Agility Pack and regex: HTML Agility Pack to extract the HTML elements, and regex to parse the data you need within them, such as the tags above.
– Ehsan, commented Jul 21, 2013 at 11:02

Thousand · Accepted Answer · 2013-07-21 11:17:42Z

5

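    // Requires the HtmlAgilityPack NuGet package (using HtmlAgilityPack;).
    HtmlWeb web = new HtmlWeb();
    var doc = web.Load(@"link your HTML page");

    // SelectNodes returns null when nothing matches, so chaining
    // FirstOrDefault() on it would throw; SelectSingleNode returns null instead.
    var node = doc.DocumentNode.SelectSingleNode("//p[@class='js-tweet-text tweet-text']");

    if (node != null)
    {
        Console.WriteLine(node.InnerText);
    }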

I just tested it myself, and this prints:

What an awesome day! Adventure nanaman kahapon #ondoy #eurotel #retail #family

Note that an actual Twitter page contains multiple tweets, so you will need to modify the code above to handle all of them (for example, by iterating over every matching node instead of taking only the first). But this should give you a good idea of how to use it.
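As a rough sketch of that modification, the snippet below loops over every matching paragraph instead of taking only the first. It is an illustration under assumptions rather than tested production code: the inline `html` string is a made-up stand-in for a downloaded page, the class name `TweetTextDemo` is hypothetical, and the `Trim()` and `HtmlEntity.DeEntitize` calls are optional cleanup that the original answer did not use.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class TweetTextDemo
{
    static void Main()
    {
        // Hypothetical inline sample standing in for a downloaded page.
        string html =
            "<div>" +
            "<p class=\"js-tweet-text tweet-text\"> First tweet <a href=\"#\"><s>#</s><b>one</b></a></p>" +
            "<p class=\"js-tweet-text tweet-text\"> Tom &amp; Jerry <a href=\"#\"><s>#</s><b>two</b></a></p>" +
            "</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//p[@class='js-tweet-text tweet-text']");
        if (nodes == null)
        {
            Console.WriteLine("No tweets found.");
            return;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText drops the tags but leaves entities such as &amp; encoded,
            // so decode them and trim the surrounding whitespace.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            Console.WriteLine(text);
        }
    }
}
```

To run it against a live page, the `LoadHtml` call could be swapped for `new HtmlWeb().Load(url)` as in the answer above. The XPath matches the `class` attribute as an exact string, so it will miss paragraphs whose class list differs in order or content.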



1 Comment

Ro Yo Mi Over a year ago

They probably have to leave it open for some period of time based on a SO limitation for new users.
