I am using regex to parse HTML, but some articles say that HtmlAgilityPack is much easier. The big question for me is how to parse the HTML for this sample (Twitter):

This is the HTML code:

<p class="js-tweet-text tweet-text"> What an awesome day! Adventure nanaman kahapon <a href="http" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr"><s>#</s><b><strong>ondoy</strong></b></a> <a href="https://twitter.com/search?q=%23eurotel&src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr"><s>#</s><b>eurotel</b></a> <a href="https://twitter.com/search?q=%23retail&src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr"><s>#</s><b>retail</b></a> <a href="https://twitter.com/search?q=%23family&src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr"><s>#</s><b>family</b></a></p>

and I want it to output like this:

"What an awesome day! Adventure nanaman kahapon #ondoy #eurotel #retail #family"

How do I parse that HTML code? I am using regex now, but the output still contains the inner markup, such as the <a> tags and their href attributes.

This is my regex code:

    WebClient web = new WebClient();
    string html = web.DownloadString(filename);

    MatchCollection m1 = Regex.Matches(html, "<p class=\"js-tweet-text tweet-text\">\\s*(.+?)\\s*</p>", RegexOptions.Singleline);
    foreach (Match m in m1)
    {
        MessageBox.Show(m.Groups[1].Value);
    }
  • You should use a combination of HTML Agility Pack and regex: HTML Agility Pack to get the HTML, and regex to parse the data you need from within it, such as the tags above. Commented Jul 21, 2013 at 11:02

1 Answer

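    // Requires the HtmlAgilityPack package (using HtmlAgilityPack;).
    HtmlWeb web = new HtmlWeb();
    var doc = web.Load(@"link your HTML page");

    // SelectSingleNode returns null when nothing matches, so the check below is safe.
    // (SelectNodes also returns null in that case, which would make FirstOrDefault() throw.)
    var node = doc.DocumentNode.SelectSingleNode("//p[@class='js-tweet-text tweet-text']");

    if (node != null)
    {
        Console.WriteLine(node.InnerText);
    }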

I just tested it myself, and it prints:

What an awesome day! Adventure nanaman kahapon #ondoy #eurotel #retail #family

Note that an actual Twitter page contains multiple tweets, so you will need to modify the code above to handle all of them. But this should give you a good idea of how to use it.
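For a page with several tweets, one possible modification is to loop over every matching node instead of taking only the first. The following is a minimal sketch, not a definitive implementation: the class name `TweetTextExtractor` and the method `ExtractTweetTexts` are made up for illustration, and it assumes each tweet's `class` attribute is exactly `js-tweet-text tweet-text`, as in the sample HTML from the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, same library as above

public static class TweetTextExtractor
{
    // Hypothetical helper: returns the plain text of every tweet paragraph in the given HTML.
    public static List<string> ExtractTweetTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//p[@class='js-tweet-text tweet-text']");
        if (nodes == null)
        {
            return texts;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText drops the tags; DeEntitize decodes entities such as &amp;.
            texts.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }

        return texts;
    }
}
```

The `html` string can come from `WebClient.DownloadString`, as in the question; each entry of the returned list can then be written out with `Console.WriteLine` or shown in a `MessageBox`.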

1 Comment

They probably have to leave it open for some period of time based on a SO limitation for new users.
