Hey all, I need some help working out a regular expression that extracts the values inside the tags of HTML markup like this:

<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span> 

I only need 1993, R, 2.8 and 94% from that HTML above.

Any help would be great as I don't have much knowledge when it comes to forming one of these things.

  • I'd suggest not using regex for a task like this. Read this question on HTML parsing in .NET. Commented Apr 4, 2011 at 12:23
  • possible duplicate of RegEx match open tags except XHTML self-contained tags Commented Apr 4, 2011 at 12:23
  • @Matt Ball - How is it a duplicate? Commented Apr 4, 2011 at 12:24
  • @Kobi it's just the archetypal "Don't use regex to parse (X)HTML" question on SO. Commented Apr 4, 2011 at 12:30

2 Answers

3

Don't use a regular expression to parse HTML. Use an HTML parser. There is a good one here.

3

If you already have the HTML in a string:

string html = @"
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
";

Or you can load a page directly from the web with the HTML Agility Pack's HtmlWeb class (saves you from 5 lines of streams and requests):

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.rottentomatoes.com/m/source_code/");

Using the HTML Agility Pack:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");

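SelectNodes returns null when nothing matches, so check for that first. After that you can iterate over the nodes, or simply get the text of each one (this needs using System.Linq):

List<string> texts = spans.Select(span => span.InnerText).ToList();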

Alternatively, you can search for the node you're after:

HtmlNode nodeReleaseYear = doc.DocumentNode
                              .SelectSingleNode("//span[@class='releaseYear']");
string year = nodeReleaseYear.InnerText;
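
To pull out all four values from the question in one go, something like the following might work. It is only a sketch: MovieInfoExtractor and GetSpanText are names made up for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced. The XPath matches a whole class token rather than the full attribute, because the Rotten Tomatoes span carries two classes, so @class='rt-fresh' alone would not match it.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

// Hypothetical helper class, named here only for illustration.
public static class MovieInfoExtractor
{
    // Returns the trimmed text of the first <span> whose class list contains
    // cssClass as a whole token, or null when no such span exists.
    public static string GetSpanText(HtmlDocument doc, string cssClass)
    {
        // Padding the class attribute with spaces lets "rt-fresh" match
        // class="rt-fresh-small rt-fresh" without matching "rt-fresh-small".
        string xpath =
            "//span[contains(concat(' ', normalize-space(@class), ' '), ' "
            + cssClass + " ')]";

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);

        // SelectSingleNode returns null when nothing matches.
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    public static void Main()
    {
        string html = @"
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (string cssClass in new[] { "releaseYear", "mpaa", "average-rating", "rt-fresh" })
        {
            string text = GetSpanText(doc, cssClass) ?? "(not found)";
            Console.WriteLine(cssClass + ": " + text);
        }
    }
}
```

With the sample markup this should give 1993, R, 2.8 and 94%. Returning null for a missing span keeps the caller in charge of what to do when the page layout changes.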

5 Comments

The code is C#, but it should be easy enough to convert to VB.Net.
How do I get it working? I put HtmlDocument doc = new HtmlDocument() but it is underlined, saying "'HtmlDocument' is a type and cannot be used as an expression" and "Name 'doc' is not declared".
@StealthRT, have you added a reference to HtmlAgilityPack in your project?
@tster. Yes, I have Imports HtmlAgilityPack
@StealthRT - I'm not sure what the VB code should look like. I'd guess Dim doc As New HtmlDocument(). Maybe this will help: stackoverflow.com/questions/516811/…
