I need to parse some values from HTML. I'm using the regex below to capture some groups, but it breaks when optional tags appear in the middle of the HTML. I need a rule that pulls the values out of each repeated block of the page, even when the optional tags are present.

 onclick="return raise('SelectFare', new SelectFareEventArgs(1, 3, 'F'))" required="true" requiredError="Please select a flight and fare in every market."></td><td>Regular Fare</td><td>Adult<br></td><td align="right" style="font-size:110%;">91.99 EUR<br><div style="font-style: italic; font-size: 10px;">Only<span style="color: red;"> 4 </span>seats left at this fare</div></td><td></td><td><b>Fri</b>30 Sep 11<br><b>Flight</b>FR 818</td><td>15:10 Depart<br>16:15 Arrive</td></tr><tr id="1_2011_8_30_23_45_00"><td><div class="planeImg1" title="Click to select this fare on this flight"></div></td><td><input

For example, the optional <div style="font-style: italic; font-size: 10px;">Only<span style="color: red;"> 4 </span>seats left at this fare</div> section in the snippet above breaks the match.

tr><tr id="1_2011_9_21_16_05_00"><td><div class="planeImg1" title="Click to select this fare on this flight"></div></td><td><input id="AvailabilityInputFRSelectView_RadioButtonMkt1Fare2" type="radio" name="AvailabilityInputFRSelectView$market1" value="H~HDIS1~XXXC~~RoundFrom|FR~ 816~ ~~DUB~10/21/2011 14:55~EDI~10/21/2011 16:05" onclick="return raise('SelectFare', new SelectFareEventArgs(1, 2, 'H'))" required="true" requiredError="Please select a flight and fare in every market."></td><td>No Taxes</td><td>Adult<br></td><td align="right" style="font-size:110%;"><strike style="color:#F00;font-size:80%;"><b style="color: #999;">22.99 EUR</b></strike>
                             (-35%)
                          <br>14.94 EUR<br></td><td></td><td><b>Fri</b>21 Oct 11<br><b>Flight</b>FR 816</td><td>14:55 Depart<br>16:05 Arrive</td></tr><tr id="1_2011_9_21_16_15_00"><td><div class="planeImg1" title="Click

The

<strike . . </strike>. . (-35%). . <br>14.94 EUR<br></td>

part of the HTML above is messing it up as well.

This is the regex I'm trying (along with various other versions):

"Please select(?:.*?)<td>(.*?)</td><td>(.*?)<br></td><td align=\"right\" style=\"font-size:110%;\">(.*?)<br>(.*?)<br>(?:.*?)</b>(.*?)<br><b>Flight</b>(.*?)</td><td>(.*?)<br>(.*?)</td>"

I'd appreciate any help at all on this, or even a reference to learning how to parse out optional HTML tags altogether.

Thanks.

  • I wouldn't use regex to parse HTML; use an HTML/XML parser. Commented Sep 29, 2011 at 16:41
  • Here's the inevitable reference to stackoverflow.com/questions/1732348/… Commented Sep 29, 2011 at 16:43

1 Answer

You can't reliably parse (X)HTML with a regular expression, so don't try. Use a proper parser that builds a Document Object Model (DOM) for you. Since your question is tagged JavaScript, I recommend using jQuery to build an object graph of your HTML, like this:

var $document = $(html);

You can then call methods such as $document.find() on this $document object to dig out the elements you want from the HTML. Note that .find() searches descendants only; use .filter() to match the top-level elements themselves.
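As a rough sketch of how this could look for the fare rows in the question: once the DOM gives you the text of each cell, the optional <div>, <span> and <strike> tags no longer matter, and small patterns on the plain text are enough. The names parseFareRow and extractFares are hypothetical, the cell order is assumed from the two samples in the question, and extractFares assumes html starts with a complete <table> element so that the rows survive parsing.

```javascript
// Hypothetical helper: turns the text of one fare row's <td> cells into a record.
// Assumed cell order (taken from the question's samples):
// [icon, radio input, fare type, passenger, price, (empty), date/flight, times]
function parseFareRow(cellTexts) {
  if (!cellTexts || cellTexts.length < 8) {
    return null; // not a fare row (for example, a header row)
  }
  // Collapse the line breaks and indentation left over from the markup.
  const [, , fareType, passenger, priceCell, , flightCell, timeCell] =
    cellTexts.map((t) => t.replace(/\s+/g, ' ').trim());

  // A discounted row lists the struck-through price first, so keep the last one.
  const prices = priceCell.match(/\d+(?:\.\d+)?\s*[A-Z]{3}/g) || [];
  const seats = priceCell.match(/Only\s*(\d+)\s*seats left/);
  const date = flightCell.match(/\d{1,2} [A-Za-z]{3} \d{2}/);
  const flight = flightCell.match(/Flight\s*([A-Z]{2} ?\d+)/);
  const depart = timeCell.match(/(\d{1,2}:\d{2})\s*Depart/);
  const arrive = timeCell.match(/(\d{1,2}:\d{2})\s*Arrive/);

  return {
    fareType: fareType,
    passenger: passenger,
    price: prices.length ? prices[prices.length - 1] : null,
    seatsLeft: seats ? Number(seats[1]) : null, // null when the notice is absent
    date: date ? date[0] : null,
    flight: flight ? flight[1] : null,
    depart: depart ? depart[1] : null,
    arrive: arrive ? arrive[1] : null,
  };
}

// DOM side, not exercised here: `$` is jQuery, `html` is assumed to start with
// a complete <table> element. jQuery's .map() drops the null results.
function extractFares($, html) {
  return $(html).find('tr').map(function () {
    const cells = $(this).children('td').map(function () {
      return $(this).text();
    }).get();
    return parseFareRow(cells);
  }).get();
}
```

The text-level patterns are deliberately loose. If the page layout changes (a different column order or currency format), they would need adjusting, but the optional markup inside a cell should not affect them.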


3 Comments

See my profile. I am developing a JavaScript function that sanitises the HTML string using a RE and parses the HTML using a home-made function. I expect to release it soon (see these answers: HTML parser and HTML sanitiser using RE). Once my functions are finished, I will merge them into the function at the first question.
@RobW, you can't sanitize HTML with RegEx. HTML is not a regular, predictable language. It is irregular and unpredictable, with so many moving parts and security aspects that no RegEx can handle the millions of edge cases that exist. One of them will be exploited to inject an XSS attack that makes it through your sanitizer, no matter how good it is.
The main threats are scripts and external resources, which I'm going to filter. I'm currently porting my (not yet published) relative-to-absolute URL converter to a sanitiser that filters external resources.
