Regex: Optional HTML tags in HTML?

Question

I need to parse some values from HTML. I'm using the following regex to capture some groups, but I'm having difficulty when there are optional tags in the middle of the HTML. I need a rule that pulls the values out of each repeated block of the HTML page, even when the optional tags are present.

 onclick="return raise('SelectFare', new SelectFareEventArgs(1, 3, 'F'))" required="true" requiredError="Please select a flight and fare in every market."></td><td>Regular Fare</td><td>Adult<br></td><td align="right" style="font-size:110%;">91.99 EUR<br><div style="font-style: italic; font-size: 10px;">Only<span style="color: red;"> 4 </span>seats left at this fare</div></td><td></td><td><b>Fri</b>30 Sep 11<br><b>Flight</b>FR 818</td><td>15:10 Depart<br>16:15 Arrive</td></tr><tr id="1_2011_8_30_23_45_00"><td><div class="planeImg1" title="Click to select this fare on this flight"></div></td><td><input

For example, the optional <div style="font-style: italic; font-size: 10px;">Only<span style="color: red;"> 4 </span>seats left at this fare</div> section of this is messing it up.

tr><tr id="1_2011_9_21_16_05_00"><td><div class="planeImg1" title="Click to select this fare on this flight"></div></td><td><input id="AvailabilityInputFRSelectView_RadioButtonMkt1Fare2" type="radio" name="AvailabilityInputFRSelectView$market1" value="H~HDIS1~XXXC~~RoundFrom|FR~ 816~ ~~DUB~10/21/2011 14:55~EDI~10/21/2011 16:05" onclick="return raise('SelectFare', new SelectFareEventArgs(1, 2, 'H'))" required="true" requiredError="Please select a flight and fare in every market."></td><td>No Taxes</td><td>Adult<br></td><td align="right" style="font-size:110%;"><strike style="color:#F00;font-size:80%;"><b style="color: #999;">22.99 EUR</b></strike>
                             (-35%)
                          <br>14.94 EUR<br></td><td></td><td><b>Fri</b>21 Oct 11<br><b>Flight</b>FR 816</td><td>14:55 Depart<br>16:05 Arrive</td></tr><tr id="1_2011_9_21_16_15_00"><td><div class="planeImg1" title="Click

The

<strike . . </strike>. . (-35%). . <br>14.94 EUR<br></td>

part of the HTML above is messing it up as well.

This is the regex I'm trying (and various other versions!!):

"Please select(?:.*?)<td>(.*?)</td><td>(.*?)<br></td><td align=\"right\" style=\"font-size:110%;\">(.*?)<br>(.*?)<br>(?:.*?)</b>(.*?)<br><b>Flight</b>(.*?)</td><td>(.*?)<br>(.*?)</td>"

I'd appreciate any help at all on this, or even a reference to learning how to parse out optional HTML tags altogether.

Thanks.

Here's the inevitable reference to stackoverflow.com/questions/1732348/… – Shawn Chin, Commented Sep 29, 2011 at 16:43

Community · Accepted Answer · 2017-05-23 09:59:38Z

0

You can't parse (X)HTML with RegEx, so don't do it. You need to use a proper parser that will build you a Document Object Model (DOM). As you have tagged your question with JavaScript, I recommend that you use jQuery to build an object graph of your HTML, simply like this:

var $document = $(html);

This $document object can now be operated on with methods like $document.find() to dig out the elements you want from the HTML.
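As a rough illustration of that approach applied to the fare rows in the question, here is a minimal sketch. It assumes that fare rows are `tr` elements with an `id` and eight `td` cells, that the fare type, passenger type and price sit in the third, fourth and fifth cells, and that a discounted price is listed after the struck-through one. All of that is inferred from the two sample rows only. The function names `findPrices`, `currentPrice` and `extractFares` are made up for this example. The key point is that once the DOM gives you the text of a cell, optional tags such as `<strike>`, `<div>` or `<span>` no longer get in the way, and only a tiny pattern on plain text is needed.

```javascript
// Sketch only: the cell positions and the 'tr[id]' selector are assumptions
// read off the two sample rows in the question, not a documented page layout.

// Collect every "<amount> <3-letter currency>" pair in a cell's text, in order.
function findPrices(cellText) {
  const prices = [];
  const re = /(\d+(?:\.\d+)?)\s*([A-Z]{3})/g;
  let m;
  while ((m = re.exec(cellText)) !== null) {
    prices.push({ amount: parseFloat(m[1]), currency: m[2] });
  }
  return prices;
}

// Assumption: when a struck-through original price is shown, it comes first
// and the price actually charged comes last.
function currentPrice(cellText) {
  const prices = findPrices(cellText);
  return prices.length > 0 ? prices[prices.length - 1] : null;
}

// Browser side. `$` is jQuery (1.8+ for $.parseHTML); `html` is the page markup.
function extractFares($, html) {
  // Wrap the parsed nodes so .find() can reach rows at any depth.
  const $root = $('<div>').append($.parseHTML(html));
  return $root.find('tr[id]').map(function () {
    const $cells = $(this).children('td');
    if ($cells.length < 8) return null; // skip rows that are not fare rows
    return {
      rowId: this.id,
      fareType: $cells.eq(2).text().trim(),
      passenger: $cells.eq(3).text().trim(),
      price: currentPrice($cells.eq(4).text()),
    };
  }).get();
}
```

In a browser with jQuery loaded, `extractFares(jQuery, html)` should return one object per fare row. For the discounted row in the question, `price` should be the 14.94 EUR figure rather than the struck-through 22.99 EUR, and the "seats left" note in the other row is simply ignored.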

edited May 23, 2017 at 9:59 by Community Bot

answered Sep 29, 2011 at 16:52 by Asbjørn Ulsberg


3 Comments

Rob W Over a year ago

See my profile. I am developing a JavaScript function that sanitises the HTML string using a RE and parses the HTML using a home-made function. I expect to release it soon (see these answers: HTML parser and HTML sanitiser using RE). Once my functions are finished, I will merge them into the function in the first question.

Asbjørn Ulsberg Over a year ago

@RobW, you can't sanitize HTML with RegEx. HTML is not a regular, predictable language. It's an irregular and unpredictable language with so many moving parts and so many security aspects that it is impossible to write a RegEx that handles the millions of edge cases that exist. Some of them will be exploited to inject an XSS attack that makes it through your sanitizer, no matter how good it is.

Rob W Over a year ago

The main threats are scripts and external resources, which I'm going to filter. I'm currently porting my (not yet published) relative-to-absolute URL converter to a sanitiser that filters external resources.
