Regex Html Tricky

Question

I have this regex, but it's not working, perhaps because of the newlines. My goal is to extract each passenger's name and phone number.

Here is a snippet of the data I have. The real input contains 100 blocks like the one below:

<div class="booking-section">
    <h4>Passenger Details</h4>
    <p>
        <b>Passenger Name:</b><br />
        Ms Wendy Walker-hunter
    </p>

    <p>
        <b>Mobile Number:</b><br />
        161525961468
    </p>

I'm currently just trying to get the passenger's name first...

$re = '/(?<=Name)(.*)(?=Mobile)/s';
preg_match($re, $str, $matches);

// Print the entire match result
print_r($matches);

Any kind of help I can get on this is greatly appreciated!

You should use a DOM parser to extract this data. You can target each .booking-section element, and take the passenger name from the first <p> tag and the mobile number from the second. Then you can strip out the <b> and its contents, and the <br />. Don't use regex for this.
– scrowler, Commented Feb 20, 2017 at 23:14

miken32 · Accepted Answer · 2017-09-13 18:53:12Z

1

Never parse HTML with a regular expression. Here's how you should be doing this sort of thing:

$html = '<div class="booking-section">
    <h4>Passenger Details</h4>
    <p>
        <b>Passenger Name:</b><br />
        Ms Wendy Walker-hunter
    </p>

    <p>
        <b>Mobile Number:</b><br />
        161525961468
    </p>
</div>
<div class="booking-section">
    <h4>Passenger Details</h4>
    <p>
        <b>Passenger Name:</b><br />
        Mr John Walker
    </p>

    <p>
        <b>Mobile Number:</b><br />
        16153682486
    </p>
</div>
';
// Collect libxml warnings internally instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Non-whitespace text nodes of the first <p> in each booking section
$results = $xpath->query("//div[@class='booking-section']/p[1]/text()[normalize-space()]");
foreach ($results as $node) {
    echo trim($node->textContent) . "\n";
}

This uses an XPath query to get the nodes you're looking for:

//div[@class='booking-section']/p[1]/text()[normalize-space()]

This tells it to select the non-whitespace text nodes from the first <p> element inside each <div> whose class attribute is exactly "booking-section".

According to the documentation for DOMDocument::loadHTML():
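Since the question asks for both the name and the phone number, the same idea can be extended to read two paragraphs per section. The following is only a sketch: it assumes every booking section keeps the name in its first <p> and the number in its second, as in the sample, and the helper name extractPassengers is hypothetical.

```php
<?php
// Hypothetical helper: returns a list of ['name' => ..., 'phone' => ...] arrays.
// Assumption: the first <p> holds the name and the second <p> the mobile number.
function extractPassengers(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $passengers = [];
    foreach ($xpath->query("//div[@class='booking-section']") as $section) {
        // string(...) yields the text of the first matching node, or '' if there is none.
        $name  = $xpath->evaluate("string(p[1]/text()[normalize-space()])", $section);
        $phone = $xpath->evaluate("string(p[2]/text()[normalize-space()])", $section);
        $passengers[] = ['name' => trim($name), 'phone' => trim($phone)];
    }
    return $passengers;
}

$html = '<div class="booking-section">
    <h4>Passenger Details</h4>
    <p>
        <b>Passenger Name:</b><br />
        Ms Wendy Walker-hunter
    </p>
    <p>
        <b>Mobile Number:</b><br />
        161525961468
    </p>
</div>
<div class="booking-section">
    <h4>Passenger Details</h4>
    <p>
        <b>Passenger Name:</b><br />
        Mr John Walker
    </p>
    <p>
        <b>Mobile Number:</b><br />
        16153682486
    </p>
</div>';

$passengers = extractPassengers($html);
foreach ($passengers as $p) {
    echo $p['name'], ' / ', $p['phone'], "\n";
}
```

Querying relative to each section keeps a name paired with its own number even if one section is missing a paragraph; in that case the missing field comes back as an empty string.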

this function may generate E_WARNING errors when it encounters bad markup. libxml's error handling functions may be used to handle these errors.

I've enabled libxml's internal error handling for this example, to suppress any warnings about the HTML, though of course you should not be outputting warnings to users anyway.
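If you would rather see what the parser complained about than discard it, the collected errors can be read back. This is a small sketch rather than part of the original answer: the sample markup is made up to be sloppy, and sending the messages to error_log is just one possible destination.

```php
<?php
// Collect parser warnings internally instead of emitting E_WARNING.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Made-up, deliberately sloppy markup (stray </span>) that the parser may complain about.
$dom->loadHTML('<div class="booking-section"><p>Ms Wendy Walker-hunter</span></div>');

// Each entry is a LibXMLError object with level, code, line and message properties.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    error_log(sprintf('libxml (line %d): %s', $error->line, trim($error->message)));
}

// Empty the buffer so messages do not pile up across several loads.
libxml_clear_errors();
```

Clearing the buffer matters when many documents are parsed in one run, for example 100 booking pages in a loop.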

edited Sep 13, 2017 at 18:53

answered Feb 21, 2017 at 0:18


2 Comments

thevoipman Over a year ago

thanks for this but i'm getting nasty errors Warning: DOMDocument::loadHTML(): Misplaced DOCTYPE declaration in Entity, line:

miken32 Over a year ago

The code as provided works fine for me. Are you trying it with the HTML above, or with the full HTML document?

Vasil Anagnostos · Accepted Answer · 2017-02-21 00:45:07Z

0

This should work if the snippets are always formatted like the example, because it relies on the newlines:

$t = '
<div class="booking-section">
  <h4>Passenger Details</h4>
  <p>
    <b>Passenger Name:</b><br />
    Ms Wendy Walker-hunter
  </p>
  <p>
    <b>Mobile Number:</b><br />
    161525961468
  </p>
</div>';

// Skip the rest of the label line, then capture the whole next line
preg_match('/Passenger Name:[^\r\n]+\r?\n([^\r\n]+)\r?\n/', $t, $name);

preg_match('/Mobile Number:[^\r\n]+\r?\n([^\r\n]+)\r?\n/', $t, $phone);

echo trim($name[1]), ' / ', trim($phone[1]);

Output is: Ms Wendy Walker-hunter / 161525961468

Same with preg_match_all:

$t = '
<div class="booking-section">
  <h4>Passenger Details</h4>
  <p>
    <b>Passenger Name:</b><br />
    Ms Wendy Walker-hunter
  </p>
  <p>
    <b>Mobile Number:</b><br />
    161525961468
  </p>
</div>
<div class="booking-section">
  <h4>Passenger Details</h4>
  <p>
    <b>Passenger Name:</b><br />
    Ms Wendy Walker-hunter 2
  </p>
  <p>
    <b>Mobile Number:</b><br />
    161525961468 2
  </p>
</div>
<div class="booking-section">
  <h4>Passenger Details</h4>
  <p>
    <b>Passenger Name:</b><br />
    Ms Wendy Walker-hunter 3
  </p>
  <p>
    <b>Mobile Number:</b><br />
    161525961468 3
  </p>
</div>';

// Same patterns as before; preg_match_all collects every occurrence
preg_match_all('/Passenger Name:[^\r\n]+\r?\n([^\r\n]+)\r?\n/', $t, $name);

preg_match_all('/Mobile Number:[^\r\n]+\r?\n([^\r\n]+)\r?\n/', $t, $phone);

echo '<pre>';
print_r($name);
print_r($phone);
die;

Output is something like this (the [0] entries, which hold the full matches, are omitted):

Array
(
    [1] => Array
        (
            [0] =>     Ms Wendy Walker-hunter
            [1] =>     Ms Wendy Walker-hunter 2
            [2] =>     Ms Wendy Walker-hunter 3
        )

)
Array
(
    [1] => Array
        (
            [0] =>     161525961468
            [1] =>     161525961468 2
            [2] =>     161525961468 3
        )

)
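The captured values still carry their leading whitespace, and the names and numbers arrive in two separate arrays. One way to trim and pair them is sketched below. The $names and $phones arrays are stand-ins for $name[1] and $phone[1], and the sketch assumes both patterns matched the same number of times, in the same order.

```php
<?php
// Stand-ins for $name[1] and $phone[1] as returned by preg_match_all.
$names  = ['    Ms Wendy Walker-hunter', '    Ms Wendy Walker-hunter 2'];
$phones = ['    161525961468', '    161525961468 2'];

// array_map with a null callback zips the two lists index by index.
$pairs = array_map(null, array_map('trim', $names), array_map('trim', $phones));

foreach ($pairs as [$n, $p]) {
    echo $n, ' / ', $p, "\n";
}
```

If a section lacks one of the two fields, the arrays fall out of step and the pairs become wrong, which is the kind of fragility the comments on this answer point out.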

edited Feb 21, 2017 at 0:45

answered Feb 20, 2017 at 23:58


3 Comments

thevoipman Over a year ago

Right, but what if there is more than one listing?

miken32 Over a year ago

@thevoipman Or what if the whitespace doesn't match perfectly? That's one more reason why you shouldn't parse HTML with regular expressions.

Vasil Anagnostos Over a year ago

If it is not in a loop as you mentioned, you can use preg_match_all.
