1

I am trying to extract some strings from the source code of a web page that looks like this:

<p class="someclass">
String1<br />
String2<br />
String3<br />
</p>

I'm pretty sure those strings are the only things that end with a single line break (<br />). Everything else ends with two or more line breaks. I tried using this:

preg_match_all('~(.*?)<br />{1}~', $source, $matches);

But it doesn't work the way I expected. It returns other text along with those strings.

2
  • @Jack : Nope. It's a complete mess. I only want the strings. It returns a whole lot more. Commented Jun 18, 2013 at 13:37
  • Don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. Commented Jun 18, 2013 at 15:10

4 Answers

3

DOMDocument and XPath to the rescue.

$html = <<<EOM
<p class="someclass">
String1<br />
String2<br />
String3<br />
</p>
EOM;

$doc = new DOMDocument;
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

foreach ($xp->query('//p[contains(concat(" ", @class, " "), " someclass ")]') as $node) {
    echo $node->textContent;
}
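
If the goal is an array of the individual strings rather than one block of text, a possible variation is to walk the text nodes inside each matching paragraph. This is only a sketch under assumptions: the `$strings` array and the trimming step are additions for illustration and are not part of the answer above.

```php
<?php
$html = <<<EOM
<p class="someclass">
String1<br />
String2<br />
String3<br />
</p>
EOM;

$doc = new DOMDocument;
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$strings = []; // hypothetical name: collects one entry per text node

foreach ($xp->query('//p[contains(concat(" ", @class, " "), " someclass ")]') as $node) {
    // Each <br /> splits the paragraph's text into separate child text nodes.
    foreach ($xp->query('text()', $node) as $textNode) {
        $value = trim($textNode->nodeValue);
        if ($value !== '') {
            // Skip whitespace-only nodes such as the newline before </p>.
            $strings[] = $value;
        }
    }
}

print_r($strings);
```

Because this relies on the DOM structure rather than on line endings, it should not matter whether each string sits on its own line in the source.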


2

I wouldn't recommend using a regular expression to get the values. Instead, use PHP's built-in HTML parser like this:

$dom = new DOMDocument();
$dom->loadHTML($source);
$xpath = new DOMXPath($dom);

$elements = $xpath->query('//p[@class="someclass"]');
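$text = array(); // to hold the text of each matching <p>
if ($elements !== false) { // query() returns false when the expression is invalid
    foreach ($elements as $element) {
        // nodeValue is already plain text, so strip_tags() is not needed here
        $text[] = $element->nodeValue;
    }
}
print_r($text); // print out the collected text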

This is tested and working. You can read more about PHP's DOMDocument class here: http://www.php.net/manual/en/book.dom.php

Here's a demonstration: http://phpfiddle.org/lite/code/0nv-hd6 (click 'Run')

-1

Try this:

preg_match_all('~^(.*?)<br />$~m', $source, $matches);

1 Comment

Not sure why it's still down voted, it should work as expected within its limited scope.
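
To make that limited scope concrete, here is a small sketch that runs the pattern on a made-up sample. The `<div>` line with a doubled `<br />` is a hypothetical addition, standing in for the "everything else" the question mentions.

```php
<?php
// Hypothetical sample: the target paragraph plus one line ending in a doubled <br />.
$source = <<<EOM
<p class="someclass">
String1<br />
String2<br />
String3<br />
</p>
<div>Other text<br /><br />
</div>
EOM;

// With the m flag, ^ and $ match at the start and end of each line.
preg_match_all('~^(.*?)<br />$~m', $source, $matches);

// $matches[1] holds the captured group for every match.
print_r($matches[1]);
```

On this sample the three strings are captured, but so is `<div>Other text<br />`: the lazy `.*?` is allowed to grow past the first `<br />` until the second one lines up with `$`. The pattern also assumes `\n` line endings, since `$` does not match before a `\r`.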
-1

This should work. Please try it:

preg_match_all("/([^<>]*?)<br\s*\/?>/", $source, $matches);

or if your strings may contain some HTML code, use this one:

preg_match_all("/(.*?)<br\s*\/?>\\n/", $source, $matches);
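
If lines that end in two or more `<br />` tags must be excluded, one possible tightening is to anchor the pattern to whole lines and forbid angle brackets in the captured text. This is a sketch under assumptions, not a general HTML solution: the sample input, the `$pattern` and `$strings` names, and the assumption that each wanted string sits alone on its line are all illustrative.

```php
<?php
// Hypothetical sample: the target paragraph plus one line ending in a doubled <br />.
$source = <<<EOM
<p class="someclass">
String1<br />
String2<br />
String3<br />
</p>
<div>Other text<br /><br />
</div>
EOM;

// ^ ... $ with the m flag ties the match to a single line.
// [^<>\r\n]+ keeps tags out of the capture, so a second <br /> cannot be absorbed.
// \r? tolerates Windows line endings.
$pattern = '~^([^<>\r\n]+)<br\s*/?>\r?$~m';

preg_match_all($pattern, $source, $matches);

// Trim stray whitespace around each captured string.
$strings = array_map('trim', $matches[1]);

print_r($strings);
```

On this sample only String1, String2 and String3 are captured. The trade-off is that strings containing inline markup are skipped, which is one reason the DOM-based answers are the more robust route.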
