2

I need to take a string of html text like:

<p>This is a line with no spans<br>
This is a line <span class="second">This is secondary</span><br>  
This is another line <span class="third">And this is third</span> <span class="four">this is four</span></p>

And have it end up as an array in PHP like:

array(
    "This is a line with no spans",
    array(
      "This is a line",
      "second" => "This is secondary",
    ),
    array(
      "This is another line",
      "third" => "And this is third",
      "four" => "this is four"
    )
);

Getting each line into its own value was easy: I just split the text on <br>, and that works fine. What I can't quite get is splitting each line so that the span text is keyed by its class name. I feel like PHP's preg_split may hold the key, but I kind of suck with regular expressions and I can't figure it out.

Any ideas?

3 Answers

3

You should not attempt to parse HTML with regex or other string-manipulation tricks. It gets complicated quickly and leaves you with terrible maintenance problems.

I highly recommend you look into how to read a chunk of markup into a DOM document [docs] and then use DOM methods to work with it just like you would browser side.
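As a rough illustration of that first step, here is a minimal sketch that loads a fragment with PHP's DOMDocument and lists what kind of node each child of the paragraph is. The sample markup and the variable names ($html, $summary) are my own assumptions for the example, not something from the question's actual code.

```php
<?php
// Assumption: $html is a shortened version of the fragment from the question.
$html = '<p>This is a line with no spans<br>'
      . 'This is a line <span class="second">This is secondary</span></p>';

$doc = new DOMDocument();
// loadHTML() wraps the fragment in <html><body>, so we look the <p> up by tag name.
$doc->loadHTML($html);
$p = $doc->getElementsByTagName('p')->item(0);

// Record the type of each direct child: text nodes and elements such as <br> or <span>.
$summary = array();
foreach ($p->childNodes as $node) {
    if ($node instanceof DOMText) {
        $summary[] = 'text';
    } elseif ($node instanceof DOMElement) {
        $summary[] = 'element:' . strtolower($node->tagName);
    }
}

print_r($summary);
```

Once the markup is in a DOMDocument, the text, the <br> elements, and the spans are all separate child nodes, so no splitting on strings is needed.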


3 Comments

I've been using DomDocument to get to the point of getting the p tags, but I couldn't figure out a way to get it to split on the line breaks without it becoming text.
I wouldn't split on line breaks. Walk the nodes, checking their type and name (Do I have a text node? Do I have a BR element?), and make decisions with that info.
I could have sworn I had tried that and it didn't work, but it did this time. Thanks man!
1

It's not a good idea to use regular expressions to parse HTML (cite). It's just not a suitable tool; see @JAAulde's answer.

The best way is to do it purely with the DOM. Loop through all child nodes (including text nodes) to format the array the way you want. Like this:

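// $p must be the <p> DOMElement, fetched from a DOM document (see @JAAulde's answer).
$lines = array();
$line = array(); // current line; lives outside the loop so it survives between nodes
$pChildren = $p->childNodes;
for ($i = 0; $i < $pChildren->length; $i++) {
    $child = $pChildren->item($i);
    if ($child instanceof DOMText) {
        // Text outside any span; skip whitespace-only nodes.
        $text = trim($child->wholeText);
        if ($text !== '') {
            $line[] = $text;
        }
    } elseif ($child instanceof DOMElement) {
        if (strtolower($child->tagName) == 'br') {
            // A <br> ends the current line.
            $lines[] = $line;
            $line = array();
        } elseif (strtolower($child->tagName) == 'span' && $child->hasAttribute('class')) {
            // Key the span's text by its class name.
            $line[$child->getAttribute('class')] = $child->nodeValue;
        }
    }
}
// The last line has no trailing <br>, so store it after the loop.
if ($line) {
    $lines[] = $line;
}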

Warning: treat the above as pseudo-code. It has not been tested at all; I'm just going from experience and the manual.
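For anyone who wants something to paste and run, here is a self-contained sketch of the same node-walking idea that also loads the markup and returns the shape asked for in the question (a bare string for a line with no spans, an array otherwise). The function name paragraphToLines and the $flush closure are hypothetical names I made up for illustration, and it assumes the first <p> in the input is the one you want.

```php
<?php
// Hypothetical helper (name is illustrative): turns the first <p> in $html
// into the nested array described in the question.
function paragraphToLines($html)
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $p = $doc->getElementsByTagName('p')->item(0);

    $lines = array();
    $line = array();

    // Store the finished line: a bare string if it holds only text, else the array.
    $flush = function () use (&$lines, &$line) {
        if ($line) {
            $lines[] = (count($line) === 1 && isset($line[0])) ? $line[0] : $line;
        }
        $line = array();
    };

    foreach ($p->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Skip whitespace-only text such as the newline after a <br>.
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $line[] = $text;
            }
        } elseif ($child instanceof DOMElement) {
            $tag = strtolower($child->tagName);
            if ($tag === 'br') {
                $flush();
            } elseif ($tag === 'span' && $child->hasAttribute('class')) {
                $line[$child->getAttribute('class')] = trim($child->nodeValue);
            }
        }
    }
    $flush(); // the last line has no trailing <br>

    return $lines;
}

$html = <<<HTML
<p>This is a line with no spans<br>
This is a line <span class="second">This is secondary</span><br>
This is another line <span class="third">And this is third</span> <span class="four">this is four</span></p>
HTML;

print_r(paragraphToLines($html));
```

A span with several classes would be keyed by the whole class string here, so adjust that if your real markup needs it.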

3 Comments

I just finished writing this and came back and saw your answer. Almost identical.
For those who come along later with the same question, I do not dispute this being the correct answer. However it is important to point out that the missing step to get from what the OP has to what was accepted as an answer was the reading in of the markup to a PHP DOM Document. See my answer for links to docs on that.
@JAAulde: excellent point, I'll allude to that and refer to your answer.
1

Maybe you can use an XML parser? Here's the doc.
