I'm trying to parse blocks of text that contain HTML tags, but I'm running into a problem.

<?php
    libxml_use_internal_errors(true);
    $html = '
<html>
<body>
    <div>
        Message <b>bold</b>, <s>strike</s>
    </div>
    <div>
        <span class="how">
            <a href="link" title="text">Link</a>, <b> BOLD </b>
        </span>
    </div>
</body>
</html>
    ';

    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->strictErrorChecking = false;
    $dom->recover = true;
    $dom->loadHTML($html);        

    function getMessages($element, $xpath)
    {
        $messages = array();

        $children = $element->childNodes;        

        foreach ($children as $child) 
        { 

            if(strtolower($child->nodeName) == 'div')
            {
                // my functions
            }
            else
            if ($child->nodeType == XML_TEXT_NODE)
            {
                $text = trim(DOMinnerHTML($element));
                if($text)
                {
                    $messages[] = array('type' => 'text', 'text' => $text);
                }
            }
        }

        return $messages;
    }

    function DOMinnerHTML($element) 
    {
        $innerHTML = null; 
        $children = $element->childNodes;

        foreach ($children as $child) 
        {
            $tmp_dom = new DOMDocument(); 
            $tmp_dom->appendChild($tmp_dom->importNode($child, true)); 
            $innerHTML .= trim($tmp_dom->saveHTML()); 
        } 
        return $innerHTML; 
    } 

    $xpath = new DOMXPath($dom);
    $messagesXpath = $xpath->query("//div");

    $messages = array();
    $i = 0;
    foreach($messagesXpath as $message)
    {
        $messages[] = getMessages($message, $xpath);
        $i++;
        if ($i == 2)
        break;
    }

    var_dump($messages);  

This code returns the following array:

array(2) {
  [0]=>
  array(3) {
    [0]=>
    array(2) {
      ["type"]=>
      string(4) "text"
      ["text"]=>
      string(32) "Message<b>bold</b>,<s>strike</s>"
    }
    [1]=>
    array(2) {
      ["type"]=>
      string(4) "text"
      ["text"]=>
      string(32) "Message<b>bold</b>,<s>strike</s>"
    }
    [2]=>
    array(2) {
      ["type"]=>
      string(4) "text"
      ["text"]=>
      string(32) "Message<b>bold</b>,<s>strike</s>"
    }
  }
  [1]=>
  array(2) {
    [0]=>
    array(2) {
      ["type"]=>
      string(4) "text"
      ["text"]=>
      string(100) "<span class="how">
            <a href="link" title="text">Link</a>, <b> BOLD </b>

        </span>"
    }
    [1]=>
    array(2) {
      ["type"]=>
      string(4) "text"
      ["text"]=>
      string(100) "<span class="how">
            <a href="link" title="text">Link</a>, <b> BOLD </b>
        </span>"
    }
  }
}

I want each $messages entry's 'text' to keep its HTML tags (that part works), but for some reason the entries in the array are repeated.

I think the problem is in this block:

if ($child->nodeType == XML_TEXT_NODE)
{
    $text = trim(DOMinnerHTML($element));
    if($text)
    {
          $messages[] = array('type' => 'text', 'text' => $text);
    }
}
  • Is all that code above necessary for the question? Can you cut it down to the relevant portions please? Awful lot to wade through. Commented Jun 2, 2011 at 11:28
  • BTW, Why don't you use HEREDOC? Commented Jun 2, 2011 at 11:45
  • @SalmanPK, What does it matter to the question? Commented Jun 2, 2011 at 12:03

1 Answer

I think you are misunderstanding which elements are being iterated. You select all the <div>s and pass each one to getMessages. Inside getMessages you then iterate over the childNodes of that <div> and, for every child that is an XML_TEXT_NODE, store the inner HTML of the whole <div>. That is where the duplication comes from: one copy per text node.

Let's take the HTML:

<div>
    Message <b>bold</b>, <s>strike</s>
</div>

DOM elements and text nodes are logically different and have different types, XML_ELEMENT_NODE and XML_TEXT_NODE (see the DOM predefined constants in the PHP manual for the full list). The <div> therefore actually contains 5 children (TEXT, ELEMENT, TEXT, ELEMENT, TEXT). You were correct to identify the problematic if condition; however, simply changing the type to XML_ELEMENT_NODE does not completely fix the problem, because each <div> can still have multiple childNodes of type XML_ELEMENT_NODE.
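To make the five children visible, here is a minimal sketch that lists the type and name of each child of the <div>. The helper name describeChildren is hypothetical (it is not part of the original code), and the exact whitespace handling depends on the libxml version PHP is built against.

```php
<?php
// Hypothetical helper, not from the original post: returns one
// "TYPE:nodeName" string per direct child of the given node.
function describeChildren(DOMNode $node): array
{
    $names = [
        XML_ELEMENT_NODE => 'ELEMENT',
        XML_TEXT_NODE    => 'TEXT',
    ];
    $out = [];
    foreach ($node->childNodes as $child) {
        $type = $names[$child->nodeType] ?? 'OTHER';
        $out[] = $type . ':' . $child->nodeName;
    }
    return $out;
}

libxml_use_internal_errors(true);

$html = <<<HTML
<div>
    Message <b>bold</b>, <s>strike</s>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

// Take the first (and only) <div> and list its direct children.
$div = $dom->getElementsByTagName('div')->item(0);
$children = describeChildren($div);

print_r($children);
```

Each text node shows up as #text, so the original loop stored one copy of the <div>'s inner HTML for every TEXT entry in this list.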

To fully fix the problem, I changed what is passed to getMessages: it now receives the whole list of <div>s, so it iterates at the correct level and the duplication disappears. I also removed some complexity and improved readability by renaming some variables.

Here is my complete solution:

<?php
    libxml_use_internal_errors(true);
    $html = <<<HTML
<html>
<body>
    <div>
        Message <b>bold</b>, <s>strike</s>
    </div>
    <div>
        <span class="how">
            <a href="link" title="text">Link</a>, <b> BOLD </b>
        </span>
    </div>
</body>
</html>
HTML;

    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->strictErrorChecking = false;
    $dom->recover = true;
    $dom->loadHTML($html);

    function getMessages($allDivs) {
        $messages = array();

        foreach ($allDivs as $div)  {
            if ($div->nodeType == XML_ELEMENT_NODE) {
                $messages[] = trim(DOMinnerHTML($div));
            }
        }

        return $messages;
    }

    function DOMinnerHTML($element) {
        $innerHTML = null;
        $children = $element->childNodes;

        foreach ($children as $child) {
            $tmp_dom = new DOMDocument();
            $tmp_dom->appendChild($tmp_dom->importNode($child, true));
            $innerHTML .= trim($tmp_dom->saveHTML());
        }
        return $innerHTML;
    }

    $xpath = new DOMXPath($dom);
    $messagesXpath = $xpath->query("//div");

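    // getMessages() already returns an array, so assign it directly
    // instead of wrapping it in another array.
    $messages = getMessages($messagesXpath);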

    print_r($messages);
?>
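
As a possible variation, and not what the code above does, DOMDocument::saveHTML() accepts an optional node argument (PHP 5.3.6 and later), so the inner HTML can be built without a temporary DOMDocument per child. The function name innerHtml below is hypothetical. This sketch trims only the final string, so the spaces between inline elements are kept.

```php
<?php
// Hypothetical alternative to DOMinnerHTML(): serialize each child through
// the document that owns it instead of importing it into a temporary one.
function innerHtml(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes just that node and its subtree.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    // Trim once at the end so inner spacing is preserved.
    return trim($html);
}

libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<html><body><div>Message <b>bold</b>, <s>strike</s></div></body></html>');

$xpath = new DOMXPath($dom);

// One entry per <div>, iterating at the <div> level only.
$messages = [];
foreach ($xpath->query('//div') as $div) {
    $messages[] = innerHtml($div);
}

print_r($messages);
```

For this input the array should contain a single string, Message <b>bold</b>, <s>strike</s>, with no repeats.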