PHP: DOMDocument: Remove Unwanted Text from a Nested Element

Question

I have the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<header level="2">My Header</header>
<ul>
    <li>Bulleted style text
        <ul>
            <li>
                <paragraph>1.Sub Bulleted style text</paragraph>
            </li>
        </ul>
    </li>
</ul>
<ul>
    <li>Bulleted style text <strong>bold</strong>
        <ul>
            <li>
                <paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>
            </li>
        </ul>
    </li>
</ul>

I need to remove the numbers preceding the sub-bulleted text: 1. and 2. in the given example.

This is the code I have so far:

<?php
class MyDocumentImporter
{
    const AWKWARD_BULLET_REGEX = '/(^[\s]?[\d]+[\.]{1})/i';

    protected $xml_string = '<some_tag><header level="2">My Header</header><ul><li>Bulleted style text<ul><li><paragraph>1.Sub Bulleted style text</paragraph></li></ul></li></ul><ul><li>Bulleted style text <strong>bold</strong><ul><li><paragraph>2.Sub Bulleted <strong>bold</strong></paragraph></li></ul></li></ul></some_tag>';

    protected $dom;

    public function processListsText( $loop = null ){

        $this->dom = new DomDocument('1.0', 'UTF-8');

        $this->dom->loadXML($this->xml_string);

        if(!$loop){
            //get all the li tags
            $li_set = $this->dom->getElementsByTagName('li');
        }
        else{
            $li_set = $loop;
        }

        foreach($li_set as $li){

            //check for child nodes
            if(! $li->hasChildNodes() ){
                continue;
            }

            foreach($li->childNodes as $child){
                if( $child->hasChildNodes() ){
                    //this li has children, maybe a <strong> tag
                    $this->processListsText( $child->childNodes );
                }
                if( ! ( $child instanceof DOMElement ) ){
                    continue;
                }
                if( ( $child->localName != 'paragraph') ||  ( $child instanceof DOMText )){
                    continue;
                }
                if( preg_match(self::AWKWARD_BULLET_REGEX, $child->textContent) == 0 ){
                    continue;
                }

                $clean_content = preg_replace(self::AWKWARD_BULLET_REGEX, '', $child->textContent);

                //set node to empty
                $child->nodeValue = '';

                //add updated content to node
                $child->appendChild($child->ownerDocument->createTextNode($clean_content));

                //$xml_output = $child->parentNode->ownerDocument->saveXML($child);
                //var_dump($xml_output);

            }
        }
    }
}

$importer = new MyDocumentImporter();
$importer->processListsText();

The issue I can see is that $child->textContent returns the plain text content of the node, and strips the additional child tags. So:

<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>

becomes

<paragraph>Sub Bulleted bold</paragraph>

The <strong> tag is no more.

I'm a little stumped... Can anyone see a way to strip the unwanted characters, and retain the "inner child" <strong> tag?

The tag may not always be <strong>, it could also be a hyperlink <a href="#">, or <emphasize>.

@Jack: his formatted example doesn't, his inline code example does. – Wrikken, commented May 21, 2013 at 17:10

Ja͢ck · Accepted Answer · 2013-05-21 17:16:28Z

3

Assuming your XML actually parses, you could use XPath to make your queries a lot easier:

$xp = new DOMXPath($this->dom);

foreach ($xp->query('//li/paragraph') as $para) {
    // Escape the dot so it matches a literal "." rather than any character
    $para->firstChild->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $para->firstChild->nodeValue);
}

It does the text replacement on the first text node instead of the whole tag contents.
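
As a rough end-to-end sketch of that idea, the snippet below runs the same query against a trimmed copy of the question's sample XML. The `instanceof DOMText` guard and the `$result` variable are additions for illustration, on the assumption that some paragraphs might be empty or begin with an element rather than text.

```php
<?php
// Sample input taken from the question, wrapped in a root element so it parses.
$xml = '<some_tag><ul><li>Bulleted style text <strong>bold</strong><ul><li>'
     . '<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>'
     . '</li></ul></li></ul></some_tag>';

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadXML($xml);

$xp = new DOMXPath($dom);

foreach ($xp->query('//li/paragraph') as $para) {
    $first = $para->firstChild;

    // Assumption: skip paragraphs that are empty or start with an element.
    if (!($first instanceof DOMText)) {
        continue;
    }

    // Only the leading text node is rewritten; sibling elements stay in place.
    $first->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $first->nodeValue);
}

// Serialize just the paragraph to inspect the result.
$result = $dom->saveXML($dom->getElementsByTagName('paragraph')->item(0));
echo $result, "\n";
// prints: <paragraph>Sub Bulleted <strong>bold</strong></paragraph>
```

Because the `<strong>` element is a sibling of the text node and not part of it, the same approach should carry over to `<a>` or `<emphasize>` children.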


1 Comment

gArn Over a year ago

Thanks for the help, the code I posted is just one part of a multi-faceted document processing class. Hopefully with the xpath tip I'll be able to clean up a lot of the code!

Wrikken · Accepted Answer · 2013-05-21 17:07:17Z

1

You are resetting its whole content, but what you want is to alter only the first text node (keep in mind that text nodes are nodes too). You might want to query the XPath //li/paragraph/text()[position()=1] and work on or replace that DOMText node instead of the whole paragraph content.

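// Assumes this runs inside the class that defines AWKWARD_BULLET_REGEX
$d = new DOMDocument();
$d->loadXML($xml);
$p = new DOMXPath($d);
foreach ($p->query('//li/paragraph/text()[position()=1]') as $text) {
    // Swap the first text node for a cleaned copy; sibling elements are untouched
    $clean = preg_replace(self::AWKWARD_BULLET_REGEX, '', $text->textContent);
    $text->parentNode->replaceChild(new DOMText($clean), $text);
}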
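
To make the "text nodes are nodes too" point concrete, here is a small sketch that lists the direct children of one paragraph from the question. The `$kinds` array is just an illustrative name; the snippet only shows that the leading text and the `<strong>` element are separate sibling nodes.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>');

$kinds = [];
foreach ($dom->documentElement->childNodes as $child) {
    // DOMText holds character data; DOMElement represents tags such as <strong>.
    $kinds[] = get_class($child) . ' (' . $child->nodeName . '): ' . $child->nodeValue;
}

print_r($kinds);
// [0] => DOMText (#text): 2.Sub Bulleted
// [1] => DOMElement (strong): bold
```

Editing only the first entry leaves the second one, and any markup inside it, as it was.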
