I have the following xml document:

<?xml version="1.0" encoding="UTF-8"?>
<header level="2">My Header</header>
<ul>
    <li>Bulleted style text
        <ul>
            <li>
                <paragraph>1.Sub Bulleted style text</paragraph>
            </li>
        </ul>
    </li>
</ul>
<ul>
    <li>Bulleted style text <strong>bold</strong>
        <ul>
            <li>
                <paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>
            </li>
        </ul>
    </li>
</ul>

I need to remove the numbers preceding the sub-bulleted text (1. and 2. in the given example).

This is the code I have so far:

<?php
class MyDocumentImporter
{
    const AWKWARD_BULLET_REGEX = '/(^[\s]?[\d]+[\.]{1})/i';

    protected $xml_string = '<some_tag><header level="2">My Header</header><ul><li>Bulleted style text<ul><li><paragraph>1.Sub Bulleted style text</paragraph></li></ul></li></ul><ul><li>Bulleted style text <strong>bold</strong><ul><li><paragraph>2.Sub Bulleted <strong>bold</strong></paragraph></li></ul></li></ul></some_tag>';

    protected $dom;

    public function processListsText( $loop = null ){

        $this->dom = new DomDocument('1.0', 'UTF-8');

        $this->dom->loadXML($this->xml_string);

        if(!$loop){
            //get all the li tags
            $li_set = $this->dom->getElementsByTagName('li');
        }
        else{
            $li_set = $loop;
        }

        foreach($li_set as $li){

            //check for child nodes
            if(! $li->hasChildNodes() ){
                continue;
            }

            foreach($li->childNodes as $child){
                if( $child->hasChildNodes() ){
                    //this li has children, maybe a <strong> tag
                    $this->processListsText( $child->childNodes );
                }
                if( ! ( $child instanceof DOMElement ) ){
                    continue;
                }
                if( ( $child->localName != 'paragraph') ||  ( $child instanceof DOMText )){
                    continue;
                }
                if( preg_match(self::AWKWARD_BULLET_REGEX, $child->textContent) == 0 ){
                    continue;
                }

                $clean_content = preg_replace(self::AWKWARD_BULLET_REGEX, '', $child->textContent);

                //set node to empty
                $child->nodeValue = '';

                //add updated content to node
                $child->appendChild($child->ownerDocument->createTextNode($clean_content));

                //$xml_output = $child->parentNode->ownerDocument->saveXML($child);
                //var_dump($xml_output);

            }
        }
    }
}

$importer = new MyDocumentImporter();
$importer->processListsText();

The issue I can see is that $child->textContent returns the plain text content of the node, and strips the additional child tags. So:

<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>

becomes

<paragraph>Sub Bulleted bold</paragraph>

The <strong> tag is no more.

I'm a little stumped... Can anyone see a way to strip the unwanted characters, and retain the "inner child" <strong> tag?

The tag may not always be <strong>, it could also be a hyperlink <a href="#">, or <emphasize>.

  • This doesn't even parse properly as XML. Commented May 21, 2013 at 17:09
  • @Jack: his formatted example doesn't, his inline code example does. Commented May 21, 2013 at 17:10
  • You can just use \. instead of [\.]{1} BTW. Commented May 21, 2013 at 17:11

2 Answers


Assuming your XML actually parses, you could use XPath to make your queries a lot easier:

$xp = new DOMXPath($this->dom);

foreach ($xp->query('//li/paragraph') as $para) {
        // The dot is escaped so it matches a literal "." rather than any character.
        $para->firstChild->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $para->firstChild->nodeValue);
}

It does the text replacement on the paragraph's first text node only, instead of on the whole tag contents, so inline child tags such as <strong> are left untouched.
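A minimal, self-contained sketch of that idea, assuming the fragment is wrapped in a single root element (the <some_tag> wrapper from the question's inline string) so that it parses. The function name stripLeadingNumbers and the DOMText guard are illustrative additions, not part of the snippet above.

```php
<?php
// Hypothetical helper: strips a leading "1." style prefix from the first
// child of every <paragraph> directly inside an <li>, if it is a text node.
function stripLeadingNumbers(string $xml): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->loadXML($xml);

    $xp = new DOMXPath($dom);
    foreach ($xp->query('//li/paragraph') as $para) {
        $first = $para->firstChild;
        // Skip paragraphs that are empty or that start with an element.
        if (!($first instanceof DOMText)) {
            continue;
        }
        $first->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $first->nodeValue);
    }

    // Serialize only the root element, without the XML declaration.
    return $dom->saveXML($dom->documentElement);
}

$xml = '<some_tag><ul><li>Bulleted style text <strong>bold</strong><ul><li>'
     . '<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>'
     . '</li></ul></li></ul></some_tag>';

echo stripLeadingNumbers($xml), "\n";
```

Because only the first text node is rewritten, the <strong> element that follows it stays in place.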


1 Comment

Thanks for the help, the code I posted is just one part of a multi-faceted document processing class. Hopefully with the xpath tip I'll be able to clean up a lot of the code!

You are resetting the paragraph's whole content, but what you want is to alter only its first text node (keep in mind that text nodes are nodes too). You might want to query the XPath //li/paragraph/text()[position()=1] and replace that DOMText node instead of the whole paragraph content.

$d = new DOMDocument();
$d->loadXML($xml);
$p = new DOMXPath($d);
foreach($p->query('//li/paragraph/text()[position()=1]') as $text){
        // Swap the first text node for a cleaned copy; sibling elements are untouched.
        $text->parentNode->replaceChild(new DOMText(preg_replace(self::AWKWARD_BULLET_REGEX, '', $text->textContent)), $text);
}
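
A standalone sketch of the same approach, showing that other inline tags mentioned in the question (<a href="#">, <emphasize>) survive as well. It makes a few assumptions of its own: the pattern is a global constant named BULLET_REGEX instead of the class constant, the sample input is made up for illustration, and the replacement node is created with createTextNode() on the document rather than with new DOMText.

```php
<?php
// Equivalent in spirit to the question's AWKWARD_BULLET_REGEX, written with \.
// as suggested in the comments; a global constant for this standalone sketch.
const BULLET_REGEX = '/^\s?\d+\./';

// Illustrative input, wrapped in a single root element so it parses.
$xml = '<some_tag><ul><li><paragraph>1.See <a href="#">this link</a> and '
     . '<emphasize>this</emphasize></paragraph></li></ul></some_tag>';

$d = new DOMDocument();
$d->loadXML($xml);
$p = new DOMXPath($d);

// text()[1] selects only the first text-node child of each paragraph.
foreach ($p->query('//li/paragraph/text()[1]') as $text) {
    $clean = preg_replace(BULLET_REGEX, '', $text->nodeValue);
    $text->parentNode->replaceChild($d->createTextNode($clean), $text);
}

// Serialize only the root element, without the XML declaration.
$result = $d->saveXML($d->documentElement);
echo $result, "\n";
```

Note that text()[1] picks the first text node among the paragraph's children, which is not necessarily its first child; if a paragraph starts with an element, the text after that element is what gets matched against the pattern.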
