
I have this HTML fragment:

<font color="#ff0000">Lorem <font size="4">ipsum dolor</font> sit amet</font>

and I want to replace each font tag with a span, using DOMDocument. This is my function at the moment:

$fonts = $xPath->query('//font');
foreach($fonts as $font){
    $style = '';
    $newFont = $dom->createElement('span',$font->nodeValue);
    if($font->hasAttribute('size')){
        $size = $font->getAttribute('size');
        $style.='font-size:'.round($size/2,1).'em; ';
    }
    if($font->hasAttribute('color')){
        $style.='color:'.$font->getAttribute('color').'; ';
    }
    if($style!='') $newFont->setAttribute('style',$style);
    $font->parentNode->replaceChild($newFont,$font);
}

I expected this output:

<span style="color:#ff0000; ">Lorem <span style="font-size:2em;">ipsum etc..

But I get:

<span style="color:#ff0000; ">Lorem ipsum dolor sit amet</span>

Why?


I guess it happens because $font->parentNode->replaceChild($newFont,$font); is somehow replacing the outer span with just its text value... Or maybe the query $xPath->query('//font') is wrong. I'd love a suggestion from someone experienced... thanks

  • Why don't you simply use regular expressions? Commented Nov 6, 2012 at 15:56
  • @rekire I've been doing that for a long time, but I'm trying to switch to DOMDocument / html5lib ... codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html Commented Nov 6, 2012 at 16:00
  • I know that HTML tag pairs cannot be replaced with regular expressions, but simple closing font tags can be replaced with a closing span in every case, can't they? Commented Nov 6, 2012 at 16:07
  • Yup @rekire, I could handle this particular case even with str_replace and preg_match... I just want to understand how DOMDocument works, but I get lost in the official documentation ;-) Commented Nov 6, 2012 at 16:25

3 Answers


Introduction

From the following exchange in the comments:

rekire: Why don't you simply use regular expressions?

GionaF: @rekire I've been doing that for a long time, but I'm trying to switch to DOMDocument / html5lib ... codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

I totally agree, which is why I believe this is a job for neither DOMDocument nor regular expressions: you are dealing with deprecated HTML tags that are no longer supported in HTML5.

Implication

This means that font is not your only issue; you might also have to replace:

  • acronym
  • applet
  • basefont
  • big
  • center
  • dir
  • frame
  • frameset
  • noframes
  • s
  • strike
  • tt
  • xmp

Use Tidy

I would recommend Tidy, which was designed so that you don't have to do what you are about to do.

From the PHP documentation:

Tidy is a binding for the Tidy HTML clean and repair utility which allows you to not only clean and otherwise manipulate HTML documents, but also traverse the document tree.

Example

$html = '<font color="#ff0000">Lorem <font size="4">ipsum dolor</font> sit amet</font>';
$config = array(
        'indent' => true,
        'show-body-only' => false,
        'clean' => true,
        'output-xhtml' => true,
        'preserve-entities' => true);

$tidy = new tidy();
echo $tidy->repairString($html, $config, 'UTF8');

Output

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title></title>
        <style type="text/css">
            /*<![CDATA[*/
            span.c2 {
                color: #FF0000
            }
            span.c1 {
                font-size: 120%
            }
            /*]]>*/
        </style>
    </head>
    <body><span class="c2">Lorem <span class="c1">ipsum dolor</span> sit amet</span>
    </body>
</html>

See also "Cleaning HTML by removing extra/redundant formatting tags" for examples.

Better Still: HTMLPurifier

You can use HTMLPurifier, which ships its own Tidy-style cleanup module; all you need to do is set HTML.TidyLevel.

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications

require_once 'htmlpurifier-4.4.0/library/HTMLPurifier.auto.php';

$html = '<font color="#ff0000">Lorem <font size="4">ipsum dolor</font> sit amet</font>';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.TidyLevel', 'heavy'); 
$purifier = new HTMLPurifier($config);
$clean = $purifier->purify($html);

var_dump($clean);

Output

string '<span style="color:#ff0000;">Lorem <span style="font-size:large;">ipsum dolor</span> sit amet</span>' (length=100)

I want DOMDocument

If all you want is DOM and you don't care about all my explanations, then you can use:

$html = '<font color="#ff0000">Lorem <font size="4">ipsum dolor</font> sit amet</font>';
$dom = new DOMDocument();
$dom->loadHTML($html);
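// Snapshot the live node list so replacing nodes does not disturb the loop
$nodes = iterator_to_array($dom->getElementsByTagName('font'));
foreach ( $nodes as $font ) {
    // Collect declarations without trailing semicolons; implode() adds the separators
    $css = array();
    $font->hasAttribute('size') and $css[] = 'font-size:' . round($font->getAttribute('size') / 2, 1) . 'em';
    $font->hasAttribute('color') and $css[] = 'color:' . $font->getAttribute('color');
    $span = $dom->createElement('span');
    // Copy the child list first: appendChild() moves each node out of $font
    $children = array();
    foreach ( $font->childNodes as $child )
        $children[] = $child;
    foreach ( $children as $child )
        $span->appendChild($child);
    // Only set the attribute when there is something to put in it
    if ( $css )
        $span->setAttribute('style', implode('; ', $css) . ';');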
    $font->parentNode->replaceChild($span, $font);
}
echo "<pre>";
$dom->formatOutput = true;
print(htmlentities($dom->saveXML()));
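
On why the node list is copied with iterator_to_array before looping: getElementsByTagName returns a live DOMNodeList, so it shrinks as soon as a matching element leaves the tree, and a foreach over it can skip nodes. A minimal sketch of that behaviour, using a made-up three-element document (the markup and variable names are illustrative only, and loadXML is used just to keep parsing unambiguous):

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<p><font>a</font><font>b</font><font>c</font></p>');

// Live view: always reflects the current state of the tree
$live = $dom->getElementsByTagName('font');
// Plain PHP array: fixed at the three nodes found right now
$snapshot = iterator_to_array($live);

// Replace the first <font> with a <span>
$first = $live->item(0);
$first->parentNode->replaceChild($dom->createElement('span', 'a'), $first);

$liveLength = $live->length;        // the live list has lost one entry
$snapshotLength = count($snapshot); // the array still holds all three nodes

echo "live: $liveLength, snapshot: $snapshotLength", PHP_EOL;
```

Iterating over the array keeps every original font node reachable, whatever happens to the tree during the loop.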

5 Comments

Both your HTMLPurifier and DOMDocument methods work like a charm! Again, thanks a lot. I have just one small issue left: how can I produce HTML5 output? HTMLPurifier will turn a <br> into <br/>. So at the moment I'm passing the formatted $clean to html5lib: HTML5_Parser::parse($clean). Is there a way to achieve the same result with HTMLPurifier only?
HTMLPurifier actually uses Tidy to achieve this ... as for <br />, for now Tidy is the only one I am certain has that feature
Mmmh, I see. Sad that fantastic libraries like DOMDocument and QueryPath don't support HTML5 yet. Thanks for your time Baba, I owe you one ;-)
You are welcome anytime ... looking at wiki.php.net/rfc, I am not sure HTML5 will be supported anytime soon
Why are you using the iterator_to_array function here to convert the DOMNodeList object?

It is possible with XSL to change the tags to spans.

<?php

$dom = new DOMDocument();

$dom->loadXML('<font color="#ff0000">Lorem <font size="4">ipsum dolor</font> sit amet</font>');

echo "Starting Point:" . $dom->saveXML() . PHP_EOL;

$xsl = new DOMDocument('1.0', 'UTF-8');
// Could be a separate file
$xsl->loadXML(<<<XSLT
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

    <!-- Identity rule -->
    <xsl:template match="@*|node()"><xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy></xsl:template>
    <xsl:template match="text()"><xsl:value-of disable-output-escaping="yes" select="."/></xsl:template>

    <xsl:template match="font">
        <xsl:element name="span">
            <xsl:attribute name="style" xsl:space="default">
                <xsl:if test="@size">font-size: <xsl:value-of select="round(@size * 10 div 2) div 10" />em;</xsl:if>
                <xsl:if test="@color">color: <xsl:value-of select="@color" />;</xsl:if>
            </xsl:attribute>
            <xsl:apply-templates select="node()"/>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>
XSLT
);

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
echo $proc->transformToXML($dom);

3 Comments

+1 for using your time to write this answer ... doesn't work in my own case because i don't have control over the markup, but it could be useful to someone else
Very cool! Would it be possible to change this so that comments are correctly indented and CDATA sections are NOT added and HTML5 self-closing tags don't come out as inline empty tags such as <br></br>?
There are a number of options in xslt to fiddle with that can tweak the output. On: w3.org/TR/xslt there is <xsl:preserve-space/>, and <xsl:output method=""/> that may work for you.

It appears your code sample is running into a couple of different issues:

  1. The query results contain items that are changing
  2. $node->nodeValue doesn't contain the child nodes, only their combined text
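
The second point can be seen in isolation with a small sketch. It assumes the fragment from the question is loaded with loadXML (chosen here only so that parsing is unambiguous); the variable names are illustrative:

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<font color="#ff0000">Lorem <font size="4">ipsum dolor</font> sit amet</font>');

$outer = $dom->documentElement;

// nodeValue on an element is the concatenated text of all descendants
$text = $outer->nodeValue;

// The element itself has three children: text, <font>, text
$childCount = $outer->childNodes->length;

// A span built from nodeValue gets a single text node and no nested element
$flat = $dom->createElement('span', $text);
$flatChildCount = $flat->childNodes->length;

echo $text, PHP_EOL;
echo "children: $childCount, flattened: $flatChildCount", PHP_EOL;
```

This is why the span created with createElement('span', $font->nodeValue) in the question loses the inner font element: only text is carried over, not the child nodes.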

I found that changing from a foreach to a while loop, and re-running the query after each replacement, got around the issue of finding nodes in a changing tree.

$fonts = $xPath->query('//font');
while ($fonts->length > 0) {
    $font = $fonts->item(0);

    // Get bits of data before touching the tree

    $style   = '';
    if($font->hasAttribute('size')){
        $size   = $font->getAttribute('size');
        $style .= 'font-size:' . round($size/2, 1) . 'em; ';
    }
    if($font->hasAttribute('color')){
        $style .= 'color:' . $font->getAttribute('color') . '; ';
    }

    // Create the new node

    $newFont = $dom->createElement('span');
    if(!empty($style)) {
        $newFont->setAttribute('style', $style);
    }


    // Copy all children into a basic array to avoid an iterator
    // on a changing tree
    $children = iterator_to_array($font->childNodes);
    foreach ($children as $child) {
        // This has a side effect of removing the child from its old
        // location, which changes the tree
        $newFont->appendChild($child);
    }

    // Replace the parent's child, which changes the tree
    $font->parentNode->replaceChild($newFont, $font);


    // query again on the new tree
    $fonts = $xPath->query('//font');
}

1 Comment

Thanks, it works! So there's no easy way for replacing nested elements with DOMDocument?
