Introduction
From the following conversations
rekire
Why don't you simply use regular expressions? –
GionaF
rekire i've been doing that for a long time, but i'm trying to switch to DOMDocument / html5lib ... codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html`
I totally agree that is why i believe this is not a Job for Both DomDocument & Regular Expresstion because you are dealing with issues of depreciated HTML Tags that are no longer supported in HTML 5
Implication
This means that font is not your only issue you might also have to replace
- acronym
- applet
- basefont
- big
- center
- dir
- frame
- frameset
- noframes
- s
- strike
- tt
- xmp
Use Tidy
I would Recommend Tidy which was designed so that you don't have to do what you are about to do
FORM PHP DOC
Tidy is a binding for the Tidy HTML clean and repair utility which allows you to not only clean and otherwise manipulate HTML documents, but also traverse the document tree.
Example
$html = '<font color="#ff0000">Lorem <font size="4">ipsum dolor</font> sit amet</font>';
$config = array(
'indent' => true,
'show-body-only' => false,
'clean' => true,
'output-xhtml' => true,
'preserve-entities' => true);
$tidy = new tidy();
echo $tidy->repairString($html, $config, 'UTF8');
Output
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<style type="text/css">
/*<![CDATA[*/
span.c2 {
color: #FF0000
}
span.c1 {
font-size: 120%
}
/*]]>*/
</style>
</head>
<body><span class="c2">Lorem <span class="c1">ipsum dolor</span> sit amet</span>
</body>
</html>
See also see Cleaning HTML by removing extra/redundant formatting tags for examples
Better Sill : HTMLPurifier
You can use HTMLPurifier which also uses Tidy to clean up the HTML all you need is to set the TidyLevel
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications
require_once 'htmlpurifier-4.4.0/library/HTMLPurifier.auto.php';
$html = '<font color="#ff0000">Lorem <font size="4">ipsum dolor</font> sit amet</font>';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.TidyLevel', 'heavy');
$purifier = new HTMLPurifier($config);
$clean = $purifier->purify($html);
var_dump($clean);
Output
string '<span style="color:#ff0000;">Lorem <span style="font-size:large;">ipsum dolor</span> sit amet</span>' (length=100)
I want DOMDocument
If all you want is dom and you don't care about all my explanations then you can use
$html = '<font color="#ff0000">Lorem <font size="4">ipsum dolor</font> sit amet</font>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$nodes = iterator_to_array($dom->getElementsByTagName('font'));
foreach ( $nodes as $font ) {
$css = array();
$font->hasAttribute('size') and $css[] = 'font-size:' . round($font->getAttribute('size') / 2, 1) . 'em;';
$font->hasAttribute('color') and $css[] = 'color:' . $font->getAttribute('color') . ';';
$span = $dom->createElement('span');
$children = array();
foreach ( $font->childNodes as $child )
$children[] = $child;
foreach ( $children as $child )
$span->appendChild($child);
$span->setAttribute('style', implode('; ', $css));
$font->parentNode->replaceChild($span, $font);
}
echo "<pre>";
$dom->formatOutput = true;
print(htmlentities($dom->saveXML()));