
Hi, I need a script that removes, from an HTML string, all "li" elements that are empty or contain only spaces, and also those that contain nothing but empty tags (one empty tag or several nested empty tags).

I use this preg_replace, which successfully removes only the completely empty "li" elements. In this case that is the 4th li.

But I don't know how to remove the last "li", which has an empty "span" inside it. Any suggestions? Thanks.

$contenuto = '<ol style="margin-top: 0cm; margin-bottom: 0cm;">
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">x</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">y</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">z</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt; color: red;"> </span></li>
</ol>';

$contenuto = preg_replace('/<li[^>]*>(\s|&nbsp;)*<\/li>/', '', $contenuto);

echo $contenuto;

2 Answers

The XPath to select the empty li nodes is

//li[not(normalize-space())]

An XPath query is not what you asked for, but I find it more concise, more readable, and easier to come up with than a reliable regex that does the same.
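One caveat, since the regex in the question also matches `&nbsp;`: XPath's normalize-space() only strips spaces, tabs, and line breaks, so an li containing only a non-breaking space is not matched by the query above. A minimal sketch of one possible workaround, assuming PHP 7 or later for the `\u{00A0}` escape, is to turn non-breaking spaces into ordinary spaces with translate() first. The sample markup and variable names are illustrative only.

```php
<?php
// Illustrative input: one real item and two items holding only &nbsp;.
$html = '<ol><li>x</li><li>&nbsp;</li><li> &nbsp; </li></ol>';

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// The plain query does not treat U+00A0 as whitespace.
$plain = '//li[not(normalize-space())]';
// Variant: map U+00A0 to a normal space before normalizing.
$nbspAware = "//li[not(normalize-space(translate(., '\u{00A0}', ' ')))]";

$plainCount = $xpath->query($plain)->length;
$nbspCount  = $xpath->query($nbspAware)->length;

echo "plain query matches: $plainCount\n";
echo "nbsp-aware query matches: $nbspCount\n";
```

Whether `&nbsp;`-only items should count as empty depends on your content, so treat the second query as an option rather than a replacement.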

Unfortunately, PHP doesn't have something like an xpath_replace function that hides all the boilerplate the way preg_replace does for a regex. So you have to write some additional code to get your desired output:

<?php
$html = '<ol style="margin-top: 0cm; margin-bottom: 0cm;">
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">x</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">y</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">z</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt; color: red;"> </span></li>
</ol>';

$emptyLists = '//li[not(normalize-space())]';

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach($xpath->query($emptyLists) as $node) {
    $node->parentNode->removeChild($node);
}

echo $dom->saveHTML();

will output

<ol style="margin-top: 0cm; margin-bottom: 0cm;">
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">x</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">y</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">z</span></li>


</ol>

Demo https://3v4l.org/o0K0H
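If you need this in more than one place, the boilerplate could be wrapped in a small helper that plays the role of the missing xpath_replace. This is only a sketch: the name removeNodesByXPath is hypothetical (it is not part of PHP), and it assumes the input has a single root element, as the `<ol>` in the question does.

```php
<?php
// Hypothetical helper, named for illustration only; not a built-in PHP function.
function removeNodesByXPath(string $html, string $query): string
{
    $dom = new DOMDocument;
    // Assumes $html has exactly one root element (e.g. a single <ol>).
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);

    // Remove every node the query selects from its parent.
    foreach ($xpath->query($query) as $node) {
        $node->parentNode->removeChild($node);
    }

    return $dom->saveHTML();
}

$html = '<ol><li><span>x</span></li><li></li><li><span> </span></li></ol>';
$result = removeNodesByXPath($html, '//li[not(normalize-space())]');
echo $result;
```

The call then reads much like the preg_replace line in the question, with the XPath expression taking the place of the pattern.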


4 Comments

Thank you, it works great. I've modified this line to avoid adding the unneeded <html> tag: $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
I don't think saveHTML() needs any parameters passed in. 3v4l.org/o0K0H
@mickmackusa you can pass a DOMNode to saveHTML. Then it will just use that tree.
But yes, you don't need it when using LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
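To illustrate the point about passing a DOMNode to saveHTML(), here is a minimal sketch that loads a fragment without the LIBXML flags (so the html and body wrappers are added) and then serializes only the `<ol>` node. The sample markup and variable names are illustrative.

```php
<?php
$dom = new DOMDocument;
// No flags here, so libxml wraps the fragment in a doctype, <html> and <body>.
$dom->loadHTML('<ol><li>x</li><li>y</li></ol>');

// Pick the node we actually care about...
$ol = $dom->getElementsByTagName('ol')->item(0);

// ...and serialize just that subtree instead of the whole document.
$fragment = $dom->saveHTML($ol);
echo $fragment;
```

This is an alternative to the flags when the wrappers are harmless during processing and you only want one element back.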

I'm answering to point out two things:

  1. @bobble bubble is right in saying that you can parse small pieces of HTML with a regex, especially when you are sure about the encoding / language.
  2. You can use ChatGPT when you deal with regexes; it works well when you need something simple.

Here is my answer:

$regex = '/<li[^>]*>(?:\s*|(?:<[^>\/]+[^>]*>\s*<\/[^>]+>)(?:\s*|<\/?\w+[^>]*>\s*))<\/li>/s';
$contenuto = preg_replace($regex, '', $contenuto);    

3 Comments

Your RegEx doesn't work when I change the <span> to a <b>: 3v4l.org/EGb2j You may argue that that's not what was asked, but I think it is. The question talks about "empty tags" and gives <span> just as an example.
@KIKOSoftware Indeed! I misread the question, sorry. Updated; it seems to work with any empty tag now.
Yes, you can adapt your RegEx as long as you know what to adapt to. The problem with HTML is that it can vary a lot. Suppose the <b> tag surrounds the <span> tag, then your expression fails again: 3v4l.org/0Gmf3 This can go on endlessly. Your expression will become so complicated that no sane person can understand it anymore. Try the other answer here, note how it can cope with any HTML and is still, somewhat, comprehensible.
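As a rough check of that claim, the sketch below runs the XPath query from the other answer against a few nesting variants, including the <b> around <span> case. The sample inputs and labels are made up for illustration; each list holds one real item and one item that contains only empty tags.

```php
<?php
// Illustrative inputs only: each <ol> has one real item and one "empty" item.
$samples = [
    'b only'         => '<ol><li>x</li><li><b> </b></li></ol>',
    'b around span'  => '<ol><li>x</li><li><b><span> </span></b></li></ol>',
    'two empty tags' => '<ol><li>x</li><li><span></span><i> </i></li></ol>',
];

$remaining = [];
foreach ($samples as $label => $html) {
    $dom = new DOMDocument;
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);

    // Same query for every variant; no per-tag adjustments needed.
    foreach ($xpath->query('//li[not(normalize-space())]') as $node) {
        $node->parentNode->removeChild($node);
    }

    // Count the li elements that survived.
    $remaining[$label] = $dom->getElementsByTagName('li')->length;
    echo $label, ': ', $remaining[$label], " li left\n";
}
```

The query looks at the text content of each li rather than at its tag structure, which is why the nesting depth does not matter.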
