preg_replace help to remove empty "li" element [duplicate]

Question

Hi, I need a script that removes from an HTML string every "li" element that is empty or contains only spaces. It should also remove "li" elements that contain only empty tags (a single empty tag or nested empty tags).

I use this preg_replace call, which successfully removes only the completely empty "li" (the 4th li in the example below).

But I don't know how to remove the last "li", which has an empty "span" inside it. Any suggestions? Thanks.

$contenuto = '<ol style="margin-top: 0cm; margin-bottom: 0cm;">
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">x</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">y</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">z</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt; color: red;"> </span></li>
</ol>';

$contenuto = preg_replace('/<li[^>]*>(\s|&nbsp;)*<\/li>/', '', $contenuto);

echo $contenuto;

Please carefully read this finely crafted essay: You can't parse HTML with regex.
– KIKO Software, Commented Aug 6, 2024 at 7:20
Maybe this page on using DOMDocument / XPath can be helpful: stackoverflow.com/questions/8603237/…
– The fourth bird, Commented Aug 6, 2024 at 7:23
Or extend your current regex: <li[^>]*>(?:\s+|&nbsp;|</?span[^>]*>)*<\/li>
– bobble bubble, Commented Aug 6, 2024 at 7:31
@KIKOSoftware Please stop linking to the "Zalgo" / anti-Cthulhu regex rant. See instead: "Using regular expressions to parse HTML: why not?" and "Why you shouldn't and when you should use regular expressions?"
– bobble bubble, Commented Aug 6, 2024 at 7:58
@bobblebubble You've got a point. Luckily there are other answers on that page, and other comments here. The point remains: RegEx is poor at parsing HTML that is changeable, and this question seems to imply it is. If it weren't, you wouldn't need the RegEx.
– KIKO Software, Commented Aug 6, 2024 at 8:10

Gordon · Accepted Answer · 2024-08-09 10:34:37Z

4

The XPath to select the empty li nodes is

//li[not(normalize-space())]

An XPath query is not what you asked for, but I find it much more concise, more readable, and easier to come up with than a reliable regex that does the same.

Unfortunately, PHP doesn't have something like an xpath_replace function which hides away all the boilerplate to do what preg_replace does for a Regex. So you'd have to write some additional code to get your desired output:

<?php
$html = '<ol style="margin-top: 0cm; margin-bottom: 0cm;">
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">x</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">y</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">z</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt; color: red;"> </span></li>
</ol>';

$emptyLists = '//li[not(normalize-space())]';

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach($xpath->query($emptyLists) as $node) {
    $node->parentNode->removeChild($node);
}

echo $dom->saveHTML();

will output

<ol style="margin-top: 0cm; margin-bottom: 0cm;">
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">x</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">y</span></li>
<li style="margin: 0cm 0cm 0cm 47.6px; text-align: justify; line-height: normal; font-size: 11pt; font-family: Calibri, sans-serif; text-indent: 0.4px;"><span style="font-size: 10.0pt;">z</span></li>


</ol>

Demo https://3v4l.org/o0K0H
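To illustrate the point about the missing xpath_replace, here is a minimal sketch of how the boilerplate could be wrapped in a reusable function. The name removeNodesByXPath and its signature are made up for this example, as is the sample input. The second query is an assumption about what you may also want: normalize-space() only treats space, tab, CR and LF as whitespace, so an li containing just &nbsp; (which the regex in the question does cover) is not matched unless U+00A0 is translated to a plain space first.

```php
<?php
// Hypothetical helper: the name and signature are illustrative, not a PHP built-in.
function removeNodesByXPath(string $html, string $query): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);

    // Copy the matches into an array before removing anything.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $node->parentNode->removeChild($node);
    }

    return $dom->saveHTML();
}

// Made-up sample: one real item, one &nbsp;-only item, one empty-span item.
$sample = '<ol><li>a</li><li>&nbsp;</li><li><span> </span></li></ol>';

// The query from this answer: the &nbsp;-only item is kept.
$plain = removeNodesByXPath($sample, '//li[not(normalize-space())]');

// Variant that also treats U+00A0 (what &nbsp; decodes to) as blank.
$nbspAware = '//li[not(normalize-space(translate(., "' . "\u{A0}" . '", " ")))]';
$strict = removeNodesByXPath($sample, $nbspAware);

echo $plain, $strict;
```

With this sample, the plain query should leave two li elements and the &nbsp;-aware one should leave only the first. Whether an &nbsp;-only item counts as "empty" depends on your content, so treat the second query as optional.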

edited Aug 9, 2024 at 10:34

answered Aug 6, 2024 at 10:01

Gordon


4 Comments

itajackass Over a year ago

Thank you, it works great. I've modified this line to avoid adding the unneeded <html> tag: $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

mickmackusa Over a year ago

I don't think saveHTML() needs any parameters passed in. 3v4l.org/o0K0H

Gordon Over a year ago

@mickmackusa you can pass a DOMNode to saveHTML. Then it will just use that tree.

Gordon Over a year ago

But yes, you don't need it when using LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
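For anyone following the saveHTML() discussion above, here is a small sketch of the alternative: load the fragment without the LIBXML flags and pass the node you want to saveHTML(). The short input string is made up for the example.

```php
<?php
$html = '<ol><li>x</li><li></li></ol>';

// Without the LIBXML flags, libxml wraps the fragment in <html><body>
// and adds a doctype.
$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//li[not(normalize-space())]') as $node) {
    $node->parentNode->removeChild($node);
}

// Passing a node serializes only that subtree, so the wrappers are left out.
$ol = $dom->getElementsByTagName('ol')->item(0);
$fragment = $dom->saveHTML($ol);

echo $fragment;
```

Both routes should give the same list markup; the flags are simply less code when the input has a single root element.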

Vincent Decaux · Accepted Answer · 2024-08-06 08:43:44Z

1

I'm answering to point out two things:

@bobble bubble is right that you can parse small pieces of HTML using regex, especially when you are sure about the encoding / language.
You can use ChatGPT when you deal with regex; it works well when you need something simple.

Here is my answer:

$regex = '/<li[^>]*>(?:\s*|(?:<[^>\/]+[^>]*>\s*<\/[^>]+>)(?:\s*|<\/?\w+[^>]*>\s*))<\/li>/s';
$contenuto = preg_replace($regex, '', $contenuto);

edited Aug 6, 2024 at 8:43

answered Aug 6, 2024 at 8:11

Vincent Decaux


3 Comments

KIKO Software Over a year ago

Your RegEx doesn't work when I change the <span> to a different tag: 3v4l.org/EGb2j You may argue that that's not what's asked, but I think it is. The question talks about "empty tags" and gives <span> just as an example.

Vincent Decaux Over a year ago

@KIKOSoftware Indeed! I misread the question, sorry. Updated; it seems to work with any empty tag now.

KIKO Software Over a year ago

Yes, you can adapt your RegEx as long as you know what to adapt to. The problem with HTML is that it can vary a lot. Suppose one tag surrounds another tag; then your expression fails again: 3v4l.org/0Gmf3 This can go on endlessly. Your expression will become so complicated that no sane person can understand it anymore. Try the other answer here; note how it can cope with any HTML and is still, somewhat, comprehensible.
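To make the nesting limitation concrete, here is a short sketch that runs the pattern from this answer against two inputs. Both inputs are hypothetical and are not the ones behind the 3v4l links in the comments.

```php
<?php
// The pattern from the answer above, copied unchanged.
$regex = '/<li[^>]*>(?:\s*|(?:<[^>\/]+[^>]*>\s*<\/[^>]+>)(?:\s*|<\/?\w+[^>]*>\s*))<\/li>/s';

// Hypothetical inputs: one empty tag inside the li, then two nested empty tags.
$flat   = '<ol><li>x</li><li><span> </span></li></ol>';
$nested = '<ol><li>x</li><li><span><b> </b></span></li></ol>';

$flatResult = preg_replace($regex, '', $flat);
$nestedResult = preg_replace($regex, '', $nested);

echo $flatResult, "\n", $nestedResult, "\n";
```

The single empty span is removed, but the nested case is left untouched because the pattern expects a closing tag right after the first opening tag. The XPath query //li[not(normalize-space())] from the other answer matches both cases.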
