PHP - text between tags

Question

To translate a website, I need to extract the text that sits between HTML tags.

My first approach was to use regex, but it's not flexible enough. The closest I got with regex was: http://regex101.com/r/qB6xU5/1

It only fails on the last test, where it matches the two p tags as a single match instead of two.

I considered using a DOM parser library but wasn't able (after a very brief search) to find one that fulfills my needs.

On top of that, the HTML may contain errors and Smarty templating tags.

Here are some example cases and the results they should produce:

<div>test</div> => test
<div><br />test</div> => <br />test
<div>te<br />st</div> => te<br />st
<div>test<br /></div> => test<br />
<div><span>my</span>test</div> => <span>my</span>test
<div>test<span>my</span></div> => test<span>my</span>
<div>test<span>my</span>test</div> => test<span>my</span>test
<div><span>my</span>test<span>my</span></div> => <span>my</span>test<span>my</span>

In short, it can be rephrased as: find the content of every HTML tag that contains at least one string not enclosed in another tag.

Parsing HTML with regex is not going to work - it's too complex. Here's a ton of great info on using parsers: stackoverflow.com/questions/3577641/…
– Surreal Dreams, Commented Sep 12, 2014 at 15:22
Don't use regular expressions to parse HTML. Use a proper HTML parsing module. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See htmlparsing.com/php or this SO thread for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged.
– Andy Lester, Commented Sep 12, 2014 at 15:22
I only see <div></div> removed... why don't you just str_replace the DIVs?
– mariobgr, Commented Sep 12, 2014 at 15:23
@mariobgr: This was a simple example, I'm parsing a ton of HTML content.
– Cyrbil, Commented Sep 12, 2014 at 15:43

ProGM · Accepted Answer · 2014-09-12 15:30:06Z

1

Don't use a regexp. Use an HTML parser!

Here's an example with PHP Simple HTML DOM Parser, but you can use whichever parser you prefer:

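include 'simple_html_dom.php'; // adjust to wherever the library file lives

$html = str_get_html('<div>test<br /></div>');
$div = $html->find('div', 0); // Here's the div
$result = "";
// Walk $div->nodes rather than first_child()/next_sibling(): those only
// visit element children and would skip the text node "test".
foreach ($div->nodes as $node) {
  $result .= $node->outertext; // .= concatenates; += would do numeric addition
}
echo $result; // => "test<br />"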
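
The same idea can also be sketched with PHP's bundled DOM extension instead of a third-party library. This is only a minimal sketch, not the answerer's code: the helper name innerHtml is made up for illustration, and it assumes reasonably well-formed input (DOMDocument repairs broken markup in its own way and knows nothing about Smarty tags). Note also that saveHTML serializes void elements as <br> rather than <br />.

```php
<?php
/**
 * Return the inner HTML of the first <div> found in $html.
 * Hypothetical helper, shown only to illustrate the technique.
 */
function innerHtml(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $div = $doc->getElementsByTagName('div')->item(0);
    if ($div === null) {
        return '';
    }

    // childNodes includes text nodes as well as elements,
    // so the loose text next to the tags is preserved.
    $result = '';
    foreach ($div->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }
    return $result;
}

echo innerHtml('<div>test<span>my</span></div>'), "\n";
```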


3 Comments

Cyrbil Over a year ago

How could I check whether a sibling is a text node or an element?

ProGM Over a year ago

@cyrbil You can check the reference here. To achieve what you want, you can do something like: $sibling."" == $sibling->plaintext

Cyrbil Over a year ago

OK, Simple HTML DOM's documentation was a bit outdated, but I managed to get through it. Since it has no built-in way to check for text, I did it by walking recursively through the DOM and checking whether an element contains a piece of text that is not between tags. To check, I remove every tag with a regex (I also want to remove the tags' content, so strip_tags isn't enough) and then check whether any text is left.
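
For comparison, telling text nodes from elements is more direct with PHP's bundled DOM extension, where every node carries a nodeType. The following is a minimal sketch under that assumption: it swaps in DOMDocument for Simple HTML DOM, the helper name hasDirectText is hypothetical, and the sample markup is made up.

```php
<?php
/**
 * True if $element has at least one direct child text node
 * that contains something other than whitespace.
 */
function hasDirectText(DOMElement $element): bool
{
    foreach ($element->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) !== '') {
            return true;
        }
    }
    return false;
}

$doc = new DOMDocument();
$doc->loadHTML('<div><p>hello<br>world</p><ul><li>item</li></ul></div>');

$p  = $doc->getElementsByTagName('p')->item(0);
$ul = $doc->getElementsByTagName('ul')->item(0);

var_dump(hasDirectText($p));  // true: "hello" and "world" are direct text children
var_dump(hasDirectText($ul)); // false: the only text lives inside <li>
```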

Cyrbil · Accepted Answer · 2014-09-16 08:16:42Z

For the record, here is the complete code. Some of the regexes may not be necessary in every case, but I needed them all ;)

<?php
include("simple_php_dom.php");

// load html content to parse
$html_str = file_get_contents("myfile.tpl");
$html = str_get_html($html_str);

// extract strings
$results = array(); // filled by reference in parse()
parse($html, $results);
var_dump($results); // simply display

/**
 * Parse html element and find every text not between tags
 * @param $elem DOM element to parse
 * @param $results array
 */
function parse($elem, &$results) {
    // walk through every node
    foreach($elem->childNodes() as $child) {
        // get sub children
        $children = $child->childNodes();

        // get inner content
        $content = $child->innertext;

        // remove starting and ending self closing elements or smarty tags
        $content = preg_replace('/(^(\s*<[^>]*?\/\s*>)+)|((<[^>]*?\/\s*>\s*)+$)/s', '', $content);
        $content = preg_replace('/(^(\s*{[^}]*?})+)|((\{[^}]*?\}\s*)+$)/s', '', $content);
        $content = trim($content);

        // remove all elements and smarty tags
        $text = preg_replace('/<(\w+)[^>]*>.*<\s*\/\1\s*>/', '', $content); // remove elements
        $text = preg_replace('/<\/?.*?\/?>/', '', $text); // remove self closing elements
        $text = preg_replace('/\{.*?\}/', '', $text); // remove smarty tags
        $text = preg_replace('/[^\w]/', '', $text); // remove non alphanum characters
        $text = trim($text);

        // no children, we are at a leaf and it's probably a text
        if(empty($children)) {
            // check if not empty string and exclude comments styles and scripts
            if(!empty($text) && in_array($child->tag, array("comment","style","script")) === false) {
                // add to results
                $results[] = $content;
            }
        }
        // we are on a branch, but it contains text not inside tags
        elseif(!empty($text)) {
            // add to results
            $results[] = $content;
        } else {
            // recursive call with sub element
            parse($child, $results);
        }
    }
}
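
The recursive walk can also be sketched on top of PHP's bundled DOM extension, which exposes text nodes directly and so needs fewer regexes. This is a rough sketch under stated assumptions, not a drop-in replacement: it swaps in DOMDocument for Simple HTML DOM, the function names collectStrings and hasOwnText and the sample markup are made up, and DOMDocument may repair malformed HTML or mangle Smarty block tags differently from the parser used above.

```php
<?php
/**
 * True if $element directly contains text other than whitespace,
 * punctuation, or Smarty-style {...} tags.
 */
function hasOwnText(DOMElement $element): bool
{
    foreach ($element->childNodes as $child) {
        if ($child->nodeType !== XML_TEXT_NODE) {
            continue;
        }
        $text = preg_replace('/\{.*?\}/s', '', $child->nodeValue); // drop Smarty tags
        if (preg_replace('/\W/u', '', $text) !== '') {              // any word character left?
            return true;
        }
    }
    return false;
}

/**
 * Collect the inner HTML of every element that directly contains text.
 * Elements without direct text are searched recursively.
 */
function collectStrings(DOMNode $node, array &$results): void
{
    foreach ($node->childNodes as $child) {
        if (!($child instanceof DOMElement)) {
            continue; // text and comments are handled through their parent element
        }
        if (in_array(strtolower($child->tagName), ['script', 'style'], true)) {
            continue;
        }
        if (hasOwnText($child)) {
            $inner = '';
            foreach ($child->childNodes as $grandChild) {
                $inner .= $child->ownerDocument->saveHTML($grandChild);
            }
            $results[] = trim($inner);
        } else {
            collectStrings($child, $results);
        }
    }
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate sloppy markup without printing warnings
$doc->loadHTML('<div><p>first</p><p>test<span>my</span>test</p><p>{$smarty_var}</p></div>');
libxml_clear_errors();

$results = [];
collectStrings($doc, $results);
print_r($results);
```

With this sample input, the third paragraph is skipped because nothing is left once its Smarty tag is removed.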
