To translate a website, I need to find the text that sits between HTML tags.

My first approach was to use a regex, but it is not flexible enough. The closest I was able to get with a regex was: http://regex101.com/r/qB6xU5/1

It fails only on the last test, where it matches the two p tags as one match instead of two.

I considered using a DOM parser library, but after a very brief search I wasn't able to find one that fulfills my needs.

On top of that, the HTML may contain errors and Smarty templating tags.

Here are some example cases and the results they should produce:

  • <div>test</div> => test
  • <div><br />test</div> => <br />test
  • <div>te<br />st</div> => te<br />st
  • <div>test<br /></div> => test<br />
  • <div><span>my</span>test</div> => <span>my</span>test
  • <div>test<span>my</span></div> => test<span>my</span>
  • <div>test<span>my</span>test</div> => test<span>my</span>test
  • <div><span>my</span>test<span>my</span></div> => <span>my</span>test<span>my</span>

In short: find the content of any HTML tag that contains at least one string not enclosed in a child tag.

  • Have you tried an HTML parser? Commented Sep 12, 2014 at 15:20
  • Parsing HTML with regex is not going to work - it's too complex. Here's a ton of great info on using parsers: stackoverflow.com/questions/3577641/… Commented Sep 12, 2014 at 15:22
  • Don't use regular expressions to parse HTML. Use a proper HTML parsing module. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See htmlparsing.com/php or this SO thread for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. Commented Sep 12, 2014 at 15:22
  • I only see <div></div> removed... why don't you just str_replace the DIVs ? Commented Sep 12, 2014 at 15:23
  • @mariobgr: This was a simple example, I'm parsing a ton of HTML content. Commented Sep 12, 2014 at 15:43

2 Answers

Don't use a regexp. Use an HTML parser!

Here's an example with PHP Simple HTML DOM Parser, but you can do it with whatever parser you prefer:

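include("simple_html_dom.php"); // the library's file name; adjust the path as needed

$html = str_get_html('<div>test<br /></div>');
$div = $html->find('div', 0); // Here's the div
$result = "";
// ->nodes lists every child, text nodes included (first_child() skips text nodes)
foreach ($div->nodes as $child) {
  $result .= $child; // .= concatenates strings; += would do numeric addition
}
echo $result; // => "test<br />"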
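
If you'd rather not pull in a third-party library, here is a rough sketch of the same idea with PHP's bundled DOM extension (DOMDocument and DOMXPath). The function name find_translatable is made up for illustration. I'm assuming the markup is close enough to valid HTML for libxml to repair it, and Smarty tags are simply treated as text. Note that saveHTML() re-serializes the nodes, so a <br /> in the input comes back as <br>.

```php
<?php
/**
 * Hypothetical helper (not part of any library): return the inner HTML of
 * every outermost element that has at least one non-blank text node as a
 * direct child. script and style elements are skipped.
 */
function find_translatable(string $html): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally so broken markup does not emit notices.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // An element "has loose text" when one of its direct text children is non-blank.
    $hasText = 'text()[normalize-space()]';
    $query = '//*[not(self::script or self::style)]'
           . '[' . $hasText . ']'
           . '[not(ancestor::*[' . $hasText . '])]'; // keep the outermost match only

    $results = [];
    foreach ($xpath->query($query) as $element) {
        $inner = '';
        // Rebuild the inner HTML from every child, text nodes included.
        foreach ($element->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        $results[] = trim($inner);
    }
    return $results;
}

print_r(find_translatable('<div>test<span>my</span>test</div>'));
```

The ancestor test is what keeps the span in the example from being reported a second time on its own: once the div qualifies, everything inside it is returned as one string.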

3 Comments

How could I check whether a sibling is a text node or an element?
@cyrbil You can check the reference here. To achieve what you want, you can do something like: $sibling."" == $sibling->plaintext
OK, Simple HTML DOM's documentation was a bit outdated, but I managed to get through. Since it has no way to check for text, I did it by walking the DOM recursively and checking whether an element contains a piece of text that is not between tags. To check, I remove every tag with a regex (I also want to remove the tags' content, so strip_tags isn't enough), then see whether any text is left.

For the record, here is the complete code. Some of the regexes may be unnecessary in some cases, but I needed them all ;)

<?php
include("simple_php_dom.php");

// load html content to parse
$html_str = file_get_contents("myfile.tpl");
$html = str_get_html($html_str);

// extract strings
$results = array(); // filled by reference inside parse()
parse($html, $results);
var_dump($results); // simply display

/**
 * Parse html element and find every text not between tags
 * @param $elem DOM element to parse
 * @param $results array
 */
function parse($elem, &$results) {
    // walk through every node
    foreach($elem->childNodes() as $child) {
        // get sub children
        $children = $child->childNodes();

        // get inner content
        $content = $child->innertext;

        // remove starting and ending self closing elements or smarty tags
        $content = preg_replace('/(^(\s*<[^>]*?\/\s*>)+)|((<[^>]*?\/\s*>\s*)+$)/s', '', $content);
        $content = preg_replace('/(^(\s*{[^}]*?})+)|((\{[^}]*?\}\s*)+$)/s', '', $content);
        $content = trim($content);

        // remove all elements and smarty tags
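        // lazy .*? with the s flag: a greedy .* would also swallow loose text between two sibling elements
        $text = preg_replace('/<(\w+)[^>]*>.*?<\s*\/\1\s*>/s', '', $content); // remove elements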
        $text = preg_replace('/<\/?.*?\/?>/', '', $text); // remove self closing elements
        $text = preg_replace('/\{.*?\}/', '', $text); // remove smarty tags
        $text = preg_replace('/[^\w]/', '', $text); // remove non alphanum characters
        $text = trim($text);

        // no children, we are at a leaf and it's probably a text
        if(empty($children)) {
            // check if not empty string and exclude comments styles and scripts
            if(!empty($text) && in_array($child->tag, array("comment","style","script")) === false) {
                // add to results
                $results[] = $content;
            }
        }
        // we are on a branch, but it contains text that is not inside tags
        elseif(!empty($text)) {
            // add to results
            $results[] = $content;
        } else {
            // recursive call with sub element
            parse($child, $results);
        }
    }
}
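
The "is there any text left once the tags are gone" check is the part most likely to need tuning, so here is a small sketch of it pulled out on its own so it can be tried without the DOM library. The name has_loose_text is hypothetical. The paired-element pattern here uses a lazy quantifier with the s flag, so that text sitting between two sibling elements is not swallowed along with them.

```php
<?php
/**
 * Hypothetical helper: true when $content still holds word characters
 * after paired elements, leftover tags and Smarty tags are stripped.
 */
function has_loose_text(string $content): bool
{
    // drop paired elements together with their content (lazy, may span lines)
    $text = preg_replace('/<(\w+)[^>]*>.*?<\s*\/\1\s*>/s', '', $content);
    // drop whatever tags remain, such as <br />
    $text = preg_replace('/<\/?[^>]*>/', '', $text);
    // drop Smarty tags such as {$title}
    $text = preg_replace('/\{.*?\}/', '', $text);
    // keep word characters only
    $text = preg_replace('/\W/', '', $text);

    return $text !== '';
}

var_dump(has_loose_text('<span>my</span>test<span>my</span>')); // bool(true)
var_dump(has_loose_text('<span>my</span><br />'));              // bool(false)
```

Like the rest of the regex cleanup, this is only a heuristic: nested elements with the same tag name, or a > inside an attribute value, can still fool it.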
