I need to perform a recursive str_replace on a portion of HTML (by recursive I mean inner nodes first), so I wrote:

$str = //get HTML;
$pttOpen = '(\w+) *([^<]{1,100}?)';
$pttClose = '\w+';
$pttHtml = '(?:(?!(?:<x-)).+)';

while (preg_match("%<x-(?:$pttOpen)>($pttHtml)*</x-($pttClose)>%m", $str, $match)) {
    list($outerHtml, $open, $attributes, $innerHtml, $close) = $match;
    $newHtml = //some work....
    $str = str_replace($outerHtml, $newHtml, $str); // assign the result, otherwise $str never changes
}

The idea is to replace non-nested x-tags first. But it only works if innerHtml is on the same line as the opening tag (so I guess I misunderstood what the /m modifier does). I don't want to use a DOM library, because I just need simple string replacement. Any help?

  • Can you add an example in your question with an expected output, please? Commented Feb 12, 2014 at 13:49
  • @CasimiretHippolyte Same question at nearly the same time! :) Commented Feb 12, 2014 at 13:49
  • The m modifier changes the meaning of the anchors ^ and $ (which you don't use) to "start of the line" and "end of the line". Commented Feb 12, 2014 at 13:58
  • The "work" you made with $newHTML can be useful too. Commented Feb 12, 2014 at 14:04
  • Operations on HTML code -> always use a DOM parser, not regex. (xpath, domdocument, simplexml, sax..) Commented Feb 12, 2014 at 14:32
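
To make the comment about the m modifier concrete, here is a minimal sketch with a made-up two-line input (the variable names and the sample string are assumptions for illustration only). It suggests that /m only affects the anchors ^ and $, while /s is the modifier that lets the dot match a newline.

```php
<?php
// Hypothetical input: the tag content spans two lines.
$html = "<x-foo>line one\nline two</x-foo>";

// With /m the dot still stops at "\n", so the closing tag is never reached.
$withM = preg_match('%<x-foo>(.+)</x-foo>%m', $html);

// With /s (PCRE_DOTALL) the dot also matches "\n", so the whole element matches.
$withS = preg_match('%<x-foo>(.+)</x-foo>%s', $html, $m);

// /m changes the anchors: ^ now matches at the start of every line.
$linesWithM    = preg_match_all('/^line/m', "line one\nline two");
$linesWithoutM = preg_match_all('/^line/', "line one\nline two");

var_dump($withM);         // int(0)
var_dump($withS);         // int(1)
var_dump($linesWithM);    // int(2)
var_dump($linesWithoutM); // int(1)
```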

3 Answers

Try this regex:

%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*)>(?P<innerHtml>.*)</x-(?P=open)>%s

Demo

http://regex101.com/r/nA2zO5

Sample code

$str = '...'; // placeholder: the HTML to process
$pattern = '%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*)>(?P<innerHtml>.*)</x-(?P=open)>%s';

// Each pass rewrites one <x-...> element; loop until none are left.
while (preg_match($pattern, $str, $matches)) {
    // Rename the tag to <ns:...> and drop its attributes.
    $newHtml = sprintf('<ns:%1$s>%2$s</ns:%1$s>', $matches['open'], $matches['innerHtml']);
    $str = str_replace($matches[0], $newHtml, $str);
}

echo htmlspecialchars($str);

Output

Initially, $str contained this text:

<x-foo>
    sdfgsdfgsd
       <x-bar>
           sdfgsdfg
       </x-bar>
       <x-baz attr1='5'>
           sdfgsdfg
       </x-baz>
    sdfgsdfgs
</x-foo>

It ends up with:

<ns:foo>
   sdfgsdfgsd
   <ns:bar>
       sdfgsdfg
   </ns:bar>
   <ns:baz>
       sdfgsdfg
   </ns:baz>
   sdfgsdfgs
</ns:foo>

Since I didn't know what work is done on $newHtml, I mimicked it by replacing x- with ns: and removing any attributes.

1 Comment

@CasimiretHippolyte I have updated my answer with your remarks.

Thanks to @Alex I came up with this:

%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*?)>(?P<innerHtml>((?!<x-).)*)</x-(?P=open)>%is

Without the ((?!<x-).)* in the innerHtml pattern it won't work with nested tags: the outer tags would be matched first, which isn't what I wanted. With it, the inner content cannot contain another opening x-tag, so the innermost tags are matched first. Hope this helps.
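
As a rough illustration of that innermost-first behaviour, the sketch below runs the pattern in a loop and records the order in which tags are matched. The sample input, the $order array, and the ns: rewrite are assumptions standing in for the real work, not part of the original answer.

```php
<?php
// Pattern from the answer above: the inner content may not contain "<x-".
$pattern = '%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*?)>(?P<innerHtml>((?!<x-).)*)</x-(?P=open)>%is';

// Hypothetical input with one nested element.
$str = "<x-a>one <x-b id='1'>two</x-b> three</x-a>";

$order = []; // tag names in the order they were matched
while (preg_match($pattern, $str, $m)) {
    $order[] = $m['open'];
    // Placeholder for the real work: rename x-NAME to ns:NAME and drop attributes.
    $newHtml = sprintf('<ns:%1$s>%2$s</ns:%1$s>', $m['open'], $m['innerHtml']);
    $str = str_replace($m[0], $newHtml, $str);
}

echo implode(',', $order), "\n"; // b,a
echo $str, "\n";                 // <ns:a>one <ns:b>two</ns:b> three</ns:a>
```

The loop ends only because each replacement removes an x- tag; if the replacement itself emitted an x- tag, it would need a different stop condition.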

I don't know exactly what kind of changes you are trying to make, but this is how I would proceed:

$pattern = <<<'EOD'
~
    <x-(?<tagName>\w++) (?<attributes>[^>]*+) >
    (?<content>(?>[^<]++|<(?!/?x-))*) # far more efficient than (?:(?!</?x-).)*
    </x-\k<tagName>>                  # \k<name> is a backreference; \g<name> would be a subroutine call
~x
EOD;

function callback($m) { // example callback: rename x-NAME to n-NAME, keep attributes
    return '<n-' . $m['tagName'] . $m['attributes'] . '>' . $m['content']
         . '</n-' . $m['tagName'] . '>';
}

// Repeat until a pass makes no replacement; innermost tags are rewritten first.
do {
    $code = preg_replace_callback($pattern, 'callback', $code, -1, $count);
} while ($count);

echo htmlspecialchars($code);
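
For a quick check of this approach, here is a self-contained sketch on a small nested input. The sample string and the inline closure are assumptions for illustration, and the pattern is written on one line with \k<tagName> as the named backreference.

```php
<?php
// One-line variant of the pattern: content is any run of text that contains
// neither an opening nor a closing x- tag, so only innermost elements match.
$pattern = '~<x-(?<tagName>\w++)(?<attributes>[^>]*+)>(?<content>(?>[^<]++|<(?!/?x-))*)</x-\k<tagName>>~';

// Hypothetical input with one nested element.
$code = "<x-foo>a <x-bar class='c'>b</x-bar> c</x-foo>";

do {
    // Example work: rename x-NAME to n-NAME and keep the attributes as they are.
    $code = preg_replace_callback($pattern, function (array $m): string {
        return '<n-' . $m['tagName'] . $m['attributes'] . '>'
             . $m['content'] . '</n-' . $m['tagName'] . '>';
    }, $code, -1, $count);
} while ($count > 0); // stop once a full pass replaces nothing

echo $code, "\n"; // <n-foo>a <n-bar class='c'>b</n-bar> c</n-foo>
```

Compared with a preg_match and str_replace loop, preg_replace_callback rewrites every innermost element in a single pass, so the number of passes grows with nesting depth rather than with the number of tags.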

1 Comment

@Alex: It is between Yves Saint Laurent and Wordpress.
