2

I have no experience using regular expressions in PHP, so I usually write some convoluted function using a series of str_replace(), substr(), strpos(), strstr() etc (you get the idea).

This time I want to do this correctly. I know I need to use a regex for this, but I am confused as to what to use (ereg or preg) and what exactly the syntax should be.

NOTE: I am NOT parsing HTML or XML, and sometimes I will be using delimiters other than tags (for example, | or ~ or [tag] or ::). I am looking for a generic way to do a wildcard replace between two known delimiters using regex; I am not building an HTML or XML parser.

What I need is a regex that replaces this:

<sometag>everything in here</sometag>

with this:

<sometag>new contents</sometag>

I have read the documentation online for a bit, but I am confused, and am hoping one of you regex experts can pop in a simple solution. I suspect I will pass the values to a function, something like this:

$new_text = swapText ( "<sometag>", $the_new_text_to_go_into_the_dag );

function swapText ( $in_tag_with_brackets_to_update, $in_new_text ) {
 // define tags
 $starting_tag  = $in_tag_with_brackets_to_update;
 $ending_tag    = str_replace( "<", "</", $in_tag_with_brackets_to_update );

 // not sure if this is the proper regex match string or not
 // and/or if any escaping needs to be done on the tags
 $find_string         = "{$starting_tag}.*{$ending_tag}";
 $replace_with_string = "{$starting_tag}{$in_new_text}{$ending_tag}";

 // after some regex, this function should return new version of <tag>data</tag>
}

Thanks.

7
  • 4
    Please use a parser: stackoverflow.com/questions/1732348/… Commented Nov 29, 2009 at 17:03
  • 1
    thanks BalusC, but I am not trying to parse HTML, although I can see how my question may lead you to believe that. Commented Nov 29, 2009 at 17:08
  • Are you making a template engine? Commented Nov 29, 2009 at 17:10
  • Galen - I am simply looking for a way to replace an unknown block of text inside a known set of delimiters (I used tags as an example of one of the many things I will be using as delimiters). Perhaps I should have used a different example for delimiters. Commented Nov 29, 2009 at 17:12
  • Even if your tags are not real HTML tags, a parser would still be a better way to go if they always follow the HTML/XML format. You should be able to find/replace everything within sometag easily. Commented Nov 29, 2009 at 17:14

4 Answers 4

10

You say that you are not going to parse XML and then go on to show an XML example. That's a bit confusing.

Now, the reason you can't use regular expressions to parse XML is that they aren't contextual. There is therefore a whole class of problems that regular expressions can't be used for. This includes nested tags (whether they are XML or not), so keep that in mind.

That out of the way, you should be using preg, not ereg. ereg is a less used, slower and now deprecated flavour of regular expressions (the ereg functions were removed entirely in PHP 7.0). Just forget about it.

In pcre (Perl Compatible Regular Expressions), which is the language that preg uses, a . (dot) is a wildcard, that matches any single character (Except newline). You can put a quantifier after a match. A quantifier can be an explicit range of numbers, such as {1,3} (meaning at least one, but up to 3) or you can use one of the short hand symbols, such as + (Short for {1,}, meaning at least one) or * (Meaning any number, including zero). With this knowledge, you can match anything with .*.
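The quantifiers described above can be tried out with a few small checks. This is only an illustrative sketch: the variable names are made up for the example, and each pattern is anchored with ^ and $ so that the whole subject has to match.

```php
<?php
// preg_match returns 1 on a match and 0 when the pattern does not match.
$dot      = preg_match('~^a.c$~', 'abc');      // . stands in for the single character "b"
$dotNl    = preg_match('~^a.c$~', "a\nc");     // . does not match a newline by default
$range    = preg_match('~^a{1,3}$~', 'aaaa');  // four a's exceed the {1,3} range
$plus     = preg_match('~^ab+$~', 'a');        // + needs at least one "b"
$star     = preg_match('~^ab*$~', 'a');        // * also accepts zero "b"s
$anything = preg_match('~^.*$~', 'any text at all'); // .* accepts any single-line text
```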

By default, expressions will match the largest possible pattern (known as being greedy). You can change this with the ? modifier. Thus .*? will match anything, but take the shortest possible pattern. This can then be used to match any delimited value as follows:

~<foo>.*?</foo>~

Note that I'm using ~ as the delimiter here to avoid having to escape / in the expression. The standard is to use / as delimiter, in which case the expression would have looked like this:

/<foo>.*?<\/foo>/

In general, the above is bad practise, since it's much better to match a negated character class than a dot, but to keep things simple for you, just ignore this until you get the basics under your skin. It'll work in most cases. There is one catch: since the . doesn't match newlines, this won't work if the content contains a newline character. If you need that, you can do one of two things: either add the s modifier to the expression (PCRE_DOTALL, which makes . match newlines too), or replace the . with a character class that includes newlines, for example [\s\S] (meaning a whitespace character or a non-whitespace character, which is the same as anything). This is how the expression would look then:

~<foo>.*?</foo>~s

Or:

~<foo>[\s\S]*?</foo>~
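To see the difference between greedy and lazy matching, and the effect of the newline handling, here is a small sketch. The sample strings and variable names are invented for the illustration.

```php
<?php
$input = "<foo>one</foo> and <foo>two</foo>";

// Greedy: .* runs to the LAST </foo>, so both blocks collapse into one.
$greedy = preg_replace('~<foo>.*</foo>~', '<foo>X</foo>', $input);

// Lazy: .*? stops at the FIRST </foo>, so each block is replaced separately.
$lazy = preg_replace('~<foo>.*?</foo>~', '<foo>X</foo>', $input);

$multiline = "<foo>line one\nline two</foo>";

// Without a modifier, . will not cross the newline, so nothing is replaced.
$noModifier = preg_replace('~<foo>.*?</foo>~', '<foo>X</foo>', $multiline);

// Either the s modifier or [\s\S] lets the match span several lines.
$withS     = preg_replace('~<foo>.*?</foo>~s', '<foo>X</foo>', $multiline);
$withClass = preg_replace('~<foo>[\s\S]*?</foo>~', '<foo>X</foo>', $multiline);
```

Note that preg_replace returns the input unchanged when nothing matches, which is why a missing s modifier tends to fail silently.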

To put all this to work, let's pass it to the preg_replace function:

echo preg_replace('~<foo>.*?</foo>~s', '<foo>Lorem Ipsum</foo>', $input);

If your tag-names are variable, you can build the expression up like you would with an SQL query. Just like SQL, you need to escape certain characters. Use preg_quote for that:

function swapText($tagname, $replacement_text, $input) {
  $tagname_escaped = preg_quote($tagname, '~');
  return preg_replace(
    '~<' . $tagname_escaped . '>.*?</' . $tagname_escaped . '>~s',
    '<' . $tagname . '>' . $replacement_text . '</' . $tagname . '>',
    $input);
}
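Since the question mentions delimiters other than tags, the same idea might be generalised roughly like this. It is a sketch under assumptions: swapBetween and its parameters are hypothetical names, and it uses preg_replace_callback instead of preg_replace so that a replacement text containing something like $1 is inserted literally rather than being read as a backreference.

```php
<?php
// Hypothetical helper (not part of the answer above): replaces whatever sits
// between two literal delimiters, e.g. "[tag]" / "[/tag]" or "::" / "::".
function swapBetween(string $open, string $close, string $replacement, string $input): string
{
    // preg_quote escapes regex metacharacters in the delimiters, plus "~".
    $pattern = '~' . preg_quote($open, '~') . '.*?' . preg_quote($close, '~') . '~s';

    // The callback's return value is inserted verbatim, so "$1" or "\1"
    // inside $replacement is not interpreted as a backreference.
    $result = preg_replace_callback(
        $pattern,
        function (array $match) use ($open, $close, $replacement): string {
            return $open . $replacement . $close;
        },
        $input
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $input;
}
```

For example, swapBetween('[tag]', '[/tag]', 'new', 'a [tag]old[/tag] b') should give 'a [tag]new[/tag] b'. Like the original, it replaces every delimited block in the input, not just the first.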

2 Comments

Note that . matches anything except line breaks. Besides that, excellent answer!
thanks. I think it will do what I need, and based on your excellent explanations, I think I can re-purpose the swapText function to handle other kinds of delimiters I am using throughout my app. Thanks again!
3

@OP, there's no need to use a complicated regex or a parser if your task is very simple. Here is an example just using your normal substring functions:

$mystr='<sometag>everything in here</sometag>';
$start=strpos($mystr,"<sometag>");
$end=strpos($mystr,"</sometag>");
print substr($mystr,0,$start+strlen("<sometag>") ) . "new value" . substr($mystr,$end);
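The snippet above assumes both delimiters are present. Wrapped in a function with those cases guarded, it might look like the following sketch; replaceBetween and its parameter names are hypothetical, and it only touches the first delimited block.

```php
<?php
// Hypothetical wrapper around the same strpos/substr idea.
function replaceBetween(string $haystack, string $open, string $close, string $new): string
{
    $start = strpos($haystack, $open);
    if ($start === false) {
        return $haystack; // opening delimiter not found: leave input untouched
    }
    $contentStart = $start + strlen($open);

    // Look for the closing delimiter only after the opening one.
    $end = strpos($haystack, $close, $contentStart);
    if ($end === false) {
        return $haystack; // no closing delimiter: leave input untouched
    }

    // Keep everything up to and including $open, then the new text,
    // then everything from $close onwards.
    return substr($haystack, 0, $contentStart) . $new . substr($haystack, $end);
}
```

Because no regex is involved, newlines in the content and identical opening and closing delimiters (such as | ... |) need no special handling.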

1 Comment

thanks - thought the regex would work, but yours worked better and also with newline characters which the regex didn't.
1

First, if it is html you are replacing, use something like simple html dom. If the format is exactly what you say (as in, <sometag> can't be <sometag >), then regex may be ok to use.

Don't use ereg based functions, as they are deprecated, use the preg functions.

preg_replace('%(<sometag>)[^<]*(</sometag>)%i', '$1something else$2', $str);

EDIT
A slightly better version of the above, which now supports having a < in the text:

preg_replace('%(<sometag>).*?(</sometag>)%i', '$1something else$2', $str);

The $1 and $2 refer to the text captured by the first and second pairs of parentheses. As these are constant here, they could be replaced with the literal tags:

preg_replace('%<sometag>.*?</sometag>%i', '<sometag>something else</sometag>', $str);
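A short sketch of how the captured groups behave in the replacement string. The sample inputs are invented, and the s modifier is an addition here (not in the patterns above) so that content with line breaks is also matched.

```php
<?php
$pattern = '%(<sometag>).*?(</sometag>)%is';

// "$1" and "$2" re-insert whatever the first and second parentheses captured.
$a = preg_replace($pattern, '$1something else$2', '<sometag>old</sometag>');

// If the new text starts with a digit, "$12" would be read as group 12,
// so the "${1}" form keeps the group number unambiguous.
$b = preg_replace($pattern, '${1}2009$2', '<sometag>old</sometag>');

// With the i modifier the tag may be matched in another case; the captured
// groups preserve the original spelling, which a hard-coded tag would not.
$c = preg_replace($pattern, '$1something else$2', "<SOMETAG>old\ntext</SOMETAG>");
```

This is one reason to keep the groups even when the tag name is fixed: with a case-insensitive pattern they are not strictly constant.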

5 Comments

the ending piece should be </sometag> not <sometag>. Do I need to escape a / with a \ (eg: <\/sometag>). Also what does [^<] do? Is it looking for text that starts with a < ? If so, that is not what I need. Thanks -
Fixed the end tag. [^<] matches all characters that are not '<'. Both examples fit your test data. If its not what you want, you need to explain more clearly what you do want.
This does not work, the slash in </sometag> will be seen as pattern delimiter resulting in a parse error. Either escape it or (better) use different pattern delimiters.
Please clarify the reason for the -1
The -1 was not from me, but your solution will fail if there are line breaks between the opening and closing tag.
0

I've written the following function to replace parts of a string by wildcard:

function wildcardReplace($String,$Search,$Filler,$Wildcard = '???'){

        list($startStr,$endStr) = explode($Wildcard,$Search);

        $start = strpos($String,$startStr);

        // Make sure the end point is the first closest match after the start string.   

        $endofstarter = strpos($String,$startStr) + strlen($startStr);

        $startofender = strpos(
                    substr($String,$endofstarter),
                    $endStr
                ) + $endofstarter;


        $Result = substr($String,0,$start+strlen($startStr) ) . $Filler. substr($String,$startofender);

        // Replace any remaining stuff

        $RemainingString = substr($String,$startofender);

        // If theres any matches left, replace them

        if (strpos($RemainingString, $startStr) !== false) $Result = str_replace($RemainingString, wildcardReplace($RemainingString, $Search, $Filler, $Wildcard), $Result);

        return $Result;
}

Example use: $Output = wildcardReplace('<a href="http://www.youtube.com/watch?v=dQw4w9WgXcQ"><img src="rickroll.png" width="500"></a>','width="???"',350,'???');
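For comparison, the same wildcard idea could be expressed with a regex. This is a rough sketch, not a drop-in replacement: wildcardReplaceRegex is a hypothetical name, it takes the filler as a string, and it assumes the wildcard is a non-empty string.

```php
<?php
// Hypothetical regex counterpart to a "???"-style wildcard replace.
function wildcardReplaceRegex(string $subject, string $search, string $filler, string $wildcard = '???'): string
{
    // Split the search string around the wildcard and escape each literal part.
    $parts = array_map(
        function (string $part): string {
            return preg_quote($part, '~');
        },
        explode($wildcard, $search)
    );

    // Join the literal parts with a lazy "match anything" in between.
    $pattern = '~' . implode('.*?', $parts) . '~s';

    // Every match is replaced by the search string with the wildcard filled in.
    $result = preg_replace_callback(
        $pattern,
        function (array $match) use ($search, $wildcard, $filler): string {
            return str_replace($wildcard, $filler, $search);
        },
        $subject
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $subject;
}
```

Called as wildcardReplaceRegex('<img src="rickroll.png" width="500">', 'width="???"', '350'), it should turn the width attribute into width="350", and it handles repeated occurrences without recursion.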
