Regexp with '&' char using preg_replace

Question

I am trying to parse URLs containing & with preg_replace.

$content = preg_replace('#https?://[a-z0-9._/\?=&-]+#i', '<a href="$0" target="_blank">$0</a>', $content);

But I use it for user comments, so I'm also using the htmlspecialchars() function to prevent XSS.

function formatContributionContent($content)
{
    $content = nl2br(htmlspecialchars($content));

    // Regexp for mails
    $content = preg_replace('#[a-z0-9._-]+@[a-z0-9._&-]{2,}\.[a-z]{2,4}#', '<a href="mailto:$0">$0</a>', $content);

    // Regexp for urls
    $content = preg_replace('#https?://[a-z0-9._/\?=&-]+#i', '<a href="$0" target="_blank">$0</a>', $content);

    var_dump($content);
}

formatContributionContent('https://openclassrooms.com/index.php?page=3&skin=blue');

But htmlspecialchars() transforms & into "&amp;", so my regexp produces a wrong result. For example, with the following URL:

https://openclassrooms.com/index.php?page=3&skin=blue

I obtain:

<a href="https://openclassrooms.com/index.php?page=3&amp" target="_blank">https://openclassrooms.com/index.php?page=3&amp</a>;skin=blue

You cannot expect your regular expression to somehow magically heal content you modified before handing it over. Instead you would first have to make your replacements and then maybe use the htmlspecialchars() method to output the result. But probably you would have to apply it to the separate parts of that URL, not to the whole URL, since it would obviously turn the URL into its readable notation instead of rendering it in a usable way. So your whole approach won't work. You'd have to split that URL first and handle the tokens separately.
– arkascha, Commented Sep 6, 2015 at 7:55
Assuming that you don't want any HTML tag from user input to be rendered, you need to use a regex to pick out the positions of the emails and links, then use them to tokenize the input. Whatever is not an email or link is entity-escaped, emails and links are put into anchors, and then everything is merged back together. (What I say here is an elaboration on what arkascha said.)
– nhahtdh, Commented Sep 7, 2015 at 4:17
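
The tokenize-then-escape idea from the two comments above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name formatContributionContentTokenized is hypothetical, the two patterns are copied from the question and merged into one alternation with the i flag applied to both, and no real URL or email validation is attempted.

```php
<?php
// Hypothetical helper (not from the original post): split the raw input into
// plain-text pieces and link/email tokens, then escape each piece on its own.
function formatContributionContentTokenized(string $content): string
{
    // Single capturing group, so preg_split() with PREG_SPLIT_DELIM_CAPTURE
    // returns: text, match, text, match, ..., text.
    $pattern = '#(https?://[a-z0-9._/\?=&-]+|[a-z0-9._-]+@[a-z0-9._&-]{2,}\.[a-z]{2,4})#i';
    $tokens = preg_split($pattern, $content, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($tokens as $i => $token) {
        // Escaping happens after matching, so the regex never sees "&amp;".
        $escaped = htmlspecialchars($token, ENT_QUOTES, 'UTF-8');

        if ($i % 2 === 0) {
            // Even indexes hold the plain text between matches.
            $out .= nl2br($escaped);
        } elseif (preg_match('#^https?://#i', $token)) {
            $out .= '<a href="' . $escaped . '" target="_blank">' . $escaped . '</a>';
        } else {
            // Anything else captured by the group is the email alternative.
            $out .= '<a href="mailto:' . $escaped . '">' . $escaped . '</a>';
        }
    }

    return $out;
}

echo formatContributionContentTokenized('https://openclassrooms.com/index.php?page=3&skin=blue'), "\n";
```

Because the regex runs on the raw input, the & in the query string is matched as part of the URL and only turned into &amp; when the href is written out.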

scandel · Accepted Answer · 2015-09-06 08:24:58Z

You can add ";" to the list of characters matched by your regexp, like this:

$content = preg_replace('#https?://[a-z0-9._/\?=&;-]+#i', '<a href="$0" target="_blank">$0</a>', $content);

This way, "&" characters are still transformed into "&amp;" by htmlspecialchars, but your regexp can find the whole URL.
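
As a quick check of the difference, the following sketch runs both character classes against the escaped URL from the question. The variable names are only illustrative.

```php
<?php
// The URL from the question, escaped the same way the function does it.
$escaped = htmlspecialchars('https://openclassrooms.com/index.php?page=3&skin=blue');

$withoutSemicolon = '#https?://[a-z0-9._/\?=&-]+#i';  // pattern from the question
$withSemicolon    = '#https?://[a-z0-9._/\?=&;-]+#i'; // pattern with ";" added

preg_match($withoutSemicolon, $escaped, $m1);
preg_match($withSemicolon, $escaped, $m2);

echo $m1[0], "\n"; // stops right before the ";" of "&amp;"
echo $m2[0], "\n"; // covers the whole escaped URL
```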


5 Comments

Pedro Pinheiro Over a year ago

Although this solves OP's problem, it creates another one: with this regex, a URL such as https://;.com is valid.

Maluna34 Over a year ago

Thanks! :) It looks so ridiculous. ^^"

scandel Over a year ago

@Pedro Pinheiro: That's true, but the original regex didn't validate the URL either, and a URL like https://.com was already valid.

Maluna34 Over a year ago

For the moment I wasn't trying to build a URL validator, just to use preg_replace together with htmlspecialchars.

nhahtdh Over a year ago

This is a terrible solution. You are processing an HTML-escaped string with a regex that is not aware of HTML-escaped syntax.
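
One way to see the concern in this last comment: once ";" is in the character class, any entity that directly follows a URL is pulled into the link. The input below is a made-up example, not taken from the thread.

```php
<?php
// Hypothetical input: a URL the commenter wrapped in angle brackets.
$raw = '<https://example.com/page>';
$escaped = htmlspecialchars($raw); // "&lt;https://example.com/page&gt;"

// Pattern from the answer, with ";" added to the character class.
$pattern = '#https?://[a-z0-9._/\?=&;-]+#i';
preg_match($pattern, $escaped, $m);

// The match runs past the real URL and swallows the trailing "&gt;" entity.
echo $m[0], "\n";

// Decoding the entities shows the target a browser would actually follow.
echo htmlspecialchars_decode($m[0]), "\n";
```

The regex cannot tell an & that belongs to the query string from one that starts an escaped character, which is why matching on the raw text and escaping afterwards is the more robust order.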
