I am trying to parse URLs containing & with preg_replace.

$content = preg_replace('#https?://[a-z0-9._/\?=&-]+#i', '<a href="$0" target="_blank">$0</a>', $content);

But I use it for user comments, so I also apply htmlspecialchars() to prevent XSS.

function formatContributionContent($content)
{
    $content = nl2br(htmlspecialchars($content));

    // Regexp for mails
    $content = preg_replace('#[a-z0-9._-]+@[a-z0-9._&-]{2,}\.[a-z]{2,4}#', '<a href="mailto:$0">$0</a>', $content);

    // Regexp for urls
    $content = preg_replace('#https?://[a-z0-9._/\?=&-]+#i', '<a href="$0" target="_blank">$0</a>', $content);

    var_dump($content);
}

formatContributionContent('https://openclassrooms.com/index.php?page=3&skin=blue');

However, htmlspecialchars() transforms & into "&amp;", so my regexp produces a wrong result. For example, with the following URL:

https://openclassrooms.com/index.php?page=3&skin=blue

I get:

<a href="https://openclassrooms.com/index.php?page=3&amp" target="_blank">https://openclassrooms.com/index.php?page=3&amp</a>;skin=blue
  • You cannot expect your regular expression to magically repair content you modified before handing it over. Instead, make your replacements first and only then apply htmlspecialchars() when producing the output. You would probably have to apply it to the separate parts of that URL, not to the whole URL, since that would turn the URL into its readable notation instead of rendering it in a usable way. So your whole approach won't work: you have to split that URL first and handle the tokens separately. Commented Sep 6, 2015 at 7:55
  • I would like to transform urls into links in user comments. Commented Sep 6, 2015 at 8:44
  • Assuming that you don't want any HTML tag from user input to be rendered, you need to use a regex to pick out the positions of the emails and links, then use those positions to tokenize the input. Whatever is not an email or link gets entity-escaped, emails and links are put into anchors, and then everything is merged back together. (This elaborates on what arkascha said.) Commented Sep 7, 2015 at 4:17
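
The tokenizing approach described in the comments above could be sketched roughly as follows. This is only a minimal sketch, not code from the thread: the function name linkifyComment() is hypothetical, the two patterns are copied from the question and merged into one alternation, and preg_split() with PREG_SPLIT_DELIM_CAPTURE is one possible way to do the tokenizing.

```php
<?php
// Hypothetical helper (not from the original post): find links first,
// then escape every token separately.
function linkifyComment(string $content): string
{
    // The question's URL and mail patterns, merged into ONE capture group.
    $pattern = '#(https?://[a-z0-9._/\?=&-]+|[a-z0-9._-]+@[a-z0-9._&-]{2,}\.[a-z]{2,4})#i';

    // With a single capture group, even indexes hold the plain text
    // between matches and odd indexes hold the matched URLs/mails.
    $tokens = preg_split($pattern, $content, -1, PREG_SPLIT_DELIM_CAPTURE);

    $html = '';
    foreach ($tokens as $i => $token) {
        // Escape each token on its own, after it has been classified.
        $safe = htmlspecialchars($token, ENT_QUOTES, 'UTF-8');

        if ($i % 2 === 0) {
            // Plain text: escaped, with line breaks preserved.
            $html .= nl2br($safe);
        } elseif (strpos($token, '@') !== false) {
            // The URL pattern cannot match "@", so this is a mail address.
            $html .= '<a href="mailto:' . $safe . '">' . $safe . '</a>';
        } else {
            $html .= '<a href="' . $safe . '" target="_blank">' . $safe . '</a>';
        }
    }

    return $html;
}

echo linkifyComment('https://openclassrooms.com/index.php?page=3&skin=blue'), PHP_EOL;
```

Because the regex runs on the raw input, it never sees "&amp;", and the "&" inside the URL is escaped only when the href and the link text are written out.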

1 Answer

You can add ";" to the list of characters matched by your regexp, like this:

$content = preg_replace('#https?://[a-z0-9._/\?=&;-]+#i', '<a href="$0" target="_blank">$0</a>', $content);

This way, "&" characters are still transformed into "&amp;" by htmlspecialchars, but your regexp matches the whole URL.
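
As a quick check, here is a small sketch that runs the pattern with ";" on the URL from the question, escaped first as in formatContributionContent(). The variable names $escaped and $linked are only illustrative.

```php
<?php
// The question's input, escaped before the regex runs (as in the question).
$escaped = htmlspecialchars('https://openclassrooms.com/index.php?page=3&skin=blue');

// Same pattern as in the answer: ";" is now part of the character class,
// so the match no longer stops in the middle of "&amp;".
$linked = preg_replace(
    '#https?://[a-z0-9._/\?=&;-]+#i',
    '<a href="$0" target="_blank">$0</a>',
    $escaped
);

echo $linked, PHP_EOL;
// <a href="https://openclassrooms.com/index.php?page=3&amp;skin=blue" target="_blank">https://openclassrooms.com/index.php?page=3&amp;skin=blue</a>
```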

5 Comments

Although this solves the OP's problem, it creates another one: with this regex, a URL such as https://;.com is considered valid.
Thanks! :) It looks so ridiculous. ^^"
@Pedro Pinheiro: That's true, but the original regex didn't validate the URL either, and a URL like https://.com was already valid.
For the moment I wasn't trying to write a URL validator, just to use preg_replace together with htmlspecialchars.
This is a terrible solution. You are processing an HTML-escaped string with a regex that is not aware of HTML-escaped syntax.
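
To illustrate the last comment, here is a small sketch of one case where the pattern with ";" misreads escaped text. The sample sentence and the example.com URL are assumptions made up for this illustration.

```php
<?php
// The answer's pattern, applied to text that was escaped beforehand.
$pattern = '#https?://[a-z0-9._/\?=&;-]+#i';

// A URL wrapped in double quotes; htmlspecialchars() turns each " into &quot;
$escaped = htmlspecialchars('Visit "http://example.com" today', ENT_QUOTES, 'UTF-8');
// $escaped is now: Visit &quot;http://example.com&quot; today

preg_match($pattern, $escaped, $m);

// "&", the letters of "quot" and ";" are all in the character class,
// so the closing entity is swallowed into the match.
echo $m[0], PHP_EOL;
// http://example.com&quot;
```

The resulting link would point to http://example.com" instead of http://example.com, because the regex cannot tell an escaped quote from part of the URL.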
