PHP - Parsing URL's in a message while ignoring all HTML Tags

Question

I am trying to process messages in a small, private ticketing system so that URLs are automatically converted into clickable links without breaking any HTML that may be posted. Until now, the function that parses URLs has worked well; however, one or two users of the system want to be able to post embedded images rather than attachments.

This is the existing code that converts strings into clickable URLs. Please note that I have limited knowledge of regex and have relied on some assistance from others to build this:

    $text = preg_replace(
     array(
       '/(^|\s|>)(www.[^<> \n\r]+)/iex',
       '/(^|\s|>)([_A-Za-z0-9-]+(\\.[A-Za-z]{2,3})?\\.[A-Za-z]{2,4}\\/[^<> \n\r]+)/iex',
       '/(?(?=<a[^>]*>.+<\/a>)(?:<a[^>]*>.+<\/a>)|([^="\']?)((?:https?):\/\/([^<> \n\r]+)))/iex'
     ),  
     array(
       "stripslashes((strlen('\\2')>0?'\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>&nbsp;\\3':'\\0'))",
       "stripslashes((strlen('\\2')>0?'\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>&nbsp;\\4':'\\0'))",
       "stripslashes((strlen('\\2')>0?'\\1<a href=\"\\2\" target=\"_blank\">\\3</a>&nbsp;':'\\0'))",
     ), $text);

    return $text;

How would I go about modifying an existing function, such as the one above, to exclude matches that sit inside HTML tags such as <img> without hurting its existing functionality?

Example:

`<img src="https://example.com/image.jpg">`

turns into

`<img src="<a href="https://example.com/image.jpg" target="_blank">example.com/image.jpg</a>">`

I have done some searching before posting, the most popular hits I am turning up are;

The common trend is "This is the wrong way to do it", which is obviously true. However, while I agree, I also want to keep the function quite light. The system is used privately within the organisation and we only wish to process img tags and URLs automatically using this. Everything else is left plain: no lists, code tags, quotes, etc.

I greatly appreciate your assistance here.

Summary: How do I modify an existing set of regular expression rules to exclude matches found within an img or other HTML tag in a block of text?

If we are going to break good advice... Do we get to see a few sample inputs that might cause trouble? Let's see some input data and some expected output.
– mickmackusa ♦, Commented Aug 21, 2017 at 1:46
Why not use a lib? github.com/jmrware/LinkifyURL/blob/master/linkify.php
– Lawrence Cherone, Commented Aug 21, 2017 at 1:56
@LawrenceCherone The OP wants to keep the process "light" and avoid libraries and other such reliable things.
– mickmackusa ♦, Commented Aug 21, 2017 at 1:58

mickmackusa · Accepted Answer · 2017-08-21 04:46:49Z

Judging by the e modifier in your patterns, you are on an older PHP version: that modifier was deprecated in PHP 5.5 and removed in PHP 7.0. preg_replace_callback() has been available since PHP 4.0.5, so it should work on your installation and is the replacement to use.
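To make the switch concrete, here is a minimal sketch of how the first rule from the question (bare www. addresses) might look with preg_replace_callback() instead of the e modifier. The function name linkifyWww is made up for illustration, the htmlspecialchars() call is my own addition, and the closure syntax assumes PHP 5.3 or newer; on anything older, pass the name of a regular function instead.

```php
<?php
// Hypothetical helper (not from the question): linkify bare "www." addresses
// without relying on the e modifier.
function linkifyWww($text)
{
    return preg_replace_callback(
        '/(^|\s|>)(www\.[^<> \n\r]+)/i',
        function ($m) {
            // $m[1] is the boundary before the address, $m[2] the address itself
            $url = htmlspecialchars($m[2], ENT_QUOTES);
            return $m[1] . '<a href="http://' . $url . '" target="_blank">' . $url . '</a>';
        },
        $text
    );
}

echo linkifyWww('Visit www.example.com today');
// Visit <a href="http://www.example.com" target="_blank">www.example.com</a> today
```

The callback receives the match array directly, so there is no need for the stripslashes() and string-eval gymnastics of the original replacement strings.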

While I would not like to be roped into a big back-and-forth with a multitude of answer edits, I would like to give you some traction.

The method that follows is certainly not something I would stake my career on. As stated in comments under the question and in many, many pages on SO, HTML should not be parsed by regex. (disclaimer complete)

PHP5.4.34 Demo Link & Regex Pattern Demo Link

    $text='This has an img tag <img src="https://example.com/image.jpg"> that should be igrnored.
    This is an img that needs to become a tag: https://example.com/image.jpg.
    This is a <a href="https://www.example.com/image" target="_blank">tagged link</a> with target.
    This is a <a href="https://example.com/image?what=something&when=something">tagged link</a> without target.
    This is an untagged url http://example.com/image.jpg.
    (Please extend this battery of test cases to isolate any monkeywrenching cases)
    Another short url example.com/
    Another short url example.com/index.php?a=b&c=d
    Another www.example.com';
    // Skip complete <a ...> and <img ...> tags, then match anything that looks like a URL
    $pattern='~<(?:a|img)[^>]+?>(*SKIP)(*FAIL)|(((?:https?:)?(?:/{2})?)(w{3})?\S+(\.\S+)+\b(?:[?#&/]\S*)*)~';
    function taggify($m){
        // $m[4] is the final repetition of (\.\S+), e.g. ".jpg" or ".com", so the dot must be part of the check
        if(preg_match('/^\.(?:bmp|gif|png|jpe?g)\b/i',$m[4])){  // add more filetypes as needed
            return "<img src=\"{$m[0]}\">";
        }else{
            //var_export(parse_url($m[0]));  // if you need to do preparations, consider using parse_url()
            return "<a href=\"{$m[0]}\" target=\"_blank\">{$m[0]}</a>";
        }
    }
    $text=preg_replace_callback($pattern,'taggify',$text);
    echo $text;

Output:

    This has an img tag <img src="https://example.com/image.jpg"> that should be igrnored.
    This is an img that needs to become a tag: <img src="https://example.com/image.jpg">.
    This is a <a href="https://www.example.com/image" target="_blank">tagged link</a> with target.
    This is a <a href="https://example.com/image?what=something&when=something">tagged link</a> without target.
    This is an untagged url <img src="http://example.com/image.jpg">.
    (Please extend this battery of test cases to isolate any monkeywrenching cases)
    Another short url <a href="example.com/" target="_blank">example.com/</a>
    Another short url <a href="example.com/index.php?a=b&c=d" target="_blank">example.com/index.php?a=b&c=d</a>
    Another <a href="www.example.com" target="_blank">www.example.com</a>
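Note that the last three links in the output have href values without a scheme (for example example.com/), which a browser will treat as a path relative to the current page. Following the parse_url() hint in the callback, one possible preparation step is sketched below. The helper name ensureScheme and the choice of http:// as the default are assumptions of mine, not part of the code above.

```php
<?php
// Hypothetical helper: turn a matched URL into an absolute href.
function ensureScheme($url)
{
    // Protocol-relative URLs ("//example.com/x") are already absolute for a browser
    if (strpos($url, '//') === 0) {
        return $url;
    }
    $scheme = parse_url($url, PHP_URL_SCHEME);
    // No scheme found: prepend an assumed default so the link is not treated as relative
    return $scheme ? $url : 'http://' . $url;
}

echo ensureScheme('example.com/'), "\n";                   // http://example.com/
echo ensureScheme('www.example.com'), "\n";                // http://www.example.com
echo ensureScheme('https://example.com/image.jpg'), "\n";  // https://example.com/image.jpg
```

Inside taggify() this would be applied to the href (and src) value only, leaving the visible link text as the user typed it.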

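The (*SKIP)(*FAIL) technique works to "disqualify" unwanted matches: whatever the part of the pattern before (*SKIP)(*FAIL) matches (here, any complete <a ...> or <img ...> tag) is consumed and then discarded. The qualifying matches are expressed by the section of the pattern that follows the pipe (|) after (*SKIP)(*FAIL).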
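If the verbs are new to you, a stripped-down sketch may help. It has nothing to do with URLs: it matches runs of digits but skips any digits sitting inside an HTML tag. The sample string is made up for illustration.

```php
<?php
// Left of the pipe: what to throw away (any complete tag).
// Right of the pipe: what to keep (runs of digits).
$pattern = '~<[^>]*>(*SKIP)(*FAIL)|\d+~';
$input   = 'Order 42 <img width="300"> shipped in 7 days';

preg_match_all($pattern, $input, $matches);
print_r($matches[0]);  // contains "42" and "7"; the 300 inside the tag is skipped
```

When the tag is matched, (*FAIL) rejects it and (*SKIP) tells the engine to resume after the tag rather than retrying from inside it, which is why the 300 is never offered to the right-hand alternative.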

Thank you for this, it also gives me a better understanding of how it is working =)
