
I have a simple message board, say mywebsite.com, that allows users to post messages. Currently the board makes all links clickable, i.e. when someone posts something that starts with:

http://, https://, www., http://www., https://www.

then the script automatically turns it into a link (i.e. wraps it in an <a href> tag).

THE PROBLEM: there is too much spam. My idea is to automatically strip the http(s)/www prefixes listed above so that these don't become clickable links. HOWEVER, I want to allow posters to link to pages within my site, i.e. the prefix should not be removed when the link points to mywebsite.com.

My idea was to create two arrays:

$removeParts = array('http://', 'https://', 'www.', 'http://www.', 'https://www.');

$keepParts = array('http://mywebsite.com', 'http://www.mywebsite.com', 'www.mywebsite.com', 'https://www.mywebsite.com', 'https://mywebsite.com');

but I don't know how to use them correctly (probably str_replace could work somehow).

Below is an example of $message before and after processing:

$message BEFORE:

Hello world, thanks to http://mywebsite.com/about I learned a lot. I found you on http://www.bing.com, https://google.com/search and on some www.spamwebsite.com/refid=spammer2.

$message AFTER:

Hello world, thanks to http://mywebsite.com/about I learned a lot. I found you on bing.com, google.com/search and on some spamwebsite.com/refid=spammer2.


Please note that the user enters clear text into the post form, so the script should only work with this clear text (not with <a href> tags etc.).

3
  • Check out this post: stackoverflow.com/questions/9364242/… Commented Apr 24, 2015 at 23:29
  • Yes, I know how to parse domain from URL, but here a message may contain both regular text and link/s... not just a link. Commented Apr 24, 2015 at 23:32
  • Note: the accepted answer on that link provides an answer to that question as well. Commented Apr 24, 2015 at 23:33

4 Answers

1
$url = "http://mywebsite/about";
$parse = parse_url($url);

if($parse["host"] == "mywebsite")
    echo "My site, let's mark it as link";

More info: http://php.net/manual/en/function.parse-url.php
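To apply the same host check to a whole message rather than a single URL, one option is to combine parse_url() with preg_replace_callback(). The following is only a sketch under assumptions: the function name stripForeignLinkPrefixes(), the $allowedHosts parameter and the deliberately simple token pattern are made up for illustration and are not part of the answer above.

```php
<?php
// Hypothetical helper: removes "http://", "https://" and "www." from every
// URL-like token whose host is not listed in $allowedHosts.
function stripForeignLinkPrefixes(string $message, array $allowedHosts): string
{
    // Simplified token pattern (an assumption): a scheme or "www." followed by
    // non-whitespace characters, excluding trailing sentence punctuation.
    $pattern = '~\b(?:https?://|www\.)\S+?(?=[.,;:!?)]*(?:\s|$))~i';

    return preg_replace_callback($pattern, function (array $m) use ($allowedHosts) {
        $token = $m[0];
        // parse_url() only detects the host when a scheme is present,
        // so add one for bare "www." tokens.
        $withScheme = preg_match('~^https?://~i', $token) ? $token : 'http://' . $token;
        $host = strtolower((string) parse_url($withScheme, PHP_URL_HOST));
        $host = preg_replace('~^www\.~', '', $host);

        if (in_array($host, $allowedHosts, true)) {
            return $token; // own site: keep the prefix so the link stays clickable
        }
        // Foreign site: drop the scheme and a leading "www."
        return preg_replace('~^(?:https?://)?(?:www\.)?~i', '', $token);
    }, $message);
}

$message = 'Thanks to http://mywebsite.com/about I found http://www.bing.com, '
         . 'https://google.com/search and www.spamwebsite.com/refid=spammer2.';
echo stripForeignLinkPrefixes($message, ['mywebsite.com']);
// Thanks to http://mywebsite.com/about I found bing.com, google.com/search and spamwebsite.com/refid=spammer2.
```

Comparing the parsed host exactly, instead of searching the message for the string "mywebsite.com", means a host such as mywebsite.com.evil.example is still treated as foreign.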


1

killSpam() function features:

  • Works with single and double quotes.
  • Handles invalid HTML.
  • ftp://
  • http://
  • https://
  • file://
  • mailto:

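function killSpam($html, $whitelist){

    // Process HTML links: $match[1] is the whole <a> element,
    // $match[2] the href value and $match[3] the link text.
    preg_match_all('%(<(?:\s+)?a.*?href=["\'](.*?)["\'].*?>(.*?)<(?:\s+)?/(?:\s+)?a(?:\s+)?>)%sm', $html, $match, PREG_PATTERN_ORDER);
    for ($i = 0; $i < count($match[1]); $i++) {
        // Check only the href, so a whitelisted domain in the link text
        // cannot rescue a link that points somewhere else.
        if (!preg_match("/$whitelist/", $match[2][$i])) {
            // Literal replacement, so no regex quoting is needed.
            $html = str_replace($match[1][$i], " (SPAM) ", $html);
        }
    }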

//process cleartext links
preg_match_all('/(\b(?:(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[A-Z0-9+&@#\/%?=~_|$!:,.;-]*[A-Z0-9+&@#\/%=~_|$-]|((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,6})\b)|"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[^"\r\n]+"|\'(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[^\'\r\n]+\')/i', $html, $match2, PREG_PATTERN_ORDER);

    for ($i = 0; $i < count($match2[1]); $i++) {
        if (!preg_match("/$whitelist/", $match2[1][$i])) {
            $spamsite = $match2[1][$i];
            // Literal replacement, so no regex quoting is needed.
            $html = str_replace($spamsite, " (SPAM) ", $html);
        }
    }

    return $html;

}

Usage:

$html = <<< LOB
 <p>Hello world, thanks to <a href="http://mywebsite.com/about" rel="nofollow">http://mywebsite/about</a> I learned a lot. I found
  you on <a href="http://www.bing.com" rel="nofollow">http://www.bing.com</a>, <a href="https://google.com/search" rel="nofollow">https://google.com/search</a> and on some <a href="http://www.spamwebsite.com" rel="nofollow">www.spamwebsite.com/refid=spammer2< /a >. www.spamme.com, http://morespam.com/?aff=122, http://crazyspammer.com/?money=22 and [email protected], file://spamfile.com/file.txt ftp://spamftp.com/file.exe </p>
LOB;

$whitelist = "(google\.com|yahoo\.com|bing\.com|nicesite\.com|mywebsite\.com)";

$noSpam = killSpam($html, $whitelist);

echo $noSpam;

Spam Example:

I cannot post the spam HTML here (I guess SO has its own killSpam()); view it at http://pastebin.com/HXCkFeGn

Hello world, thanks to http://mywebsite/about I learned a lot. I found you on http://www.bing.com, https://google.com/search and on some www.spamwebsite.com/refid=spammer2. www.spamme.com, http://morespam.com/?aff=122, http://crazyspammer.com/?money=22 and [email protected], file://spamfile.com/file.txt ftp://spamftp.com/file.exe


Output:

Hello world, thanks to (SPAM) I learned a lot. I found you on http://www.bing.com, https://google.com/search and on some (SPAM) . (SPAM) , (SPAM) , (SPAM) and (SPAM) , (SPAM) (SPAM)


Demo:

http://ideone.com/9IxFrB

7 Comments

  • Thanks, but please note that the input is clear text, i.e. the user doesn't enter a href etc., so in your example the initial $html is: $html='Hello world, thanks to mywebsite/about I learned a lot. I found you on bing.com, google.com/search and on some www.spamwebsite.com/refid=spammer2.'; Would it work with this too?
  • You need to create a whitelist. I'll update the code.
  • I think it's better in reverse, i.e. my $removeParts / $keepParts could be considered the whitelisted sites; that should be easier, I hope.
  • Updated 2: matches incorrect HTML, i.e.: < \ a >, < a
  • OK, but the input does not contain a href tags (your code works with a href, rel=nofollow etc.). The user enters clear text, and the http/https/www parts should be removed from this clear text only.
0

If you want to preserve the text of links but make them "not clickable", you may try this code:

<?php

$text = <<<__text
   Hello world, thanks to http://mywebsite/about I learned a lot.
   I found you on http://www.bing.com, https://google.com/search and on some www.spamwebsite.com/refid=spammer2.
   www.spamme.com, http://morespam.com/?aff=122, http://crazyspammer.com/?money=22 and [email protected], file://spamfile.com/file.txt ftp://spamftp.com/file.exe
__text;
$allowed_domains = ['mywebsite.com'];

$pattern = "/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+@)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+@)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-_]*)?\??(?:[\-\+=&;%@\.\w_]*)#?(?:[\.\!\/\\\w]*))?)/";
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    list(, $url, $scheme_and_domain, $scheme, $path) = $m;
    // Strip the scheme and a leading "www." (dot escaped so it only matches a literal dot).
    $domain = preg_replace(['/^' . preg_quote($scheme, '/') . '/i', '/^www\./i'], '', $scheme_and_domain);

    if (in_array($domain, $allowed_domains)) continue;

    $url_prepared = rtrim("$domain$path", '/');
    $text = str_replace($url, $url_prepared, $text);
}

echo $text;

Codepad


0

For anyone looking for an answer - I posted a related (more specific) question which solved the problem: PHP - remove words (http|https|www|.com|.net) from string that do not start with specific words
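For readers who cannot follow the link, here is one way a single-regex approach could look. It is a sketch based on a negative lookahead and is not necessarily the pattern from the linked question; the function name stripPrefixesExceptOwn() and the $ownDomain parameter are hypothetical.

```php
<?php
// Hypothetical helper: removes http://, https:// and www. prefixes unless
// they introduce a link to $ownDomain.
function stripPrefixesExceptOwn(string $message, string $ownDomain): string
{
    $own = preg_quote($ownDomain, '~');
    $pattern = '~\b'
        // Skip this position if it starts a link to the own domain. The inner
        // guard rejects look-alike hosts such as mywebsite.com.evil.example.
        . '(?!(?:https?://)?(?:www\.)?' . $own . '(?![\w-]|\.\w))'
        // Otherwise consume the prefix so that it gets removed.
        . '(?:https?://(?:www\.)?|www\.)~i';

    return preg_replace($pattern, '', $message);
}

echo stripPrefixesExceptOwn(
    'See https://www.mywebsite.com/faq, http://morespam.com/?aff=122 and www.spamme.com.',
    'mywebsite.com'
);
// See https://www.mywebsite.com/faq, morespam.com/?aff=122 and spamme.com.
```

The lookahead is placed before the prefix on purpose: if it came after the prefix, the regex could backtrack and strip "http://" from "http://www.mywebsite.com".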
