I have some simple code that uses preg_match():

$bad_words = array('dic', 'tit', 'fuc',); //for this example i replaced the bad words

for($i = 0; $i < sizeof($bad_words); $i++)
{
    if(preg_match("/$bad_words[$i]/", $str, $matches))
    {
        $rep = str_pad('', strlen($bad_words[$i]), '*');
        $str = str_replace($bad_words[$i], $rep, $str);
    }
}
echo $str;

So, if $str was "dic", the result would be '***' (one asterisk per letter), and so on.

Now there is a small problem if $str is "f.u.c". A solution might be to use:

$pattern = '~f(.*)u(.*)c(.*)~i';
$replacement = '***';
$foo =  preg_replace($pattern, $replacement, $str);

In this case I will always get ***. My problem is putting all of this code together.

I've tried:

$pattern = '~f(.*)u(.*)c(.*)~i';
$replacement = 'fuc';
$fuc =  preg_replace($pattern, $replacement, $str);

$bad_words = array('dic', 'tit', $fuc,); 

for($i = 0; $i < sizeof($bad_words); $i++)
{
    if(preg_match("/$bad_words[$i]/", $str, $matches))
    {
        $rep = str_pad('', strlen($bad_words[$i]), '*');
        $str = str_replace($bad_words[$i], $rep, $str);
    }
}
echo $str;

The idea is that $fuc becomes fuc, then I place it in the array and the loop does its job, but this doesn't seem to work.

1 Answer

First of all, you can do all of the bad word replacements with one (dynamically generated) regex, like this:

$bad_words = array('dic', 'tit', 'fuc',);

$str = preg_replace_callback("/\b(?:" . implode('|', $bad_words) . ")\b/",
    function ($match) {
        // Replace the whole match with the same number of asterisks.
        return str_repeat('*', strlen($match[0]));
    }, $str);
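
To see what the \b word boundaries in that pattern buy you, here is a minimal sketch that runs the same callback with and without them. The sample sentence and the variable names $mask, $without and $with are made up for illustration.

```php
<?php
$bad_words = array('dic', 'tit', 'fuc');
$str = 'i love dictionaries with titles on the top of the page, also shop at fuccillo hyundai!';

// Mask a match with one asterisk per character.
$mask = function ($match) {
    return str_repeat('*', strlen($match[0]));
};

// Without \b the pattern also matches inside longer, innocent words.
$without = preg_replace_callback('/(?:' . implode('|', $bad_words) . ')/', $mask, $str);

// With \b on both sides only standalone words are masked, so this sentence is untouched.
$with = preg_replace_callback('/\b(?:' . implode('|', $bad_words) . ')\b/', $mask, $str);
```

Here $without mangles "dictionaries", "titles" and "fuccillo", while $with comes back identical to $str.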

Now you have the problem of people putting periods between the letters of a word. You can search for those with another regex and replace them as well. Keep in mind, however, that . matches any character in a regex, so a literal period must be escaped (with preg_quote() or a backslash).
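
As a quick illustration of the escaping point, the sketch below compares an unescaped pattern with one passed through preg_quote(). The test strings and variable names are invented for the example.

```php
<?php
// An unescaped "." matches any single character (except a newline by default).
$unescaped = preg_match('/f.u.c/', 'fxuyc');              // "x" and "y" satisfy the dots

// preg_quote() escapes regex metacharacters, so "." becomes "\.".
$quoted = preg_quote('f.u.c', '/');

// With the escaped pattern only literal periods are accepted.
$escapedMiss = preg_match('/' . $quoted . '/', 'fxuyc');
$escapedHit  = preg_match('/' . $quoted . '/', 'f.u.c');
```

The unescaped pattern matches 'fxuyc', which is probably not what you want. The quoted one only matches the string with real periods.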

$bad_words = array_map( function( $el) { 
    return implode( '\.', str_split( $el));
}, $bad_words);

This will create a $bad_words array similar to:

array(
    'd\.i\.c',
    't\.i\.t',
    'f\.u\.c'
)

You can now use this new $bad_words array just like the one above to replace the obfuscated words.
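
Put together, that step might look like the sketch below. The input sentence and the names $dotted and $clean are assumptions for the example; the callback is the same one used for the plain words.

```php
<?php
$bad_words = array('dic', 'tit', 'fuc');

// Insert an escaped period between the letters: 'fuc' becomes 'f\.u\.c'.
$dotted = array_map(function ($el) {
    return implode('\.', str_split($el));
}, $bad_words);

// Same callback as for the plain words, now driven by the dotted patterns.
$str = 'what the f.u.c is this';
$clean = preg_replace_callback(
    '/\b(?:' . implode('|', $dotted) . ')\b/',
    function ($match) {
        return str_repeat('*', strlen($match[0]));
    },
    $str
);
// $clean is 'what the ***** is this' (the periods are masked too)
```

Note that these dotted patterns require the periods, so a plain "fuc" would still need the original, unmodified array.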

Hint: You can make this array_map() call smarter so that it catches more obfuscations. For example, to catch a bad word whose letters are separated by a period, a whitespace character, or a comma, you can do:

$bad_words = array_map( function( $el) { 
    return implode( '(?:\.|\s|,)', str_split( $el));
}, $bad_words);

Now if you make that obfuscation group optional, you'll catch a lot more bad words:

$bad_words = array_map( function( $el) { 
    return implode( '(?:\.|\s|,)?', str_split( $el));
}, $bad_words);

Now the patterns should match bad words written like this:

f.u.c
f,u.c
f u c 
fu c
f.uc

And many more.
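
If you want all of this in one place, a rough sketch could look like the following. The function name cleanStr(), its default word list, the case-insensitive i flag and the per-letter preg_quote() call are my assumptions, not something the code above requires.

```php
<?php
// Hypothetical helper: the name cleanStr() and the default word list are assumptions.
function cleanStr($str, array $bad_words = array('dic', 'tit', 'fuc'))
{
    // Allow at most one period, whitespace character, or comma between letters.
    $patterns = array_map(function ($el) {
        // Quote each letter so words containing regex metacharacters stay literal.
        $letters = array_map(function ($ch) {
            return preg_quote($ch, '/');
        }, str_split($el));
        return implode('(?:\.|\s|,)?', $letters);
    }, $bad_words);

    // "i" makes the match case-insensitive; \b keeps words like "dictionaries" intact.
    $regex = '/\b(?:' . implode('|', $patterns) . ')\b/i';

    // Mask every match with one asterisk per matched character.
    return preg_replace_callback($regex, function ($match) {
        return str_repeat('*', strlen($match[0]));
    }, $str);
}

$demo = cleanStr('f.u.c, fu c and F,U.C but not dictionaries');
// $demo is '*****, **** and ***** but not dictionaries'
```

Because the separators are part of the match, they are masked along with the letters. str_split() works on bytes, so this sketch assumes single-byte (ASCII) bad words.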

4 Comments

Can you put the array_map approach into a public static function cleanStr($str) method? Or is $el an array of bad words?
$el is an individual array element. You can put the logic into a function, but you're not cleaning a string, you're turning your $bad_words array into a more regex-friendly array that is capable of replacing many obfuscations.
Take this string, for example: "i love dictionaries with titles on the top of the page, also shop at fuccillo hyundai!" There are no bad words in it. However, it would return "i love ***tionaries with ***les on the top of the page, also shop at ***cillo hyundai!" Also, it should be $match[0], not $match[1], in your first code block.
That's an easy fix - You need word boundaries. I've updated my answer and fixed the $match[0].