Filtering comments for swear words or spam

npulis

3:26 pm on Mar 2, 2010 (gmt 0)

Hi there,

Can anyone suggest a good method on how to filter out comments from spam and swear words?

Thanks.

CyBerAliEn

4:48 pm on Mar 2, 2010 (gmt 0)

A really simple solution:

$words = array('apple','orange','banana');
$string = 'How was your apple? I like oranges! But bananas are even better!';
$fixed = str_ireplace($words,'(SNIP)',$string);
echo "String: $string<br><br>Fixed: $fixed<br>";

This takes your input string, checks it for every "bad word" in $words, and replaces each one with '(SNIP)'. It is case-insensitive.

The above will output:

String: How was your apple? I like oranges! But bananas are even better!

Fixed: How was your (SNIP)? I like (SNIP)s! But (SNIP)s are even better!

This will help to catch obvious bad words. But you will need more advanced code to handle someone entering a bad word with a random character in the middle, etc. You could try searching for 'advanced bad word filter php' to find a free version online.
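One caveat with the str_ireplace approach above is that it matches substrings, so a listed word is also snipped when it sits inside a longer, innocent word. A minimal sketch of a whole-word variant follows; the function name snip_whole_words and the optional plural "s" are assumptions made for illustration, not part of the original suggestion.

```php
<?php
// Hypothetical helper: replace only whole-word matches of each listed word,
// optionally followed by a plural "s".
function snip_whole_words(array $words, $text, $mask = '(SNIP)') {
    if (count($words) === 0) {
        return $text; // nothing to filter
    }
    // preg_quote() escapes any regex metacharacters in the word list.
    $escaped = array();
    foreach ($words as $w) {
        $escaped[] = preg_quote($w, '/');
    }
    // \b anchors the match at word boundaries; "s?" also catches simple plurals.
    $pattern = '/\b(?:' . implode('|', $escaped) . ')s?\b/i';
    return preg_replace($pattern, $mask, $text);
}

$words = array('apple', 'orange', 'banana');
echo snip_whole_words($words, 'I like Oranges, pineapple and a banana.');
// prints: I like (SNIP), pineapple and a (SNIP).
```

Note that "pineapple" survives here, whereas plain str_ireplace would have turned it into "pine(SNIP)".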

CyBerAliEn

4:55 pm on Mar 2, 2010 (gmt 0)

Oh, and to add... filtering for spam is a lot more difficult. Good spam detection requires advanced algorithms.

The simplest approach is to check the string for any obvious spam-like words, e.g. viagra, free #*$!, cialis, etc. The majority of spam contains these types of words. Check whether the comment contains such words and block the ones that do.
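As a rough sketch of that keyword check, something like the following could work. The function name contains_spam_words and the two-word list are placeholders for illustration; a real list would be longer and maintained over time.

```php
<?php
// Hypothetical helper: returns true if the comment contains any listed spam term.
function contains_spam_words($comment, array $spamWords) {
    foreach ($spamWords as $term) {
        // stripos() is a case-insensitive substring search; it returns
        // false (not 0) when there is no match, hence the strict comparison.
        if (stripos($comment, $term) !== false) {
            return true;
        }
    }
    return false;
}

$spamWords = array('viagra', 'cialis'); // illustrative list only
var_dump(contains_spam_words('Cheap VIAGRA here!', $spamWords)); // bool(true)
var_dump(contains_spam_words('Nice post, thanks.', $spamWords)); // bool(false)
```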

The more effective approach is to use CAPTCHA images to verify that the user is a human. These are the hard-to-read images where you have to enter a code/sequence to submit a comment. On a personal level, I hate these. I can't tell you how many times I've entered something "wrong" because a '9' looked like a 'g' or such, and I'm waiting for the day something better comes along. But they are by and large one of the most effective ways of reducing/eliminating spam.

Readie

5:24 pm on Mar 2, 2010 (gmt 0)

I think a regular expression should be quite able to take care of people misspelling undesired words, but my attempt at writing one got so big that I think there must be a better way of phrasing it.

Anyway, I personally think that writing (SNIP) looks intrusive, so to replace undesired words with the random symbols that you often see, this script should suffice:

$words = array('apple','orange','banana');
$string = 'How was your apple? I like oranges! But bananas are even better!';
$symb = array('!', '£', '$', '%', '&', '@', '?', '#');
$last = count($symb) - 1; // highest valid index into $symb
$repl = array();
foreach ($words as $i => $word) {
    // Build a replacement of random symbols, one per character of the word
    $repl[$i] = '';
    $len = strlen($word);
    for ($o = 0; $o < $len; $o++) {
        $repl[$i] .= $symb[rand(0, $last)];
    }
}
$fixed = str_ireplace($words, $repl, $string);
echo "String: $string<br><br>Fixed: $fixed<br>";

Readie

5:58 pm on Mar 2, 2010 (gmt 0)

After a little tinkering, I came up with this script. It may take you a little while to get your head around the regular expressions I've used so that you can write your own, but once you do it should take care of your worries.

<?php

$words = array(
    '/([^ ]{1})?a([^a]{1})?pp([^p]{1})?l([^l]{1})?e([^ ]{1})?/i',
    '/([^ ]{1})?o([^o]{1})?r([^r]{1})?a([^a]{1})?n([^n]{1})?g([^g]{1})?e([^ ]{1})?/i',
    '/([^ ]{1})?b([^b]{1})?a([^a]{1})?n([^n]{1})?a([^a]{1})?n([^n]{1})?a([^ ]{1})?/i'
);

$symb = array('!', '£', '$', '%', '&', '@', '?', '#');
$last = count($symb) - 1; // highest valid index into $symb

$string = 'How was your applee? I like ora nges! But banaanas are even better!';

echo '<p>Original: ' . $string . '</p>';

foreach ($words as $word) {
    // $out[0] holds every full match of this pattern
    preg_match_all($word, $string, $out, PREG_PATTERN_ORDER);
    foreach ($out[0] as $match) {
        // Build a run of random symbols as long as the matched text
        $repl = '';
        $len = strlen($match);
        for ($o = 0; $o < $len; $o++) {
            $repl .= $symb[rand(0, $last)];
        }
        // Replace the matched text itself, not the capture-group arrays
        $string = str_replace($match, $repl, $string);
    }
}
echo '<p>Fixed: ' . $string . '</p>';

?>
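For comparison, the match-then-replace loop in the script above could probably be collapsed with preg_replace_callback, which hands each match to a function and substitutes whatever that function returns. This is only a sketch: random_symbols and mask_matches are made-up helper names, and the patterns are simplified stand-ins rather than the ones from the post.

```php
<?php
// Hypothetical helper: build a string of $length random symbols.
function random_symbols($length, array $symbols) {
    $out = '';
    for ($i = 0; $i < $length; $i++) {
        $out .= $symbols[mt_rand(0, count($symbols) - 1)];
    }
    return $out;
}

// Hypothetical helper: mask every match of every pattern in $patterns.
function mask_matches(array $patterns, $text, array $symbols) {
    return preg_replace_callback($patterns, function ($m) use ($symbols) {
        // $m[0] is the full matched text; swap it for symbols of equal length.
        return random_symbols(strlen($m[0]), $symbols);
    }, $text);
}

// Simplified patterns: ".?" allows one stray character between letters.
$patterns = array('/a.?pp.?l.?e/i', '/b.?a.?n.?a.?n.?a/i');
$symbols  = array('!', '$', '%', '&', '@', '#');
echo mask_matches($patterns, 'How was your applee? But banaanas are better!', $symbols);
```

Because each match is replaced where it was found, there is no second search through the string and no risk of replacing the wrong text.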

rocknbil

8:16 pm on Mar 2, 2010 (gmt 0)

"from spam and swear words?"

These are two distinctly different types of inputs and simply must be handled differently.

Note that the following suggestions only occur AFTER you have done a precursory filtering/cleansing of the input data.

Spam: I have never **had** to implement a Captcha and hope never to resort to this. For one, it is (what I consider) an unnecessary barrier for your users; second, I have **seen** it circumvented by bots in vBulletin installs. Last, I prefer to at least try to address the problem at the root and take away the reason they are doing it.

There are many reasons spammers attack your site, but the primary two are email injection (using your mailer to send out spam, "borrowing" your bandwidth) and link dropping in a variety of formats. Somewhere you maintain a list, at the very least an array, of bad patterns:


$bad_patterns = array(
    'b?cc\s*:', // attempted malformed injection
    'to\s*:',
    'content\-type',
    '\[\s*URL.*\]*', // attempted link dropping
    '\[\s*LINK.*\]*',
    '\%5B\s*URL.*(\%5D)*',
    // Etc. . . more here ....
    // "example.com" is YOUR site. A common attack is
    // "anything" @example.com to make it look like it
    // came from YOU
    'example\.com' // dot escaped so it matches a literal "."
);

Then you use those in a preg match. I do this in a logging function, where all input is logged somewhere (also a Very Good Idea.)


foreach ($bad_patterns as $v) {
    if (preg_match("/$v/i", $value)) {
        $trap .= "SPAM: $value found in " . $key . " field.\n";
        $spam_in = 1;
    }
}

In the previous example, if $spam_in == 1, exit with a simple "no email was sent" message.
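To show how the pattern check might sit inside a loop over the submitted fields, here is a self-contained sketch. The function name find_spam_patterns, the shortened pattern list, and the $fields array standing in for $_POST are all assumptions for illustration.

```php
<?php
// Hypothetical helper: scan every submitted field against every pattern and
// return a list of log lines; an empty list means nothing suspicious was found.
function find_spam_patterns(array $fields, array $bad_patterns) {
    $trap = array();
    foreach ($fields as $key => $value) {
        foreach ($bad_patterns as $v) {
            if (preg_match("/$v/i", $value)) {
                $trap[] = "SPAM: pattern $v found in $key field.";
            }
        }
    }
    return $trap;
}

$bad_patterns = array('b?cc\s*:', 'content\-type', '\[\s*URL');
$fields = array( // stand-in for $_POST
    'name'    => 'Bob',
    'message' => "Hello\nBcc: victim@example.org",
);

$trap = find_spam_patterns($fields, $bad_patterns);
if (count($trap) > 0) {
    // Log $trap somewhere, then stop without sending anything.
    echo "No email was sent.\n";
}
```

Returning the log lines instead of setting a flag keeps the logging and the "stop here" decision in one place for the caller.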

Word filter: Once the input passes basic cleansing and spam pattern filtering, you can go on to filter words. You'll never catch every attempt to circumvent the filtering, and if you do, the filter is guaranteed to be "high maintenance" or to take many casualties with it, nixing ordinary input. One popular freelance site is going through this now, obscuring valuable information to prevent people from entering phone numbers to directly circumvent their system.

For example, s p a c i n g out the bad words would be terribly difficult to beat, and almost not worth it. Once they see their words are filtered, they make a choice: risk being banned or knock it off.

I'd probably approach it like so. This is a rough sketch, not finished code, but it is a way to get going.

$badwds = get_bad_wds();
// Your list of bad words, stored in DB, plain text, whatever
$mytext = filter_bad_words($badwds, $mytext);


function filter_bad_words($list, $input) {
    $clean = array();
    $word_input = preg_split('/\s+/', $input); // or explode
    foreach ($word_input as $word) {
        $is_bad = 0;
        // gets rid of t*h*i*s, t-h-i-s, and similar
        $wd = preg_replace('/[^a-z0-9\.,\'"]+/i', '', $word);
        foreach ($list as $filter) {
            if (preg_match("/$filter/i", $wd)) {
                // mask every character of the offending word
                $wd = preg_replace('/./', '#', $wd);
                $is_bad = 1;
                break;
            }
        }
        // keep the original word unless it matched a filter
        $clean[] = ($is_bad == 1) ? $wd : $word;
    }
    return implode(' ', $clean);
}

As said, this is not finished code: it still needs to address possible markup input and paragraphs, and the preg on $wd is likely incomplete, but this will all vary based on what you're trying to do.

To (try to) beat my "spacing" example above, do a match/replace on the entire block with spaces removed. That's a bit of a kludge but can bear some value . . .
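A minimal sketch of that spaces-removed check, which only detects rather than replaces; the function name has_spaced_out_word is invented for illustration:

```php
<?php
// Hypothetical helper: true if any listed word appears once all whitespace
// has been stripped from the text.
function has_spaced_out_word($text, array $badwds) {
    $squashed = preg_replace('/\s+/', '', $text);
    foreach ($badwds as $bad) {
        if (stripos($squashed, $bad) !== false) {
            return true;
        }
    }
    return false;
}

$badwds = array('banana');
var_dump(has_spaced_out_word('I love b a n a n a s', $badwds)); // bool(true)
var_dump(has_spaced_out_word('I love pears', $badwds));         // bool(false)
// The kludge part: unrelated neighbouring words can join into a false hit.
var_dump(has_spaced_out_word('urban an area', $badwds));        // bool(true)
```

The last call shows why this is better used to flag a comment for review than to reject it outright.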

Summary:

- Cleanse input first
- Look for spam input patterns; if found, exit
- Filter out bad words
- proceed with your regularly scheduled programming . . .