This is becoming a challenge, and I'm curious whether you guys and gals have any suggestions or feedback.
I'm working on a profanity filter for my message board that replaces bad words with ****. People occasionally try to get around the filter, though, so I'm trying to figure out a way to catch cases where someone uses a special character in place of a real letter.
For example:
@ss
@$$
$h!+
Or worse:
s÷%t (in context the meaning is clear, but replacing the à with A just turns it to gibberish)
But I DON'T want to catch, for example:
@gmail
#foo
you're
No!This (no space after the !)
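To make the requirement concrete, here is a rough sketch of the behavior I'm after. Everything in it is an assumption for illustration: the %lookalike table, the build_pattern and censor helpers, and the two-word list are all made up, and it only handles one-for-one symbol swaps (so it would not catch the "Or worse" example above).

```perl
use strict;
use warnings;

# Hypothetical look-alike table: each letter maps to a character class of
# symbols accepted as stand-ins. The entries are illustrative only.
my %lookalike = (
    a => '[a@4]',
    i => '[i!1]',
    s => '[s\$5]',
    t => '[t+7]',
);

# Build a regex for one word. The lookarounds require that the match is not
# glued to a letter or digit, so "@gmail" and "class" are left alone.
sub build_pattern {
    my ($word) = @_;
    my $body = join '',
        map { exists $lookalike{$_} ? $lookalike{$_} : quotemeta($_) }
        split //, lc $word;
    return qr/(?<![A-Za-z0-9])${body}(?![A-Za-z0-9])/i;
}

my @patterns = map { build_pattern($_) } ('ass', 'shit');

# Replace every match with the same number of asterisks.
sub censor {
    my ($text) = @_;
    for my $re (@patterns) {
        $text =~ s/($re)/'*' x length($1)/ge;
    }
    return $text;
}
```

The boundary check is what keeps @gmail, #foo, you're, and No!This untouched, but it also means a bad word embedded in a longer run of letters slips through, so this is a starting point at best.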
I'm already manually swapping some characters to letters, like so:
%asciiChars = (
# Upside down
'592' =>'a',
'596' =>'c',
'477' =>'e',
'607' =>'f',
'613' =>'h',
'305' =>'i',
'1592' =>'j',
'670' =>'k',
'1503' =>'l',
'623' =>'m',
'633' =>'r',
'647' =>'t',
'652' =>'v',
'653' =>'w',
'654' =>'y',
# Uppercase
'65' =>'A',
'66' =>'B',
'67' =>'C',
'68' =>'D',
'69' =>'E',
'70' =>'F',
'71' =>'G',
'72' =>'H',
'73' =>'I',
'74' =>'J',
'75' =>'K',
'76' =>'L',
'77' =>'M',
'78' =>'N',
'79' =>'O',
'80' =>'P',
'81' =>'Q',
'82' =>'R',
'83' =>'S',
'84' =>'T',
'85' =>'U',
'86' =>'V',
'87' =>'W',
'88' =>'X',
'89' =>'Y',
'90' =>'Z',
# Lowercase
'97' =>'a',
'98' =>'b',
'99' =>'c',
'100' =>'d',
'101' =>'e',
'102' =>'f',
'103' =>'g',
'104' =>'h',
'105' =>'i',
'106' =>'j',
'107' =>'k',
'108' =>'l',
'109' =>'m',
'110' =>'n',
'111' =>'o',
'112' =>'p',
'113' =>'q',
'114' =>'r',
'115' =>'s',
'116' =>'t',
'117' =>'u',
'118' =>'v',
'119' =>'w',
'120' =>'x',
'121' =>'y',
'122' =>'z',
# Special chars
'263' =>'c',
'347' =>'s'
);
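# Replace each numeric entity (e.g. &#65;) with its mapped letter.
foreach my $key (keys %asciiChars) {
    my $mod = '&#' . $key . ';';
    # \Q...\E makes the entity match literally; /i is not needed for digits.
    $text =~ s/\Q$mod\E/$asciiChars{$key}/g;
}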
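A possible alternative to looping over every key is to decode all numeric entities in a single pass. This is only a sketch: entity_to_letter, decode_entities_sketch, and the trimmed-down %special table are hypothetical names, and plain ASCII letters are handled with chr() instead of being listed one by one.

```perl
use strict;
use warnings;

# Only the non-ASCII look-alikes need a table (a subset of the list above).
my %special = (
    592 => 'a', 596 => 'c', 477 => 'e', 305 => 'i',
    263 => 'c', 347 => 's',
);

# Map one entity code to a letter, or return the entity unchanged.
sub entity_to_letter {
    my ($code) = @_;
    return $special{$code} if exists $special{$code};
    # ASCII A-Z and a-z decode to themselves.
    return chr($code)
        if ($code >= 65 && $code <= 90) || ($code >= 97 && $code <= 122);
    return "&#$code;";    # anything else is left as written
}

# Rewrite every &#NNN; entity in the text in one substitution.
sub decode_entities_sketch {
    my ($text) = @_;
    $text =~ s{&#(\d+);}{ entity_to_letter($1) }ge;
    return $text;
}
```

This scans the text once rather than once per table entry, and unknown entities pass through untouched.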
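I also tried this tonight, but it threw an error. As far as I can tell, the unescaped $ in [š$] gets read as the special variable $], so I've escaped it here (the literal non-ASCII characters also assume "use utf8;" and a decoded $text):
$text =~ s/ï/i/g;
$text =~ s/ö/o/g;
$text =~ s/[š\$]/s/g;
$text =~ s/¥/y/g;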
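For the accented letters specifically, one option I'm considering is Unicode decomposition instead of one substitution per character. A minimal sketch, assuming $text already holds decoded characters (not raw UTF-8 bytes); fold_accents is a made-up name, and symbols with no decomposition, such as the yen sign, still need a manual mapping.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Strip accents by decomposing each character into base letter plus
# combining marks, then deleting the marks.
sub fold_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;     # drop combining marks (diaeresis, caron, ...)
    $decomposed =~ tr/\x{A5}/y/;     # yen sign has no decomposition; map it by hand
    return $decomposed;
}
```

This would cover ï, ö, and š without listing them, though it does nothing for plain symbol swaps like $ for s.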
Before I keep going down this rabbit hole, trying to find every possible variation and swapping it, can you guys suggest a better way to find when the user is trying to get around the filter?