Forum Moderators: coopster & phranque

Fun with regex, matching when the user is trying to work around

         

csdude55

9:20 pm on Jan 18, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Still playing with my profanity filter :-)

A common issue is users swapping in look-alike characters to get around the filter. Examples include using @ instead of a, $ instead of s, ! or l (lowercase L) instead of i, ! or I (uppercase i) instead of l, and so on.

I have a ton of regex written to filter out profanity, and I duplicate the same things in each one of them:

[s\$]
[a\@]
[il!]

I recently had to deal with a new variation, and have spent most of my day modifying all of the filters to catch it :-/

As a long-term fix, I'm thinking of a way to apply these variations at the beginning of the filter, so that I only have to modify them once.
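One way to keep the variations in a single place, as a rough sketch only: store each character class once in a table and generate the per-word regex from a plain spelling. The names %class_for and pattern_for are made up for illustration, and the table only holds the three classes listed above. Lookarounds are used instead of \b because \b does not fire next to a leading $ or a trailing !.

```perl
use strict;
use warnings;

# Hypothetical table: each letter maps to the character class that should match it.
# Adding a new variation means editing one entry here.
my %class_for = (
    s => '[s\$]',
    a => '[a\@]',
    i => '[il!]',
);

# Build a regex for a plainly spelled word, swapping in the class
# for any letter that has one and escaping everything else.
sub pattern_for {
    my ($word) = @_;
    my $pat = join '',
        map { $class_for{$_} // quotemeta($_) }
        split //, lc $word;
    # (?<!\w) and (?!\w) stand in for \b so symbols at the edges still count.
    return qr/(?<!\w)$pat(?!\w)/i;
}

my $re = pattern_for('sail');    # 'sail' is just a harmless stand-in word
for my $sample ('s@il', '$a!l', 'soil') {
    printf "%-5s %s\n", $sample, ($sample =~ $re ? 'match' : 'no match');
}
```

The existing filters would then be written as plain words, and a new workaround only touches %class_for.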

My initial thought is:

1. Create an associative array with all of the potential workarounds; eg, '$' => 's', '@' => 'a', and so on.

2. Split the string by \s

3. Loop through the words, apply the workarounds from the associative array to each one, then perform a substitution if the modified word matches a filtered word

Something like:

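use strict;
use warnings;

# Words to filter (placeholders)
my %badwords = (
'cat' => 1,
'dog' => 1
);

# Look-alike character => the letter it stands in for
my %workarounds = (
'$' => 's',
'@' => 'a',
'l' => 'i',
'!' => 'i',
'j' => 'i',
'+' => 't'
);

my $text = 'this is a c@+';

my @arr = split(' ', $text);

foreach my $original (@arr) {
my $word = lc $original;

# Apply every workaround (all occurrences) before checking the word
foreach my $key (keys %workarounds) {
$word =~ s/\Q$key\E/$workarounds{$key}/g;
}

# Note: this replaces the first place $original appears in $text
if (exists $badwords{$word}) {
$text =~ s/\Q$original\E/****/;
}
}

print $text;
# this is a ****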

This works, but given a post of 5,000 words and running each word over a ton of regex filters, this would be SUPER slow!
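One possible way to cut the cost, sketched under assumptions: since every workaround here is a single character, tr/// can normalize a word in one pass, and the bad-word check becomes a hash lookup instead of a regex per filter. The sub name censor is hypothetical. Because l and j are both folded to i, the bad-word list would have to be stored in that normalized spelling too, and trailing punctuation is not handled.

```perl
use strict;
use warnings;

# Placeholder list; entries must be spelled the way words look AFTER normalization.
my %badwords = map { $_ => 1 } qw(cat dog);

sub censor {
    my ($text) = @_;
    # Split on whitespace but keep the separators so spacing survives the rejoin.
    my @parts = split /(\s+)/, $text;
    for my $part (@parts) {
        next if $part =~ /^\s*$/;
        # tr/// never interpolates, so $ and @ are literal characters here.
        (my $norm = lc $part) =~ tr/$@l!j+/saiiit/;
        $part = '****' if $badwords{$norm};
    }
    return join '', @parts;
}

print censor('this is a c@+'), "\n";
```

Each word costs one tr/// pass and one hash lookup, however many bad words are in the list.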

Any other suggestions?

lucy24

11:03 pm on Jan 18, 2024 (gmt 0)




In the adjoining thread, I wondered if it's feasible to simply swallow all those sketchy characters in one great gulp. As a bonus, it makes the offender look illiterate. Or, on a truly good day, makes them worry that their keyboard is malfunctioning. Mwa ha ha. Some items on George Carlin's list might even come through as "" and, well, who's complaining?
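A minimal sketch of the "one great gulp" idea, assuming the sketchy set is just $, @, ! and + (which characters to swallow is a policy choice, and swallow is a made-up name):

```perl
use strict;
use warnings;

sub swallow {
    my ($text) = @_;
    # The /d flag deletes every listed character that has no replacement.
    (my $clean = $text) =~ tr/$@!+//d;
    return $clean;
}

print swallow('what a c@+'), "\n";
```

This also eats legitimate exclamation marks and dollar signs, so it is blunt by design.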

csdude55

1:13 am on Jan 19, 2024 (gmt 0)




I actually did start doing something like that a few years ago, with phrases like "you're an idiot". I just wiped it out entirely, so yeah, it makes them look like the idiot ;-)

In this case, though, it's a little more complicated. For example, I have a guy that's replacing "i" with "j", so it's not just a weird character that I have to worry about. I tried to make the filter recognize phrases that looked suspicious, like:

s{\b(that'?s a b\w+h)\b}{****}gi

(typed, not tested)

but then someone said "that's a batch of cookies", and that got filtered :-/
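One way to soften that kind of false positive, as a sketch rather than a tested fix: capture the b...h word separately and skip the replacement when it is on a list of known-harmless words. The %harmless list and the sub name filter_phrase are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical allowlist of innocent words that fit the b\w+h shape.
my %harmless = map { $_ => 1 } qw(batch bunch beach bench);

sub filter_phrase {
    my ($text) = @_;
    # /e evaluates the replacement as code: keep the phrase if the
    # captured word is allowlisted, otherwise star it out.
    $text =~ s{\b(that'?s a (b\w+h))\b}{ $harmless{lc $2} ? $1 : '****' }gie;
    return $text;
}

print filter_phrase("that's a batch of cookies"), "\n";
print filter_phrase("thats a blah"), "\n";
```

The allowlist has to be maintained by hand, so this trades one kind of upkeep for another.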

lucy24

2:03 am on Jan 19, 2024 (gmt 0)




Now, that's another valid approach, intentionally or otherwise. Years ago, on a different forum, there was one member who kept mistyping the name of a beloved pet so it came out as a vile ethnic slur. It happened to be a slur that isn't used very often, but it did need to be obfuscated, so we (the moderators) simply ran it in the other direction: the substitution text for the ethnic slur was ... the said pet's name.

(As an aside, this forum had a fairly imaginative administrator. When we were hit by spammers of a particular type, she fired up a bunch of word-filter substitutions along the lines of “hardcore” >> “boring”.)
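A word-for-word substitution filter of that kind might look roughly like this. The only pair taken from the thread is hardcore => boring; %swap and swap_words are made-up names, and the original capitalization is not preserved.

```perl
use strict;
use warnings;

# Substitution table: filtered word => harmless replacement.
my %swap = (
    hardcore => 'boring',
);

# One alternation built from the table, longest words first.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) }
    keys %swap;
my $swap_re = qr/\b($alternation)\b/i;

sub swap_words {
    my ($text) = @_;
    # Look up the lowercased match so the table keys stay lowercase.
    $text =~ s/$swap_re/$swap{lc $1}/g;
    return $text;
}

print swap_words('Hardcore deals inside!'), "\n";
```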

So your idiotic user comes out saying “sumbatch”, and wouldn't that be a batch for him.