Filtering profanity - Perl Server Side CGI Scripting forum at WebmasterWorld - WebmasterWorld

Forum Moderators: coopster & phranque


Filtering profanity

csdude55

2:34 am on Jan 18, 2017 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I have a general regex to catch most profanity and **** it out. The main list comes from a database, so it just loads the table then does a quick search and replace.

What I'm working on right now are workarounds. Like, when someone posts these (sorry if this is a problem, mods):

$h!t
$h|+
@$$
@ $ $
a-$-s

and so on.

Obviously, I know that people are always going to try to find ways to get around it, but I'm trying to lighten my workload a little in the meantime.

So what I'm working on now is a regex that tries to catch specific symbols (or none), followed by a letter, followed by specific symbols, and repeated. Like so:

s/\b[\!@#\$%\^\&\*\(\)\|\\\/]*[a-z]+[\!@#\$%\^\&\*\(\)\|\\\/]+[a-z]+\b/****/i;

# For clarity (the /x flag lets the pattern span several lines):
s/
 \b
 [\!@#\$%\^\&\*\(\)\|\\\/-]*
 (
  [a-z]+
  [\!@#\$%\^\&\*\(\)\|\\\/-]+
 )+
 \b
/****/ix;

I haven't implemented this yet, though. The problem I'm seeing is that it relies on the word to end with a symbol, so it would catch $h!+, but not $h!t. If I change the second square brackets to * instead of +, though, it's going to catch every word.

Any suggestions on how I might improve this regex to catch all of those different variations?

** This is secondary, but it would be pretty cool if it could count how many alphanumeric characters are in the word, too, and replace it with that many asterisks. So, profanity with 5 or 7 letters would have 5 or 7 asterisks instead of always having 4. I don't know how to do that one at all, though. **

TIA!
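
One possible approach to both questions above, offered as a rough sketch rather than a tested filter: instead of one regex that describes every disguise, normalise each token (map look-alike symbols to letters, drop separators) and look the result up in the word list. The /e modifier then lets the replacement be computed, so the number of asterisks follows the number of letters. The word list, the %lookalike table, and the names normalise and censor are all placeholders invented for this example; the real list would come from the database mentioned in the post.

```perl
use strict;
use warnings;

# Hypothetical word list; the thread loads the real one from a database.
my %banned = map { $_ => 1 } qw(darn heck);

# Assumed look-alike mapping; extend as needed.
my %lookalike = ('$' => 's', '@' => 'a', '!' => 'i', '1' => 'i', '+' => 't', '3' => 'e');

sub normalise {
    my ($token) = @_;
    my $word = lc $token;
    $word =~ s/([\$\@!1+3])/$lookalike{$1}/g;   # swap symbols for letters
    $word =~ s/[^a-z]//g;                       # drop separators such as - or |
    return $word;
}

sub censor {
    my ($text) = @_;
    # Treat each run of non-whitespace as one candidate token.
    $text =~ s{(\S+)}{
        my $token = $1;
        my $plain = normalise($token);
        $banned{$plain} ? '*' x length($plain) : $token
    }ge;
    return $text;
}

print censor('well d@rn it, h-3-c-k'), "\n";
```

This sketch does not handle variants split by spaces (like the "@ $ $" case), and it swallows punctuation attached to a censored token; both would need extra handling.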

phranque

6:05 am on Jan 18, 2017 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

i would consider using an existing module and modifying or extending it if necessary.

Regexp::Common::profanity_us
http://search.cpan.org/~tbone/Regexp-Common-profanity_us-4.112150/lib/Regexp/Common/profanity_us.pm [search.cpan.org]

csdude55

9:18 am on Jan 18, 2017 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I appreciate the idea! FYI, though, it looks like the CPAN link has changed; that gives me a server error. The new link is:

[metacpan.org...]

Either way, I originally built my system some 10 years ago using NetNanny's list as a starting point. But I needed a lot of customization because of localized words and username-specific filters, so I'm still going to have to build my own secondary system to do the same thing. Using this is probably a better system for most of my problems, but I still need help on the regex to catch those other words.

phranque

10:05 am on Jan 18, 2017 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

even if you decide to expand on your current solution, i would still take a look at the code in that module for some clues.

lucy24

5:46 pm on Jan 18, 2017 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Psst! You do not need to escape characters within grouping brackets. Exceptions: \ [ ] and sometimes -

csdude55

9:17 pm on Jan 18, 2017 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Psst! You do not need to escape characters within grouping brackets. Exceptions: \ [ ] and sometimes -

Hmph, I didn't know that! Thanks :D

I assume you also have to escape ^ and $, too, though? Or would that only be true if the bracket begins with ^ or ends with $? Also, it's my understanding that you don't have to escape - if it's at the beginning or end of the bracket, but if it's between two other characters then it will be read as a "range", right?

lucy24

11:36 pm on Jan 18, 2017 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I assume you also have to escape ^ and $, too, though?

Oops, forgot about ^ which has a special meaning in brackets: at the beginning of the group, it means "not" the-rest-of-this-stuff. For a non-initial ^ it probably depends on the RegEx engine; I expect most don't care.

The $ has no meaning within grouping brackets, so it does not need to be escaped.

phranque

3:47 am on Jan 19, 2017 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

the only metacharacters within a group are the closing square bracket (]), the hyphen (-), the caret (^), and the backslash (\).

a literal closing bracket or a literal backslash must always be escaped.

the caret is only a metacharacter when it is the first character after the opening bracket (group negation).
therefore a literal caret requires escaping only if it is the first character of the group.

the hyphen is a metacharacter when defining a range of characters.
therefore a literal hyphen doesn't require escaping if it is the first character of the group, but must always be escaped elsewhere.

for these reasons it is good practice to put a literal hyphen first in the group and a literal caret elsewhere (to avoid the unnecessary backslash).
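
A small Perl check of the caret and hyphen rules above, as a sketch only; the variable names are made up for illustration. It also shows one point where Perl is more lenient than the rule as stated: a hyphen in last position is treated as a literal too.

```perl
use strict;
use warnings;

my $hyphen_first = qr/[-+!]/;   # leading hyphen: literal
my $hyphen_last  = qr/[+!-]/;   # trailing hyphen: also literal in Perl
my $hyphen_range = qr/[!-+]/;   # hyphen between two characters: the range "!" through "+"
my $caret_later  = qr/[!^]/;    # caret after the first position: literal
my $caret_first  = qr/[^!]/;    # caret in first position: negation

for my $char ('-', '^', '#') {
    printf "%s  first:%d last:%d range:%d caret:%d negated:%d\n", $char,
        map { $char =~ $_ ? 1 : 0 }
            $hyphen_first, $hyphen_last, $hyphen_range, $caret_later, $caret_first;
}
```

Other regex engines may not accept the trailing-hyphen form, so putting the hyphen first remains the more portable habit.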

lucy24

5:11 am on Jan 19, 2017 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I got curious and tried this:
[^^]
SubEthaEdit was perfectly happy to interpret it as "Find all non-caret characters" :)

phranque

6:57 am on Jan 19, 2017 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

SubEthaEdit was perfectly happy to interpret it as "Find all non-caret characters"

this is consistent as the first caret is a metacharacter and the second caret is a literal character.

therefore a literal caret requires escaping only if it is the first character of the group.

lucy24

6:59 pm on Jan 19, 2017 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

You gotta admit, though, it looks goofy. Like a smiley expressing, I guess, extreme surprise. I tried it because I was thinking about non-medial hyphens, which really do require escaping in some RegEx engines.

RegExpInfo [regular-expressions.info] says:

Hyphens at other positions in character classes where they can't form a range may be interpreted as literals or as errors. Regex flavors are quite inconsistent about this.

That phrase "quite inconsistent" is enough to make the blood run cold; it's like when apache dot org talks about unintended consequences. (Still trying to figure out the "other positions". Maybe they mean locutions like [a-b-c].)

The POSIX and GNU flavors are an exception. They treat backslashes in character classes as literal characters. So with these flavors, you can't escape anything in character classes.

Huh. You learn something new every day. But then, how do you convey the concept of a literal close-bracket?

phranque

11:58 pm on Jan 19, 2017 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

In PCRE-based regexp engines closing brackets are escapable:
http://www.pcre.org/current/doc/html/pcre2pattern.html#SEC9

phranque

12:59 am on Jan 20, 2017 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

there are some differences between the PCRE library (used by apache and others) and the perl implementation.
since we are discussing this in the perl forum, i am also including this more appropriate reference.

http://perldoc.perl.org/perlrecharclass.html#Special-Characters-Inside-a-Bracketed-Character-Class

A ] is normally either the end of a POSIX character class ..., or it signals the end of the bracketed character class. If you want to include a ] in the set of characters, you must generally escape it.

However, if the ] is the first (or the second if the first character is a caret) character of a bracketed character class, it does not denote the end of the class (as you cannot have an empty class) and is considered part of the set of characters that can be matched without escaping.
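
The perldoc passage quoted above can be tried directly in Perl. This is a minimal sketch with illustrative variable names; it covers only the three placements the passage describes.

```perl
use strict;
use warnings;

my $bracket_first   = qr/[]x]/;    # ] in first position: part of the set
my $bracket_negated = qr/[^]x]/;   # ] right after the caret: part of the negated set
my $bracket_escaped = qr/[x\]]/;   # ] anywhere else: escape it

for my $char (']', 'x', 'y') {
    printf "%s  first:%d negated:%d escaped:%d\n", $char,
        map { $char =~ $_ ? 1 : 0 } $bracket_first, $bracket_negated, $bracket_escaped;
}
```

The escaped form is probably the easiest to read; the unescaped first-position form is what engines that ignore backslashes inside a class fall back on.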

csdude55

1:26 am on Jan 24, 2017 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Just because we've also discussed whether to escape special characters in brackets...

I recently did this in Perl:

$text =~ s/[s$][h#][i1!][t+]/****/gi;

Which, based on the above, should be just fine. But it was matching "sit", "hit", and "white".

So then I did this as a test:

$text =~ s/[s\$][h\#][i\!][t+]/1/gi;
$text =~ s/[s\$][h\#][i!][t\+]/2/gi;
$text =~ s/[s\$][h#][i\!][t\+]/3/gi;
$text =~ s/[s$][h\#][i\!][t\+]/4/gi;

And the only one to match was "4"... so apparently, the $ in the brackets needed to be escaped, after all.

For further testing, I also tried this:

$text =~ s/[s\$][h#][i!@][t+]/****/gi;

This caught "shat" as expected, and did NOT catch "shot" as I feared. So it doesn't look like the problem is that it's reading the "$" as a string, or else it would think "@" was an array, too. But for some reason the "$" still causes a problem, so I had to escape it.
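
A likely explanation, which the thread itself never states: Perl interpolates variables inside a regex before the pattern is compiled, and $] is a built-in variable holding the interpreter's version number. Written as [s$], the class is therefore read as "[s" followed by the value of $], so the bracket no longer closes where it appears to and the first class runs on to the next "]". The sketch below compares the two spellings; the variable names are illustrative.

```perl
use strict;
use warnings;

# Unescaped: Perl sees "$]" (the version variable) and interpolates it,
# so the first class silently becomes something like [s5.024001[h#].
my $unescaped = qr/[s$][h#][i1!][t+]/i;

# Escaped: "\$" is a literal dollar sign and the class closes where expected.
my $escaped = qr/[s\$][h#][i1!][t+]/i;

for my $word (qw(sit hit white $h!t)) {
    printf "%-6s unescaped:%d escaped:%d\n", $word,
        ($word =~ $unescaped ? 1 : 0),
        ($word =~ $escaped   ? 1 : 0);
}
```

This would also account for "@" causing no trouble in the last test: "@][" is not followed by anything Perl recognises as an array name, whereas "$]" is a complete variable on its own.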