I have a general regex to catch most profanity and **** it out. The main list comes from a database, so it just loads the table then does a quick search and replace.
What I'm working on right now are the "workarounds", like when someone posts these (sorry if this is a problem, mods):
$h!t
$h|+
@$$
@ $ $
a-$-s
and so on.
Obviously, I know that people are always going to try to find ways to get around it, but I'm trying to lighten my workload a little in the meantime.
So what I'm working on now is a regex that tries to catch specific symbols (or none), followed by a letter, followed by specific symbols, and repeated. Like so:
s/\b[\!@#\$%\^\&\*\(\)\|\\\/-]*([a-z]+[\!@#\$%\^\&\*\(\)\|\\\/-]+)+\b/****/i;
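# For clarity (the x flag lets the pattern span several lines):
s/
\b                               # word boundary
[\!@#\$%\^\&\*\(\)\|\\\/-]*      # optional leading symbols
(
[a-z]+                           # one or more letters
[\!@#\$%\^\&\*\(\)\|\\\/-]+      # one or more symbols
)+                               # repeat the letters-then-symbols group
\b                               # word boundary
/****/ix;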
I haven't implemented this yet, though. The problem I'm seeing is that it relies on the word ending with a symbol, so it would catch $h!+, but not $h!t. If I change the quantifier on the second set of square brackets from + to *, though, it's going to catch every word.
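A rough sketch of one possible direction, not a tested solution: instead of forcing the symbol to sit at the end, use lookaheads to require at least one letter and at least one symbol anywhere in a whitespace-delimited token. The names `$sym`, `$token` and `censor` are made up for illustration, and `+` has been added to the symbol set as an assumption because the examples use it. Whitespace lookarounds stand in for `\b`, since `\b` only matches between a word character and a non-word character and so does not anchor next to a symbol like `$`.

```perl
use strict;
use warnings;

# Symbols treated as letter substitutes (assumption: the original set plus "+").
my $sym = qr{[!\@#\$%^&*()|\\/+-]};

# A whitespace-delimited token made only of letters and symbols,
# containing at least one of each.
my $token = qr{
    (?<!\S)             # start of string, or preceded by whitespace
    (?=\S*[a-z])        # at least one letter somewhere in the token
    (?=\S*$sym)         # at least one symbol somewhere in the token
    (?:[a-z]|$sym)+     # the token itself: letters and symbols only
    (?!\S)              # end of string, or followed by whitespace
}xi;

# Hypothetical helper: replace every matching token with four asterisks.
sub censor {
    my ($text) = @_;
    $text =~ s/$token/****/g;
    return $text;
}

print censor($_), "\n" for ('a $h!t day', 'a $h|+ day', 'an a-$-s here', 'plain words');
```

This replaces $h!t, $h|+ and a-$-s, and leaves plain words alone. It does not handle @$$ (no letters), @ $ $ (spaces inside the word), or a token with trailing punctuation such as a period, and it will also hit legitimate tokens like well-known or and/or, so it would likely need a whitelist or a narrower symbol set.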
Any suggestions on how I might improve this regex to catch all of those different variations?
** This is secondary, but it would be pretty cool if it could count how many alphanumeric characters are in the word, too, and replace it with that many asterisks. So, profanity with 5 or 7 letters would have 5 or 7 asterisks instead of always having 4. I don't know how to do that one at all, though. **
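For the asterisk count, one approach that might work is the `/e` modifier, which evaluates the replacement as Perl code, combined with the `x` repetition operator. This is only a sketch: `mask`, `censor_counted` and the stand-in word list `@bad` are hypothetical names, and the real list would come from the database.

```perl
use strict;
use warnings;

# Hypothetical helper: one asterisk per letter or digit in the matched word.
sub mask {
    my ($word) = @_;
    my $count = () = $word =~ /[a-z0-9]/gi;   # count alphanumeric characters
    return '*' x $count;
}

# Stand-in list (assumption); the real one is loaded from the database table.
my @bad = qw(darn heck blasted);
my $alt = join '|', map { quotemeta($_) } @bad;

# Replace each listed word with as many asterisks as it has alphanumerics.
sub censor_counted {
    my ($text) = @_;
    $text =~ s/\b($alt)\b/mask($1)/gie;       # /e runs mask() on each match
    return $text;
}

print censor_counted('well, blasted heck'), "\n";
```

With the stand-in list, the 7-letter word becomes 7 asterisks and the 4-letter word becomes 4. Because `mask` counts only letters and digits, a disguised word like $h!t would get 2 asterisks; using `length($1)` instead would count the symbols as well.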
TIA!