Filtering Profanity and Spam - What to do when you find it? - Webmaster General forum at WebmasterWorld

Forum Moderators: phranque

NickMNS

9:38 pm on Jan 30, 2017 (gmt 0)

I am building a script to filter profanity and spam words from comment submissions. My question is: what do you do once you have a positive?

Due to the nature of the site, my intent is to limit comments to one per day and one per entity. This will prevent users from using comments as a forum. It will also limit the ability of bots to leave comments en masse.
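A minimal sketch of that limit, under one reading of it: each user gets at most one accepted comment per 24 hours, and at most one comment ever on a given entity. The class name `CommentLimiter`, the in-memory storage, and the `user_id` / `entity_id` parameters are assumptions made for illustration; a real site would presumably keep this in its database.

```python
from datetime import datetime, timedelta


class CommentLimiter:
    """Hypothetical limiter: one comment per user per day, one per user per entity."""

    def __init__(self):
        self.last_comment_at = {}  # user_id -> time of the last accepted comment
        self.commented_on = set()  # (user_id, entity_id) pairs already used

    def allow(self, user_id, entity_id, now):
        """Return True and record the comment if it is within both limits."""
        last = self.last_comment_at.get(user_id)
        if last is not None and now - last < timedelta(days=1):
            return False  # already commented within the last 24 hours
        if (user_id, entity_id) in self.commented_on:
            return False  # already commented on this entity
        self.last_comment_at[user_id] = now
        self.commented_on.add((user_id, entity_id))
        return True
```

Keying on a logged-in user is itself an assumption. For anonymous comments the same structure could be keyed on an IP address or a session, with the usual caveat that bots rotate both.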

lucy24

10:38 pm on Jan 30, 2017 (gmt 0)

what do you do once you have a positive?

Is the site too big to handle things on a case-by-case basis? Think of the present forum, where every time you have occasion to name someone from Thailand, the name turns into ### because the word filter can't distinguish between a common name element and a Bad Word. (Well, I guess, a Spam Word.) There are genuine spam posts, and there is genuine unacceptable language ... and there are blunders.

My favorite case-by-case illustration: A forum member (not here) routinely mis-typed the name of a beloved pet so it came out as a vile ethnic slur, which triggered the automated word censors, leading other members to wonder what absolutely filthy name she had given her pet. The solution I came up with was to auto-convert the said ethnic slur into the actual name of her pet. (This was a phpBB forum, where you can program in a different substitution for any given word. We won't talk about what I did one April Fool's Day...)

NickMNS

2:35 am on Jan 31, 2017 (gmt 0)

Is the site too big to handle things on a case-by-case basis?

No, it is just starting out from scratch. Hopefully one day it will be that big. But I would still like to minimize the work required to maintain the site, so that I don't get constrained by this workload at some point in the future.

Regarding Thai names, I actually have that part figured out (differentiating between spam words and people's names), as I had to address that exact issue with my current site. But in this case I doubt that users will be leaving individuals' names as part of the comments, so my filter will be more lightweight.

There are genuine spam posts, and there is genuine unacceptable language ... and there are blunders.

But my question still is:
Do I delete the comment? (This addresses your first two points to the detriment of the third.)
Do I ask the user to correct it? (This assumes the third and likely prevents the second, but it may give spammers a workaround.)
Do I send it to moderation? (This pushes the problem to me to decide.)

lucy24

3:00 am on Jan 31, 2017 (gmt 0)

Delete vs. correct is where case-by-case comes in. If you can tell at a glance that it's a pure-spam post from beginning to end: delete. If it's a legitimate post containing a bad word: correct, preferably in an automated way that won't take any of your time. If the same person consistently uses inappropriate words, and never seems to notice that they're being censored: individual contact. Or ban, if you just don't have the energy to deal with it.

If you're set up to do word-for-word substitutions, that can solve a lot of problems. The forum I mentioned above was once hit by a flurry of spambots. As a stopgap measure, the administrator instituted some targeted substitutions, such as changing "hardcore" to "boring".
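For anyone without forum software that does this out of the box, word-for-word substitution could be sketched roughly as follows. The table holds only the one pair mentioned above; the names `SUBSTITUTIONS`, `build_pattern`, and `substitute` are made up for the example. Matching on whole words (`\b`) is a design choice that avoids rewriting a longer word or name that merely contains a listed word.

```python
import re

# Substitution table: the single entry is the example from the post above.
SUBSTITUTIONS = {"hardcore": "boring"}


def build_pattern(words):
    """Compile one case-insensitive, whole-word pattern for all listed words."""
    # Longest first, so a longer entry wins over a shorter one it starts with.
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


PATTERN = build_pattern(SUBSTITUTIONS)


def substitute(text):
    """Replace every listed word with its configured substitute."""
    return PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(1).lower()], text)
```

With this table, `substitute("Hardcore deals here")` gives `"boring deals here"`, while a word like `"hardcoreness"` is left alone because the match must end at a word boundary.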

Have you set up a system for dealing with links in posts, for example by not letting people post links until they're established members? (This doesn't have to be very long, provided it's a combination of postcount and time-on-site.) Or are you mainly concerned with people using vocabulary that will make other people avoid the site?
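A rough sketch of that "no links until established" rule, combining post count and time on site. The thresholds (`MIN_POSTS`, `MIN_MEMBERSHIP`), the function name `may_post`, and the very simple link detection are all assumptions chosen for illustration, not values anyone in this thread recommended.

```python
import re
from datetime import datetime, timedelta

# Very rough link detector (assumption): catches http(s):// and www. prefixes only.
LINK_RE = re.compile(r"https?://|www\.", re.IGNORECASE)

MIN_POSTS = 5                        # hypothetical post-count threshold
MIN_MEMBERSHIP = timedelta(days=3)   # hypothetical time-on-site threshold


def may_post(text, post_count, joined_at, now):
    """Allow link-free posts from anyone; allow links only from established members."""
    if not LINK_RE.search(text):
        return True
    return post_count >= MIN_POSTS and now - joined_at >= MIN_MEMBERSHIP
```

Requiring both conditions means a bot cannot qualify just by waiting, or just by posting a burst of filler on day one.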

How clever of you to teach your program to recognize legitimate Thai names. (Word-final, preceded by eight or more letters...) Another thing to look out for is spurious mixing of letters and numbers like, ahem, cough-cough, "p0rn". That can be a flag for auto-deletion. Or not, depending on venue.
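One way to flag that kind of letter/number mixing is to look for tokens with digits sandwiched between letters. This is only a heuristic sketch, and the function name `find_mixed_tokens` is made up here; it will also catch some legitimate tokens (for example "h2o"), so it is probably better suited to flagging than to outright deletion.

```python
import re

# Letters, then digits, then at least one more letter, all within one token.
MIXED_RE = re.compile(r"\b[A-Za-z]+\d+[A-Za-z]\w*\b")


def find_mixed_tokens(text):
    """Return tokens that have digits sandwiched between letters."""
    return MIXED_RE.findall(text)
```

For example, `find_mixed_tokens("cheap p0rn and v1agra")` returns `["p0rn", "v1agra"]`, while plain numbers and tokens that merely end in digits, such as "2017" or "mp3", are not flagged.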

phranque

5:23 am on Jan 31, 2017 (gmt 0)

a lot of your answers depend on the community (if any) of your contributors of UGC and will probably change over time.

for example, in this forum there are elements of everything you discussed in the OP happening (except for the once/day posting limitation).

tangor

6:23 am on Jan 31, 2017 (gmt 0)

Make the determination of why you want to filter for "bad words" in the first place. That will tell you what to do.

In most cases an automated word substitution will do the job of keeping the site healthy as far as search engines are concerned. Most users will recognize such as a proactive measure on your part and generally accept it.

Obvious spam? Most sites that care about their ranking and visitors will delete it upon discovery.

As far as offering the user an opportunity to clean up a post and resubmit .... that's more significant in work/time and something you do NOT want to set up as automated to the user. Explanations are required and while there might be a half dozen stock replies, you'd still want to make those decisions rather than rely on a script.

Somewhere on your site should be a TOS that requires language conformity, i.e.: "No profanity allowed". This is helpful in dealing with users who seem to feel they can say whatever they like on somebody's website.

Me? I just delete the non-compliant messages. If the user becomes a problem they are banned and blocked. Whatever you do, just be consistent!

NickMNS

4:35 pm on Jan 31, 2017 (gmt 0)

Thanks for the input.

Based on your advice I am going to design the system such that any comment that the filter catches will be submitted for moderation. Then a human (me!) will review it and decide to delete it or correct it. Do you think it is worthwhile to randomly sample posts that are not caught by the filters, just to be sure? Take, say, 2 to 5% of all posts and send them to moderation?
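As a sketch of what that routing might look like: flagged comments always go to the moderation queue, and a random fraction of the rest is sampled into it. The names `route`, `is_flagged`, and `SAMPLE_RATE` are hypothetical, and the 5% figure is just the upper end of the range floated above. The random source is passed in so the behaviour can be tested deterministically.

```python
import random

SAMPLE_RATE = 0.05  # assumed: upper end of the 2 to 5% range mentioned above


def route(comment, is_flagged, rng=random):
    """Return "moderate" or "publish" for a submitted comment.

    is_flagged: callable that returns True when the filter catches the comment.
    rng: any object with a random() method returning a float in [0, 1).
    """
    if is_flagged(comment):
        return "moderate"  # the filter caught it: a human decides
    if rng.random() < SAMPLE_RATE:
        return "moderate"  # spot check of a comment the filter passed
    return "publish"
```

On a low-volume site the sampled queue will be nearly empty most days, so the rate is something to revisit as the number of posts grows.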

tangor

5:24 pm on Jan 31, 2017 (gmt 0)

depends on the number of posts per day.

martinibuster

1:52 am on Feb 1, 2017 (gmt 0)

Aside from spammers, I operate under the idea that every visitor/guest is important. I try to work with people by educating them on policy. Most people get it and respect the rules because they benefit and want to fit in or keep benefiting. So if possible, moderation and a gentle reminder.

Also, moderation is best when done sparingly, or hardly at all.

tangor

4:17 am on Feb 1, 2017 (gmt 0)

Moderation is a two-edged sword. Pre-mod (must approve before posting) makes the moderator liable. Post-mod removes some of the liability if acted upon quickly.

Users are important. Equally important, as martinibuster suggests, is setting up a policy and making sure it is read before participation!

If UGC is key to growing the site then any effort to educate the users before they get started is time and effort well spent. However, if the UGC tends toward commentary that might have such objectionable language on a regular basis then installing a filter (or series of filters) will help with moderation and that moderation should be swift and consistent. With the speed with which sites are indexed these days it won't take more than a few bad words slipping through to have an impact. Pre-moderation will deal with that possibility best.

Most of the time it is the topic, the passion, and the users which will determine how that UGC is moderated.

Also keep in mind that many sites have either suspended or deleted their commenting systems simply because the moderation becomes either impossible, or loses whatever value it might have had when compared to other metrics.

lucy24

7:55 pm on Feb 1, 2017 (gmt 0)

Do you think it is worth while to randomly sample posts

Surely you do that anyway? Maybe not systematically and formally--but it is your own site. You must glance at random pages now and then. You'll spot if there is stuff going on that you hadn't budgeted for.

In addition to everything else, program your system to tell you if a given topic draws an exceptional number of posts. That would be yet another thing to take a closer look at and see if it's good or bad.
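A very rough sketch of such an alert, assuming you can already produce a count of recent posts per topic. "Exceptional" is defined here, arbitrarily, as more than `factor` times the median topic and at least `min_posts` in absolute terms; both defaults and the function name `busy_topics` are placeholders to tune.

```python
from statistics import median


def busy_topics(post_counts, factor=5, min_posts=20):
    """Return topics whose post count is far above the typical (median) topic.

    post_counts: dict mapping a topic id to its number of recent posts.
    """
    if not post_counts:
        return []
    typical = median(post_counts.values())
    return sorted(
        topic
        for topic, count in post_counts.items()
        if count >= min_posts and count > factor * typical
    )
```

With counts of `{"a": 3, "b": 4, "c": 2, "d": 60}` the median is 3.5, so only `"d"` is reported. Whether that turns out to be a lively discussion or a spam run is still for a human to judge.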

NickMNS

7:59 pm on Feb 1, 2017 (gmt 0)

What I meant was to bake it into the program so that it would send, say, 5% of all posts to moderation automatically.

For right now I have elected not to program it in. I have enough work just building the simple mod process.

csdude55

1:16 am on Mar 18, 2017 (gmt 0)

I know this is a late reply, sorry.

My sites include classifieds and message boards, so I deal with profanity and spam regularly.

Profanity, I just convert to **** and allow the post. I use 4 * regardless of the length of the word. But I'll warn you, this has become quite a nuisance over the years! Thanks to people trying to get around the filters (eg, using "$h!+" or "funk you"), I've developed a very complicated, much longer system than I originally intended.
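To give a feel for why this grows, here is a small sketch (not my actual system) of the fixed-width masking plus one layer of evasion handling: look-alike characters are mapped back to letters before the word is checked. The blocklist uses mild stand-in words, and the `LOOKALIKES` table and the name `mask_profanity` are assumptions for the example. Sound-alike tricks such as "funk you" are not handled at all.

```python
import re

BLOCKLIST = {"darn", "heck"}  # mild stand-ins for the real word list
MASK = "****"                 # always four asterisks, whatever the word length

# Assumed look-alike table; real evasions are far more varied than this.
LOOKALIKES = str.maketrans(
    {"$": "s", "!": "i", "+": "t", "@": "a", "0": "o", "1": "i", "3": "e"}
)


def mask_profanity(text):
    """Replace blocklisted words, including simple look-alike spellings, with MASK."""

    def check(match):
        token = match.group(0)
        core = token.rstrip(".,;:!?")  # the token without trailing punctuation
        tail = token[len(core):]
        # Try the whole token first, since "!" may be part of the disguise,
        # then the token with trailing punctuation set aside.
        for candidate, suffix in ((token, ""), (core, tail)):
            if candidate.lower().translate(LOOKALIKES) in BLOCKLIST:
                return MASK + suffix
        return token

    return re.sub(r"\S+", check, text)
```

For example, `mask_profanity("Well d@rn, that is h3ck!")` returns `"Well ****, that is ****!"`, and a harmless word like "heckle" passes through untouched because whole tokens are compared rather than substrings.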

Spam... new users, it's just discarded and forgotten. I used to get a few hundred of these a day until I started blocking most non-US IP addresses through the firewall.

Spam... established users, I discard and send them a Private Message reminding them of the rules.