Forum Moderators: coopster & phranque

Message Too Old, No Replies

Matching when string is delimited with unexpected \W

         

csdude55

6:53 pm on Oct 15, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm working on my profanity filters, and a common problem is when users try to get around them. One such method is to add a non-alpha character.

Example, let's say that I'm filtering "foo". They might say "f-o-o" or "f o o".

My filter converts them matched text to **** (or whatever I've specified), so it looks like this:

$str = 'this is some foo bar crap';

# real data comes from MySQL, but for simplicity...
%pattern = (
'foo' => '****',
'bar' => '***'
);

for (keys %pattern) {
$str =~ /$_\b/$pattern{$_}/gi;
}


My real list is currently kinda messy looking with \W* between every digit:

%pattern = (
'f\W*o\W*o' => '****',
'b\W*a\W*r' => '***'
);


Can you think of a way to make it match and replace if the text is delimited in such a way, without explicitly saying \W?* in the pattern?

I know that I could make it match by eliminating all \W from $str, like so:

# typed for this post, not tested
$newStr = $str =~ s/\W//g;

for (keys %pattern) {
if ($newStr =~ /$_\b/i) {
# but then how do I know what to replace in $str?
}
}

lucy24

8:13 pm on Oct 15, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This sounds like one for bench-testing, since you’re doing the same thing either way: filter out the \W parts of any string. If both approaches are equally easy for you to read and debug, is there any noticeable difference in speed?

but then how do I know what to replace in $str
Wouldn’t you still be running the same word filter? The only difference is whether you’re excluding the \W beforehand or concurrently.

A more challenging question is how to distinguish spurious \W insertions from legitimate word breaks: I assume you wouldn’t want to filter someone who said “And then you wash it in vinegar”, while retaining the ability to filter someone who says “f o o b a r”

:: vague mental association here with “fubar” ::

deploying intentionally inserted spaces.

It also of course depends on how large the forum-or-whatever is. Is it too large to manually keep track of people who deliberately try to bypass word filters, and seek them out for slaps on the wrist? What about people who sneakily replace one or more letters with a non-ASCII lookalike? Or is it an Old Skool site like the present one, where anything outside 1252 gets entitized?

We begin to enter “It is impossible to make anything foolproof, because fools are so damned ingenious” territory.

csdude55

12:26 am on Oct 16, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If both approaches are equally easy for you to read and debug, is there any noticeable difference in speed?

I can't honestly say that it's equally easy to read, especially when the pattern is already complicated. Like:

\b((?:ba+d+|candy|dum+b*|fa+t+|hot|jack|smart|stupid|lard)\W*)?[a@][sz\$][sz\$](?:clown|hat|heat|h+o+l+e|wipe)?(?!et|ert|ist|um|ess|ign|ure|ault)


Which now has to become the impossible to read:

\b((?:b\W*a+\W*d+|c\W*a\W*n\W*d\W*y|d\W*u\WebmasterWorld*+b*|f\W*a+\W*t+|h\W*o\W*t|j\W*a\W*c\W*k|s\WebmasterWorld*a\W*r\W*t|s\W*t\W*u\W*p\W*i\W*d|l\W*a\W*r\W*d)\W*)?[a@]\W*[sz\$]\W*[sz\$](?:c\W*l\W*o\W*w\W*n|h\W*a\W*t|h\W*e\W*a\W*t|h+\W*o+\W*l+\W*e|w\W*i\W*p\W*e)?(?!et|ert|ist|um|ess|ign|ure|ault)


Wouldn’t you still be running the same word filter? The only difference is whether you’re excluding the \W beforehand or concurrently.

Not exactly. Let's say that I have this:

$str = 'this is some f-o-o bar crap';

$newStr = $str =~ s/\W//g;
# Result: "thisissomefoobarcrap"

for (keys %pattern) {
# "foo" matches
if ($newStr =~ /$_/i) {
# so I know that the problem word exists, but I don't know how
# to tell it to use that info to replace f-o-o
}
}


A more challenging question is how to distinguish spurious \W insertions from legitimate word breaks: I assume you wouldn’t want to filter someone who said “And then you wash it in vinegar”, while retaining the ability to filter someone who says “f o o b a r”

Well. Shoot. That's an excellent point :'-(

It also of course depends on how large the forum-or-whatever is. Is it too large to manually keep track of people who deliberately try to bypass word filters, and seek them out for slaps on the wrist?

Definitely too large... we have about 10,000 posts per day.

What about people who sneakily replace one or more letters with a non-ASCII lookalike? Or is it an Old Skool site like the present one, where anything outside 1252 gets entitized?

I use contenteditable so I DO get a lot of weird stuff submitted! For the most part I auto-convert the characters that I know to something recognizable (eg, tr/ïöš/ios/) and strip the ones that I don't. This hasn't been a major issue YET, but now that you've said it... LOL

Right now I guess I'm leaning towards just emailing me if $newStr matches. It gets more complicated when I have to worry about HTML tags, so I guess the working pattern is more like:

$newStr = $str =~ s/\W|<[^>+]//g;

for (keys %pattern) {
if ($newStr =~ /$_/i) {
# email me
last;
}
}


The long term problem is that I can't be here 24/7, so by time I see it then the fight has blown WAY out of hand! That can end up taking hours to fix, and inevitably runs off users anyway. Which is why I was really hoping for a more automated fix, but you're right, the only way to do that is to manually code the \W* the whole way and try to allow for acceptable variations :-/

lucy24

4:26 am on Oct 16, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Can you capture the matched pattern? That is, if you’ve got something that says

(foo|bar|diddle|dumpling|jiggers)

then whatever matches would be expressible as \1 or $1 or whatever your RegEx dialect uses. And then if you’re doing a different replacement for every word--which can be very entertaining and clever*--you plug that capture into the replacement.

* I once knew a forum that was being hit pretty intensely by spammers on a particular theme. The administrator made a series of specific replacements, along the lines of “hot action” >> “inutterable tedium” and so on.

csdude55

5:42 am on Oct 16, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As far as I know, it returns the text from the string that matched, not the pattern itself. So:

$str = 'this is some f-o-o bar crap';

if ($str =~ /(f\W*o\W*o|lorem|ipsum)/i) {
print $1;
}


would print "f-o-o".

It would be super helpful if I could find a way to return the pattern that matched (meaning, return f\W*o\W*o\W*), but so far no luck :-/ It would have been cool if it would let me name the group the same as the regex, but the name has to be alphanumeric so that doesn't help, either.

If I'm misunderstanding, what did you have in mind?

lucy24

4:08 pm on Oct 16, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You’d have to take the route of stripping the \W before applying the filters. f-o-o becomes foo and then that’s your capture.

csdude55

9:59 pm on Oct 17, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, this got a LOT more complicated than intended, but so far this seems like it works. I'm posting this for the sake of posterity, but if anyone seems improvements then I'm all ears... erm, eyes!

# DELIMIT PATTERN WITH \W* || <[^>]+> BETWEEN EACH CHARACTER TO CATCH PEOPLE
# TRYING TO GET PASS THE FILTER
# Example, using 'f-o-o' or 'f<br>o<br>o' instead of 'foo'
#
# Notes:
# 1. I'm using the /x modifier throughout to make it easier to read, but keep
# in mind that this negates spaces so it might throw things off in practice
#
# 2. \Q ... \E escapes patterns in regexes, so I used that when I needed to
# escape several things in a row

# the test string
$_ = 'this is $ome+hing like some f-o-o bar crap';

# use single quotes to define everything literally, but remember that regex
# reads it as double quotes
$pattern = 'f[o0]o|[$s]ome(?:[t+]hing)?|(?!this|that)other';

# Step 1, I'm going to use ï as a delimiter so remove it from $_ to
# ensure no accidental matches later
s/ï/i/;

# Step 2, find anything between [ .. ] or (?! .. ) in $pattern, add it to
# %temp, then replace it in $pattern with ï$xï
$x = 1;

while (
$pattern =~ m{(
\[[^]]+\] |
\Q(?!\E[^)]+\Q)\E
)}x

# safety net, real value will be number of potential matches in $pattern
&& $x < 10) {
$temp{$x} = $1;

$pattern =~ s/\Q$1\E/ï$xï/;

$x++;
}

# at this point:
# $pattern = 'fï1ïo|ï2ïome(?:ï3ïhing)?|ï4ïother';
# $temp{'1'} = '[o0]';
# $temp{'2'} = '[$s]';
# $temp{'3'} = '[t+]';
# $temp{'4'} = '(?this_that)';

# Step 3, replace all optional whitespace with (\W|<[^>]+>)*
$delimiter = '(?:\\W|<[^>]+>)*';

$pattern =~ s/\s[?*]/$delimiter/g;

# Step 4, substitute all \w with \w(\W|<[^>]+>)* in $pattern unless it's
# followed by *, ?, (, |, at the end of the string (\Z), or
# ), )?, )* AND followed by | or \Z

$pattern =~ s{
(
ï\dï |
[a-z]
)
(?!
[*?(|] |
\Z |
\)[?*]?(
\| |
\Z
)
)
}
{$1$delimiter}xg;

# at this point:
# $pattern = 'f(?:\W|<[^>]+>)*ï1ï(?:\W|<[^>]+>)*o|ï2ï(?:\W|<[^>]+>)*o(?:\W|<[^>]+>)*m(?:\W|<[^>]+>)*e(?:ï3ï(?:\W|<[^>]+>)*h(?:\W|<[^>]+>)*i(?:\W|<[^>]+>)*n(?:\W|<[^>]+>)*g)?|ï4ï(?:\W|<[^>]+>)*o(?:\W|<[^>]+>)*t(?:\W|<[^>]+>)*h(?:\W|<[^>]+>)*e(?:\W|<[^>]+>)*r';

# Step 5, go back and replace :ï:1:ï: with their original
for (sort keys %temp) {
$pattern =~ s/ï$_ï/$temp{$_}/g;
}

# at this point:
# $pattern = 'f(?:\W|<[^>]+>)*[o0](?:\W|<[^>]+>)*o|[$s](?:\W|<[^>]+>)*o(?:\W|<[^>]+>)*m(?:\W|<[^>]+>)*e(?:[t+](?:\W|<[^>]+>)*h(?:\W|<[^>]+>)*i(?:\W|<[^>]+>)*n(?:\W|<[^>]+>)*g)?|(?!this|that)(?:\W|<[^>]+>)*o(?:\W|<[^>]+>)*t(?:\W|<[^>]+>)*h(?:\W|<[^>]+>)*e(?:\W|<[^>]+>)*r';


# Step 6, $pattern is done so let's do it
# Note, I'm not sure if using \b here is necessary, but it doesn't hurt
# I found that \b[$s] didn't match like expected, so lucy24 gave me
# the (^|\W|\$) alternative idea
s/(\b|^|\W|\$)(?:$pattern)(?:e[dr]|ing?|e?s|y)?\b/$1****/xgi;

print;
# Returns:
# this is **** like **** **** bar crap

lucy24

9:57 pm on Oct 18, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<not wholly unrelated tangent>
In one of those utterly predicable coincidences, discussion on a {redacted}-based comment thread drifted onto the subject of Words. In the course of to and fro, I learned that on this particular platform:
--you CAN say “dick”
--you CAN NOT say “dickhead” (see below)
--you CAN say “dìckhead” (non-ASCII but still securely Latin-1)
--you CAN NOT say “d*i*c*k*h*e*a*d” (using asterisks)

Fortunately, the present forum does not have the dreaded “Your post is being held for moderation”, and it does have a Preview, so I can nip in and make changes if needed. (In this case, a judicious insertion of format tags in example #2.)

Fortunately, the present site’s moderators know that this whole discussion is Strictly For Medicinal Purposes.
</tangent>

csdude55

9:16 pm on Oct 19, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well... I've spent a few days on this and I've gotten it to work without the massive loops! But... bench testing shows the new style to be MUCH slower than the original!

# the test string
$_ = 'this is $ome+hing like some f-o-o bar crap';

# the test pattern
$pattern = 'f[o0]o|[$s]ome(?:[t+]hing)?|(?!this|that)other';

# I added these suffixes here so that they'll be delimited, too
$pattern = '(?:' . $pattern . ')(?:e[dr]|ing?|e?s|y)?';

@more =
$pattern =~ m{(
\[[^]]+\] |
\Q(?ï<\E[^)]+\Q)\E |
\Q(?!\E[^)]+\Q)\E
)}xg;

unshift(@more, '');

for ($x = 1; $x <= $#more; $x++) {
$hash{$more[$x]} = 'ï' . $x . 'ï';
$temp{$x} = $more[$x];

if ($patternFix) { $patternFix .= '|'; }
$patternFix .= '(' . quotemeta($more[$x]) . ')';
}

# at this point
# $patternFix = '(\[o0\])|(\[\\\$s\])|(\[t\+\])|(\(\?\!this\|that\))|(\[dr\])';

$pattern =~ s/$patternFix/$hash{$+}/gi;

# at this point
$pattern = '(?:fï1ïo|ï2ïome(?:ï3ïhing)?|ï4ïother)(?:eï5ï|ing?|e?s|y)?';

$delimiter = '(?:\\W|<[^>]+>)*';

$pattern =~ s/\s[?*]/$delimiter/g;

$pattern =~ s{
(
ï\dï |
[a-z]
)
(?!
[*?(|] |
\)[?*]? |
\Z
)
}
{$1$delimiter}xg;

$pattern =~ s/ï(\d+)ï/$temp{$+}/gi;

s/(\b|^|\W|\$)(?:$pattern)\b/$1$2****/gi;

print;
# Returns:
# this is **** like **** **** bar crap


I only tested using 3 classes in the pattern instead of my full list, but the bench test over 1000 iterations showed:

Original with expressions in a loop: 0.0683789253234863s
This version without the expressions in a loop: 5.35150384902954s

:-O :-O :-O

I know that @phranque mentioned before that regex engines (especially in perl) are likely to be highly optimized, but still, I expected at least a slight improvement... not 100 times slower!

I modified the expressions to eliminate /x, then for testing removed quotemeta() and the $+ references, but none of those were the bottlenecks. I'm still poking around to see if I can find it, but right now I'm flabbergasted.

csdude55

5:46 am on Oct 20, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The performance hit looks to be coming from this line:

$pattern =~ s/$patternFix/$hash{$+}/gi;

I'm testing with 46 patterns. If I remove from that line down then 100 iterations is 0.0136s, but just putting that one line back in takes it up to 4.30s. Surprisingly, though, the issue appears to be the MATCH, not the substitution! Because changing it to this has no impact on the speed:

$pattern =~ s/$patternFix/TEMP/gi;

So I started playing with the pattern, and found that each pattern that I add increases the processing time by about 0.3s. Apparently it takes a LOT more time for Perl to store the matches in memory?

Either way, unless someone can suggest a way to make this line process faster then it appears that having the expressions in a loop is still MUCH faster :-/

lucy24

4:24 pm on Oct 20, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



$pattern =~ s/$patternFix/$hash{$+}/gi; 
For those of us who doesn't speak Perl ;) can you translate? (I assume the last bit is the flags for “global” and “case-insensitive”.)

csdude55

8:40 pm on Oct 20, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sure :-)

# the test pattern I'm using
$pattern = 'f[o0]o|[$s]ome(?:[t+]hing)?|(?!this|that)other';

# this pushes $1 of the regex match to the @more array
# so anything in $pattern that matches [...], (?<...>, or (?!...) is
# pushed to @more
@more =
$pattern =~ m{(
\[[^]]+\] |
\Q(?<\E[^>]+\Q>\E |
\Q(?!\E[^)]+\Q)\E
)}xg;

# since regex starts at $1 instead of 0, I'm adding a blank
# index to the [0] position
unshift(@more, '');

# loop from 1 to the length of @more
for ($x = 1; $x <= $#more; $x++) {

# add to a %hash associative array
# since the first match in $pattern should be [o0], $more[1]
# should equal [o0]. So now:
# $hash{'[o0]'} = 'ï1ï';
# $temp{'1'} = '[o0]';
$hash{$more[$x]} = 'ï' . $x . 'ï';
$temp{$x} = $more[$x];

# quotemeta() auto-quotes meta characters, so now:
# $patternFix = '(\[o0\])';
if ($patternFix) { $patternFix .= '|'; }
$patternFix .= '(' . quotemeta($more[$x]) . ')';

# on the next loop, $patternFix would become:
# $patternFix = '(\[o0\])|(\[\\\$s\])';
}


So with all of that said, the line that's causing the performance hit:

$pattern =~ s/$patternFix/$hash{$+}/gi;


$+ was the magic answer that I actually needed; it references the last match. So in this case, it should be substituting the first match ([o0]) to the value of $hash{'1'} (ï1ï), the second match ([\$s]) to $hash{'2'} (ï2ï), and so on.

After this I can go through all of the other characters and add (?:\W|<[^>]+>)* between them, then do another substitute to reverse the ï\dï back to its original value.

For the sake of testing, I removed that first part and hard-coded $patternFix, then substituted with the word "TEST" instead of $hash{$+}. It was still as slow as molasses :-/

(I assume the last bit is the flags for “global” and “case-insensitive”.)

Correct. The /g is mandatory here or it'll stop after the first match, but the /i is optional.

[edited by: phranque at 8:45 pm (utc) on Oct 20, 2021]
[edit reason] disable graphic smile faces [/edit]

csdude55

12:12 am on Oct 21, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've spent enough time on this so I'm about to give up... this is my final version:

# assuming that $_ and $pattern are set previously
s/ï/i/;

# this is marginally faster than the while() statement I'd
# used earlier
@more =
$pattern =~ m{(

# catch [...]
\[.+?\] |

# catch group names, (?<name>foo)
\Q(?<\E.+?\Q>\E |

# negative lookahead
# I don't use positive lookahead anywhere, but if I did
# then it could be added here
\Q(?!\E.+?\Q)\E
)}xg;

unshift(@more, '');

my %temp;
for ($x = 1; $x <= $#more; $x++) {
$temp{$x} = $more[$x];
$pattern =~ s/\Q$more[$x]\E/ï$xï/g;
}

$delimiter = '(?:\W|<.+?>)*';

$pattern =~ s{
(
ï\dï |
[a-z]
)
(?!
\( |
\| |
\] |
\)[?*] |
\Z
)
}
{$1$delimiter}xg;

for (sort keys %temp) {
$pattern =~ s/ï$_ï/$temp{$_}/g;
}

# I discovered that \w+ and [ab]+ were becoming
# [ab](?:\W|<[^>]+>)*+, so this converts it back to
# (?:[ab](?:\W|<[^>]+>)*)+
$pattern =~ s{(\w|\[[^]]+\])\Q$delimiter\E(\s*)([?+*])}
{(?:$1$delimiter)$3$2}g;

s/(\b|^|\W|\$)$pattern\b/$1****/gi;


A bench test of 1000 iterations with 46 potential matches in $pattern takes 1.2867s, which is WAY higher than I want but it's still a whole lot better than the alternative that I showed before.

So I guess I'm bailing on performance gains for now. But if anyone can suggest how to modify the $pattern =~ s{...}{$1$delimiter}xg; line to prevent the *+ issue that I'm "fixing" at the end, it might cut the time down a bit.

lucy24

4:44 pm on Oct 21, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



since regex starts at $1 instead of 0
If perl works the same as Apache--the only context in which I have occasion to use $0--then strictly speaking $1 through $9 are the captures, while $0 is the entire matched pattern. For example, if the pattern is
^34\.(6[4-9]|[7-9]\d)
then $0 is 34.83 (or whatever it happens to be) while $1 would be only 83

In some circumstances, this may be relevant. If your pattern is, say, \w+(ck|ss)\w* then $1 is only "ck" or "ss" while $0 would be the whole word, whatever it is.

csdude55

5:11 pm on Oct 21, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm not sure when it changed, but from the Perldoc:

$& returns the entire matched string. (At one point $0 did also, but now it returns the name of the program.)


Worse:

If your code is to run on Perl 5.16 or earlier, beware that once Perl sees that you need one of $&, $` [the right angle apostrophe], or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program.


I only knew this because I first learned about $` and $' last year and had used them a bit, so I had to go back a few weeks ago and undo all of them! LOL