Matching when string is delimited with unexpected \W

I'm working on my profanity filters, and a common problem is users trying to get around them. One such method is to insert non-alphanumeric characters between the letters.

For example, let's say that I'm filtering "foo". They might write "f-o-o" or "f o o".

My filter converts the matched text to **** (or whatever I've specified), so it looks like this:

$str = 'this is some foo bar crap';

# real data comes from MySQL, but for simplicity...
%pattern = (
 'foo' => '****',
 'bar' => '***'
);

for (keys %pattern) {
 $str =~ s/$_\b/$pattern{$_}/gi;
}

My real list is currently kinda messy looking, with \W* between every letter:

%pattern = (
 'f\W*o\W*o' => '****',
 'b\W*a\W*r' => '***'
);

Can you think of a way to make it match and replace if the text is delimited in such a way, without explicitly writing \W* in the pattern?

I know that I could make it match by eliminating all \W from $str, like so:

# typed for this post, not tested
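# copy first, then strip; a plain "$newStr = $str =~ s/\W//g" would
# modify $str and store only the number of substitutions
($newStr = $str) =~ s/\W//g;

for (keys %pattern) {
 # no \b here, since the word boundaries are gone after stripping
 if ($newStr =~ /$_/i) {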
  # but then how do I know what to replace in $str?
 }
}
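
For what it's worth, one way to keep the list readable is to store the plain words and generate the \W* version at run time. The following is only a minimal sketch under that assumption: delimited_pattern() and %replacement are made-up names for illustration, and the words are treated as literal text rather than as regexes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): turn a plain word into a
# pattern that tolerates non-word characters between its letters.
sub delimited_pattern {
 my ($word) = @_;

 # escape each character, then join the characters with \W*
 return join '\W*', map { quotemeta } split //, $word;
}

# plain words only; the messy version is built on the fly
my %replacement = (
 'foo' => '****',
 'bar' => '***',
);

my $str = 'this is some f-o-o bar crap';

for my $word (keys %replacement) {
 my $re = delimited_pattern($word);
 $str =~ s/$re\b/$replacement{$word}/gi;
}

print "$str\n";
```

Keep in mind that \W* also spans ordinary spaces, so letters that happen to straddle two innocent words can be caught as well.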

$str = 'this is some f-o-o bar crap';
($newStr = $str) =~ s/\W//g;

# Result: "thisissomefoobarcrap"

for (keys %pattern) {
 # "foo" matches
 if ($newStr =~ /$_/i) {
  # so I know that the problem word exists, but I don't know how
  # to tell it to use that info to replace f-o-o
 }
}
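
One possible answer to "how do I know what to replace" is to remember, for every character kept in the stripped copy, where it sat in the original string. The sketch below assumes plain literal words in a hypothetical %mask hash; @origin and @spans are likewise names invented for this example, not anything from the thread.

```perl
use strict;
use warnings;

my $str  = 'this is some f-o-o bar crap';
my %mask = ('foo' => '****');   # plain words, no \W* needed

# Build the stripped copy and remember where each kept character came from.
my ($stripped, @origin) = ('');
while ($str =~ /(\w)/g) {
 $stripped .= $1;
 push @origin, pos($str) - 1;   # offset of this character in $str
}

# Collect the spans to replace, as [start, length, replacement].
my @spans;
for my $word (keys %mask) {
 while ($stripped =~ /\Q$word\E/gi) {
  my $first = $origin[ $-[0] ];       # original offset of the first letter
  my $last  = $origin[ $+[0] - 1 ];   # original offset of the last letter
  push @spans, [ $first, $last - $first + 1, $mask{$word} ];
 }
}

# Replace from right to left so the earlier offsets stay valid.
for my $span (sort { $b->[0] <=> $a->[0] } @spans) {
 substr($str, $span->[0], $span->[1]) = $span->[2];
}

print "$str\n";
```

Note that matching against the stripped copy ignores word boundaries entirely, so letters spread over two ordinary words would be masked too.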

# DELIMIT PATTERN WITH \W* || <[^>]+> BETWEEN EACH CHARACTER TO CATCH PEOPLE
# TRYING TO GET PAST THE FILTER
# Example, using 'f-o-o' or 'f<br>o<br>o' instead of 'foo'
#
# Notes:
# 1. I'm using the /x modifier throughout to make it easier to read, but keep
#    in mind that this negates spaces so it might throw things off in practice
#
# 2. \Q ... \E escapes patterns in regexes, so I used that when I needed to
#    escape several things in a row

# the test string
$_ = 'this is $ome+hing like some f-o-o bar crap';

# use single quotes to define everything literally, but remember that regex
# reads it as double quotes
$pattern = 'f[o0]o|[$s]ome(?:[t+]hing)?|(?!this|that)other';

# Step 1, I'm going to use ï as a delimiter so remove it from $_ to
# ensure no accidental matches later
s/ï/i/g;

# Step 2, find anything between [ .. ] or (?! .. ) in $pattern, add it to
# %temp, then replace it in $pattern with ï$xï
$x = 1;

while (
 $pattern =~ m{(
  \[[^]]+\] |
  \Q(?!\E[^)]+\Q)\E
 )}x

 # safety net, real value will be number of potential matches in $pattern
 && $x < 10
) {
 $temp{$x} = $1;
 $pattern =~ s/\Q$1\E/ï$xï/;
 $x++;
}

# at this point:
# $pattern = 'fï1ïo|ï2ïome(?:ï3ïhing)?|ï4ïother';
# $temp{'1'} = '[o0]';
# $temp{'2'} = '[$s]';
# $temp{'3'} = '[t+]';
# $temp{'4'} = '(?!this|that)';

# Step 3, replace all optional whitespace with (\W|<[^>]+>)*
$delimiter = '(?:\\W|<[^>]+>)*';
$pattern =~ s/\s[?*]/$delimiter/g;

# Step 4, substitute all \w with \w(\W|<[^>]+>)* in $pattern unless it's
# followed by *, ?, (, |, at the end of the string (\Z), or
# ), )?, )* AND followed by | or \Z
$pattern =~ s{
 ( ï\dï | [a-z] )
 (?!
  [*?(|] |
  \Z |
  \)[?*]?( \| | \Z )
 )
}
{$1$delimiter}xg;

# at this point:
# $pattern = 'f(?:\W|<[^>]+>)*ï1ï(?:\W|<[^>]+>)*o|ï2ï(?:\W|<[^>]+>)*o(?:\W|<[^>]+>)*m(?:\W|<[^>]+>)*e(?:ï3ï(?:\W|<[^>]+>)*h(?:\W|<[^>]+>)*i(?:\W|<[^>]+>)*n(?:\W|<[^>]+>)*g)?|ï4ï(?:\W|<[^>]+>)*o(?:\W|<[^>]+>)*t(?:\W|<[^>]+>)*h(?:\W|<[^>]+>)*e(?:\W|<[^>]+>)*r';

# Step 5, go back and replace each ï1ï-style placeholder with its original
for (sort keys %temp) {
 $pattern =~ s/ï$_ï/$temp{$_}/g;
}

# at this point:
# $pattern = 'f(?:\W|<[^>]+>)*[o0](?:\W|<[^>]+>)*o|[$s](?:\W|<[^>]+>)*o(?:\W|<[^>]+>)*m(?:\W|<[^>]+>)*e(?:[t+](?:\W|<[^>]+>)*h(?:\W|<[^>]+>)*i(?:\W|<[^>]+>)*n(?:\W|<[^>]+>)*g)?|(?!this|that)(?:\W|<[^>]+>)*o(?:\W|<[^>]+>)*t(?:\W|<[^>]+>)*h(?:\W|<[^>]+>)*e(?:\W|<[^>]+>)*r';

# Step 6, $pattern is done so let's do it
# Note, I'm not sure if using \b here is necessary, but it doesn't hurt
# I found that \b[$s] didn't match like expected, so lucy24 gave me
# the (^|\W|\$) alternative idea
s/(\b|^|\W|\$)(?:$pattern)(?:e[dr]|ing?|e?s|y)?\b/$1****/xgi;

print;

# Returns:
# this is **** like **** **** bar crap
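
To illustrate why the tag alternative is in the delimiter at all, here is a small sketch comparing a bare \W* separator with the (?:\W|<[^>]+>)* delimiter from the post. The variable names $plain and $with_tags are made up for this example, and 'foo' stands in for a real list entry.

```perl
use strict;
use warnings;

# Same delimiter as in the post: any run of non-word characters or tag-like <...> chunks.
my $delimiter = '(?:\W|<[^>]+>)*';

# Hypothetical list entry 'foo', with its letters joined two different ways.
my $plain     = join '\W*',      map { quotemeta } split //, 'foo';
my $with_tags = join $delimiter, map { quotemeta } split //, 'foo';

for my $text ('f-o-o', 'f o o', 'f<br>o<br>o') {
 printf "%-12s  \\W* only: %-3s  with tags: %s\n",
  $text,
  ($text =~ /\A$plain\z/i     ? 'yes' : 'no'),
  ($text =~ /\A$with_tags\z/i ? 'yes' : 'no');
}
```

The bare \W* version stops at the "br" inside the tag, because those letters are word characters; the alternation lets the whole <br> be skipped as one unit.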

# the test string
$_ = 'this is $ome+hing like some f-o-o bar crap';

# the test pattern
$pattern = 'f[o0]o|[$s]ome(?:[t+]hing)?|(?!this|that)other';

# I added these suffixes here so that they'll be delimited, too
$pattern = '(?:' . $pattern . ')(?:e[dr]|ing?|e?s|y)?';

@more = $pattern =~ m{(
 \[[^]]+\] |
 \Q(?<\E[^)]+\Q)\E |
 \Q(?!\E[^)]+\Q)\E
)}xg;

unshift(@more, '');

for ($x = 1; $x <= $#more; $x++) {
 $hash{$more[$x]} = 'ï' . $x . 'ï';
 $temp{$x} = $more[$x];

 if ($patternFix) { $patternFix .= '|'; }
 $patternFix .= '(' . quotemeta($more[$x]) . ')';
}

# at this point
# $patternFix = '(\[o0\])|(\[\$s\])|(\[t\+\])|(\(\?\!this\|that\))|(\[dr\])';

$pattern =~ s/$patternFix/$hash{$+}/gi;

# at this point
# $pattern = '(?:fï1ïo|ï2ïome(?:ï3ïhing)?|ï4ïother)(?:eï5ï|ing?|e?s|y)?';

$delimiter = '(?:\\W|<[^>]+>)*';
$pattern =~ s/\s[?*]/$delimiter/g;

$pattern =~ s{
 ( ï\dï | [a-z] )
 (?!
  [*?(|] |
  \)[?*]? |
  \Z
 )
}
{$1$delimiter}xg;

$pattern =~ s/ï(\d+)ï/$temp{$+}/gi;

s/(\b|^|\W|\$)(?:$pattern)\b/$1****/gi;

print;

# Returns:
# this is **** like **** **** bar crap

# the test pattern I'm using
$pattern = 'f[o0]o|[$s]ome(?:[t+]hing)?|(?!this|that)other';

# this pushes $1 of the regex match to the @more array
# so anything in $pattern that matches [...], (?<...>, or (?!...) is
# pushed to @more
@more = $pattern =~ m{(
 \[[^]]+\] |
 \Q(?<\E[^>]+\Q>\E |
 \Q(?!\E[^)]+\Q)\E
)}xg;

# since regex starts at $1 instead of 0, I'm adding a blank
# index to the [0] position
unshift(@more, '');

# loop from 1 to the last index of @more
for ($x = 1; $x <= $#more; $x++) {
 # add to a %hash associative array
 # since the first match in $pattern should be [o0], $more[1]
 # should equal [o0]. So now:
 # $hash{'[o0]'} = 'ï1ï';
 # $temp{'1'} = '[o0]';
 $hash{$more[$x]} = 'ï' . $x . 'ï';
 $temp{$x} = $more[$x];

 # quotemeta() auto-quotes meta characters, so now:
 # $patternFix = '(\[o0\])';
 if ($patternFix) { $patternFix .= '|'; }
 $patternFix .= '(' . quotemeta($more[$x]) . ')';

 # on the next loop, $patternFix would become:
 # $patternFix = '(\[o0\])|(\[\$s\])';
}

# assuming that $_ and $pattern are set previously
s/ï/i/g;

# this is marginally faster than the while() statement I'd
# used earlier
@more = $pattern =~ m{(
 # catch [...]
 \[.+?\] |

 # catch group names, (?<name>foo)
 \Q(?<\E.+?\Q>\E |

 # negative lookahead
 # I don't use positive lookahead anywhere, but if I did
 # then it could be added here
 \Q(?!\E.+?\Q)\E
)}xg;

unshift(@more, '');

my %temp;

for ($x = 1; $x <= $#more; $x++) {
 $temp{$x} = $more[$x];
 $pattern =~ s/\Q$more[$x]\E/ï$xï/g;
}

$delimiter = '(?:\W|<.+?>)*';

$pattern =~ s{
 ( ï\dï | [a-z] )
 (?!
  \( |
  \| |
  \] |
  \)[?*] |
  \Z
 )
}
{$1$delimiter}xg;

for (sort keys %temp) {
 $pattern =~ s/ï$_ï/$temp{$_}/g;
}

# I discovered that \w+ and [ab]+ were becoming
# [ab](?:\W|<[^>]+>)*+, so this converts it back to
# (?:[ab](?:\W|<[^>]+>)*)+
$pattern =~ s{(\w|\[[^]]+\])\Q$delimiter\E(\s*)([?+*])}
 {(?:$1$delimiter)$3$2}g;

# wrapped in (?: ) so the prefix and \b apply to every alternative
s/(\b|^|\W|\$)(?:$pattern)\b/$1****/gi;
