Forum Moderators: coopster & phranque

Regex not matching when string contains a +

csdude55

6:29 am on Jan 18, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have this substitution:

$str =~ s{
\b
ca\+
\b
}
{cat}xgi;

I intend for this to change "this is a ca+" to "this is a cat", but it's not matching.

If I remove the closing \b, it does match! But then it might match things I don't mean for it to match.

I tried it with quotemeta() and \b\Qca+\E\b, too, but they didn't match, either. And just for fun, I tried double escaping, triple escaping, all the way up to septuple escaping, but that didn't work either.

I'm guessing that all meta characters do this, because using ca^ and then \^ instead of \+ had the same result.

What's the magic trick here?

phranque

7:27 am on Jan 18, 2024 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i'm wondering if the extra whitespace in the regexp is a problem.
i would try this:
$str =~ s/\bca\+\b/cat/xgi;

lucy24

4:38 pm on Jan 18, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I intend for this to change this is a ca+ to this is a cat, but it's not matching
The plus sign is a reserved character.

The pattern ca+ matches ca, caa, caaa and on into infinity. Leave off the word boundary, and it also matches cat, caat, caaat and so on.

To match the literal string “ca+” the pattern would have to be ca\+ with escaped + sign. And you can't use a \b word boundary at the end, since + is already a non-word character.

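Edit: I may have misread OP. But the key point is that \b only matches between a word character and a non-word character, so it cannot match between a non-word character such as + and a space or the end of the string. In fact, \bca\b would match “ca+” (but wouldn’t include the + sign).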
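A minimal sketch of the point above, in Perl since that is what the thread uses. The helper name hits and the sample strings are illustrative assumptions, not code from the original posts. It shows that the trailing \b after an escaped + only succeeds when a word character follows the +.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the pattern matches the text, 0 otherwise.
sub hits {
    my ($text, $re) = @_;
    return $text =~ $re ? 1 : 0;
}

my $end_b = qr/\bca\+\b/i;   # escaped +, then a trailing \b
my $no_b  = qr/\bca\+/i;     # same pattern without the trailing \b

print hits("this is a ca+",  $end_b), "\n";  # 0: + then end of string, no boundary
print hits("this is a ca+ ", $end_b), "\n";  # 0: + then a space, still no boundary
print hits("this is a ca+s", $end_b), "\n";  # 1: + then a letter, boundary exists
print hits("this is a ca+",  $no_b),  "\n";  # 1: matches once the \b is dropped
```

So the trailing \b here effectively means "must be followed by a letter, digit, or underscore", which is the opposite of what the filter wants.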

csdude55

6:15 pm on Jan 18, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@phranque, my original doesn't use /x, I just did that here for readability :-)

@lucy24, I'm afraid that you did misread it, my regex does have the + escaped:

\bca\+\b

To clarify, I only want it to match when the user input is a literal +. In real world, I'm using this in a profanity filter and I'm trying to prevent people from using a + in place of a t when trying to get around the filter.

I tried to convert all + to the diacritic ï (without the \b), then convert caï\b, then convert all ï back to +, but this doesn't work either; it also fails to match with the \b:

$str = qq~
this is a ca+
~;

$str =~ s/\+/ï/g;

print $str;
# this is a caï

# I tried it with and without \Q..\E
$str =~ s{\b\Qcaï\E\b}
{cat}xgi;

print $str;
# this is a caï

$str =~ s/ï/+/g;

print $str;
# this is a ca+


You can see the exact code working here:

[jdoodle.com...]

lucy24

7:08 pm on Jan 18, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The problem is the sequence \+\b. A word-boundary anchor can only match where one of its neighbors is a word character and the other is not. So in this respect the sequence ï\b should work--but I have met RegEx engines that refused to admit letters-with-diacritics are word characters.

If you can figure out which characters can, or cannot, be allowed to occur after the literal +, then we will be able to arrive at a solution. It might be something as simple as replacing \b with (\s|$).

No two RegEx engines use the identical syntax for \p{Alpha} \p{Punct} and so on (types of character), but that's another option to look at.
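One possible reason the ï attempt failed in Perl, sketched under an assumption the thread does not confirm: that the script has no "use utf8" and handles the text as raw UTF-8 bytes. In that case ï is two bytes, and by default neither byte counts as a word character, so the trailing \b fails just as it did after the +. The variable names are illustrative.

```perl
use strict;
use warnings;

# "ï" as raw UTF-8 input: two bytes, 0xC3 0xAF.
my $bytes = "this is a ca\xC3\xAF";

# The same text decoded into characters: "ï" is now one character, U+00EF.
my $chars = $bytes;
utf8::decode($chars);

# On the byte string neither 0xC3 nor 0xAF counts as \w, so the trailing \b fails.
my $byte_hit = $bytes =~ /\bca\xC3\xAF\b/ ? 1 : 0;

# On the decoded string "ï" is a word character, so the trailing \b succeeds.
my $char_hit = $chars =~ /\bca\x{EF}\b/ ? 1 : 0;

print "bytes: $byte_hit, characters: $char_hit\n";  # bytes: 0, characters: 1
```

If that assumption holds, the placeholder trick would need "use utf8" for literals in the source plus decoding of incoming text, for example with Encode::decode('UTF-8', ...).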

csdude55

7:11 pm on Jan 18, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You and I were on the same wavelength, @lucy24 :-) I literally just tried this:

$str =~ s{
\b
ca\+
(\W)
}
{cat$1}xgi;

I don't think it's exactly the same as using a closing \b, but it works and seems to fit my purposes.
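A short sketch of where the two approaches differ. The sub names are illustrative assumptions, and the lookahead variant is an alternative not proposed in the thread. Capturing (\W) requires an actual character after the +, so it cannot match at the very end of the string, whereas a negative lookahead (?!\w) only checks that no word character follows and consumes nothing.

```perl
use strict;
use warnings;

# Variant from the thread: capture one non-word character and put it back.
sub fix_capture {
    my ($s) = @_;
    $s =~ s{\bca\+(\W)}{cat$1}gi;
    return $s;
}

# Hypothetical alternative: a negative lookahead that consumes nothing.
sub fix_lookahead {
    my ($s) = @_;
    $s =~ s{\bca\+(?!\w)}{cat}gi;
    return $s;
}

print fix_capture("a ca+ sat"),   "\n";  # a cat sat
print fix_capture("a ca+"),       "\n";  # a ca+  (no character after the +)
print fix_lookahead("a ca+ sat"), "\n";  # a cat sat
print fix_lookahead("a ca+"),     "\n";  # a cat
```

The test string in the thread ends with a newline, which is why the (\W) version worked there; input without a trailing newline or punctuation would slip through.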

lucy24

8:41 pm on Jan 18, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Whew. Depending on how evil your users are--or how bad at spelling--you might want to make the pattern
\bca\++\W
which will then also match “ca++”, the stand-in for “catt”. And if they’re really determined, \+ becomes [+†‡] and possibly other characters that the present site won’t support.
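A rough sketch of that idea. The sub name fix_lookalikes is a hypothetical helper, the daggers are written as code point escapes, and the input is assumed to be already decoded to characters. It also uses a lookahead (?!\w) in place of the trailing \W so that a match at the end of the string still counts, which is a substitution on my part rather than the pattern given above.

```perl
use strict;
use warnings;

# Hypothetical helper: treat one or more of +, dagger (U+2020) or
# double dagger (U+2021) after "ca" as a stand-in for "t".
# Assumes $s holds decoded characters, not raw bytes.
sub fix_lookalikes {
    my ($s) = @_;
    $s =~ s{\bca[+\x{2020}\x{2021}]+(?!\w)}{cat}gi;
    return $s;
}

print fix_lookalikes("a ca+ sat"),    "\n";  # a cat sat
print fix_lookalikes("a ca++ sat"),   "\n";  # a cat sat
print fix_lookalikes("a ca\x{2020}"), "\n";  # a cat
```

Each new lookalike would have to be added to the character class by hand, so this only narrows the gap rather than closing it.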

csdude55

9:24 pm on Jan 18, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I seriously have maybe 5 users that seem SUPER offended that I won't let them post profanity, and they truly don't care that children frequent the site or that AdSense punishes me for it. So this is a near constant fight; they write a variation, I filter it, they write a new one.

I already filter most weird characters like † and ‡, it's the ones that aren't weird enough (like $, @, and +) that still give me a problem.

lucy24

10:53 pm on Jan 18, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Come to think of it ... what would happen if you simply disappeared all “weird” characters? Your users can keep saying “ca+” until the cows come home, but the site will never display anything but “ca”. If it's the kind of site where people often have occasion to post arithmetic operations, you could constrain it to those that are immediately adjacent to [:alpha:] or \p{Alpha} or whatever your dialect calls it.
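A minimal sketch of that suggestion, assuming the "not weird enough" set is just +, $ and @ (the three named earlier in the thread) and using a hypothetical sub name strip_symbols. It removes those symbols only where they touch a letter, so standalone arithmetic is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: drop +, $ and @ only where they touch a letter,
# so arithmetic such as "2 + 2" or "1+1" passes through unchanged.
sub strip_symbols {
    my ($s) = @_;
    $s =~ s/(?<=[[:alpha:]])[+\$\@]+//g;   # symbols right after a letter
    $s =~ s/[+\$\@]+(?=[[:alpha:]])//g;    # symbols right before a letter
    return $s;
}

print strip_symbols("this is a ca+"), "\n";  # this is a ca
print strip_symbols("2 + 2 = 4"),     "\n";  # 2 + 2 = 4
print strip_symbols("1+1"),           "\n";  # 1+1
```

As written this would also mangle email addresses and terms like "C++", so a real filter would likely need exceptions for those.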

csdude55

1:14 am on Jan 19, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In this particular case, I don't know if "your mother's a bich" is better or worse :-/

tangor

1:39 am on Jan 19, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



At some point you drop the user abusing the system after countless warnings. Life is too short to keep dealing with pests.

That said, a profanity filter is a good thing for the most part!

csdude55

5:30 am on Jan 19, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



At some point you drop the user abusing the system after countless warnings. Life is too short to keep dealing with pests.

Easier said than done :-/ Thanks to Incognito, VPNs, ad blockers, etc, I have no idea how to block users anymore. I can remove the username, sure, but there's no way to prevent them from creating a new one.

If you want to PM me with any ideas on how to block them, I'm all ears! LOL You obviously can't post it here, though, because then it would be found by Google and the bad guys would know.

lucy24

5:57 am on Jan 19, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I can remove the username, sure, but
Ooh, ooh, I know this one.

User gets banned; next time they try to log on they see a popup message telling them they're no longer welcome (at least, that's what php/bb had); user sends a plaintive email to the Forum administrator lamenting that for some reason they can't seem to log on and they have no idea what the problem is.

tangor

6:56 pm on Jan 19, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Easier said than done :-/


It is easy. Delete the user. If they come back under another name you deal with their ACTIONS, not who they are.

Make sure your profanity rule is OBVIOUS to all posters.

You don't CARE what name they use, only what they do on your site.

If they DO come back under a different name, and behave themselves, then you've lost nothing, retained a user, and kept the peace (and quiet!).

csdude55

12:09 am on Jan 20, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You just don't know, man.

I'm a little bit of a local celebrity. Not like weatherman-level-celebrity, but it's more like... I went to a restaurant a couple of weeks ago and was chatting with the waiter, and he asked what I do for a living. I mentioned it and he kind of took a step back, had this look of awe on his face, and said, "oh wow, so you're like REALLY important! I had no idea who I was talking to! Haha!"

So the people that are causing problems are doing it BECAUSE they want to hurt me. They want to "prove" to everyone that they can outsmart me and break my rules with impunity. So what happens when I remove their username is, they come back the next day and go all scorched-earth; I wake up to discover that I have to cancel all of my appointments and spend the rest of the day cleaning up their mess.

This has happened... I guess 45 times over the last 22 years?

lucy24

12:56 am on Jan 20, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks to Incognito, VPNs, ad blockers, etc, I have no idea how to block users anymore.
But wait. We’re talking about half-a-dozen or so specific individuals. Are all of them coming in via some kind of camouflage? And are they using the same anonymizers as other, unrelated, unoffending users? * Seems like this would be something that is in your power to investigate more closely. And it might end up taking no more time and grief than tearing your hair out to arrive at the perfect word filter.

As an alternative to outright banning, it might be worth flagging all posts from this small group of users with a generic “being held for moderation”. No rush on your end; so what if it takes a day or so before you feel like sitting down with this batch of potentially offending posts.

* I once had to un-block a particular proxy because it turned out that all County employees--including those that could be expected to visit one site during business hours--went through this proxy.

csdude55

6:02 am on Jan 20, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



But wait. We’re talking about half-a-dozen or so specific individuals. Are all of them coming in via some kind of camouflage?

Yes and no. Sometimes they start out trying to hide, sometimes not. Sometimes it's someone that I blocked in the past that's trying to fly under the radar but eventually acts out, sometimes it's a kid in high school.

And are they using the same anonymizers as other, unrelated, unoffending users?

That brings me to another fun little issue; the local cell phone stores, computer repair shops, and cell phone repair shops all install ad blockers and VPNs on all devices! I knew that about the computer repair shops, but only recently discovered that the cell phone places do it, too.

About 70% of my traffic uses ad blockers, but when I posted an alert asking them to please remove it, I had a TON of emails from people saying that the system was wrong and they didn't have an ad blocker. But it turned out that they DID, they just didn't know it!

But this also means that there are just too many variables to find a pattern. And sometimes, the bad hombre may not even KNOW that they're camouflaging.

Believe me, this is something I've been fighting since around 2016.

The guy that's giving me stress right now about profanity? I see that he logged in 10 times today, each time with a different IP and a different user agent. 4 of those times were with an IP that's from a common local provider, the other 6 are less common and are probably with a VPN. Or he could just be on a mobile device and pinging different Wi-Fi networks.

The "best" way I've found to find the bad guys is to filter certain words and phrases on new users (a new user that makes a post with my real name is a big flag). But the ones that try to come in and fly under the radar get around that one without even meaning to!

I once had to un-block a particular proxy because it turned out that all County employees--including those that could be expected to visit one site during business hours--went through this proxy.

Haha, back in the day I blocked a scammer using an AOL IP address, only to find out that ALL of my users on AOL had that same IP! LOL I woke up to hundreds of emails, because I'd inadvertently blocked about 10,000 people :-O