Forum Moderators: phranque

More fun with regex, finding the matching ( ) or [ ] in a pattern

         

csdude55

8:52 pm on Oct 30, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have this string:

$string = '(?:

# word one
mï0ï
r

# or
|

# word two
mï0ï
(?:
dï0ï
)+
rï0ï
)+

# word three
(?:
a |
b
)+';


and I need to move that last ï0ï so that it'll match whether it's following word one OR word two. Like so:

$string = '(?:
(?:
# word one
mï0ï
r

# or
|

# word two
mï0ï
(?:
dï0ï
)+
r
)
ï0ï
)+

# word three
(?:
a |
b
)+';


The problem I'm having is finding the ending )+ that matches the first (. This matches the )+ after dï0ï:

\Q(?:\E[^)]+ï0ï\)

and if I go greedy then it'll match the )+ after word three:

\Q(?:\E.+?\)ï0ï\)

Can you suggest a way to match the proper closing parenthesis or bracket?

csdude55

12:15 am on Oct 31, 2021 (gmt 0)



Well, I'm getting close. This is in Perl, but the language doesn't really matter:

@more =
$pattern =~ m{(
\Q(?:\E
(?:
(?>
[^()]+
)
|
(?1)
)*
\Q)\E
)}xsg;


This returns:

$more[0] => (?:mï0ïr|mï0ï(?:dï0ï)+rï0ï)
$more[1] => (?:a|b)

So it finds the matching ). But in this case I need to only find it when the matching ) is preceded by ï0ï, and this returns nothing:

@more =
$pattern =~ m{(
\Q(?:\E
(?:
(?>
[^()]+
)
|
(?1)
)*
\Qï0ï)\E
)}xsg;


Any thoughts on how to modify it to make it match what I'm needing?
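A likely explanation for the empty result: the atomic group (?>[^()]+) swallows the trailing ï0ï along with everything else that isn't a parenthesis, and an atomic group never hands characters back, so the literal ï0ï) that follows has nothing left to match. One possible workaround, sketched below, is to leave the balanced-group logic alone and check what precedes the closing parenthesis with a lookbehind. The helper name "group" is invented for the illustration, and the sketch assumes the script is saved as UTF-8 without "use utf8", so ï is compared byte for byte.

```perl
use strict;
use warnings;

# The simplified pattern from the first post, with whitespace and comments removed.
# Assumption: this file is UTF-8 and "use utf8" is NOT in effect.
my $pattern = '(?:mï0ïr|mï0ï(?:dï0ï)+rï0ï)+(?:a|b)+';

my $re = qr{
    (                                   # group 1: the group we want to find
        \(\?:                           # literal "(?:"
        (?: (?>[^()]+) | (?&group) )*   # plain text or a nested balanced group
        (?<=ï0ï) \)                     # closing ")" preceded by the placeholder
    )
    (?(DEFINE)
        # any balanced ( ... ) group, with no condition on how it ends
        (?<group> \( (?: (?>[^()]+) | (?&group) )* \) )
    )
}x;

my @more;
while ($pattern =~ /$re/g) {
    push @more, $1;
}
print "$_\n" for @more;
# (?:mï0ïr|mï0ï(?:dï0ï)+rï0ï)
```

Keeping the nested groups in a separate DEFINE block means an inner group does not also have to end in ï0ï for the outer one to be found.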

csdude55

6:07 am on Oct 31, 2021 (gmt 0)



Sorry for 3 in a row :-O But I've learned some stuff and wanted to update.

I figured out that the magic-looking (?>...) syntax is "atomic grouping":

[regular-expressions.info...]

If I understand correctly, atomic groups will exit after the first match without trying to move on to the second match, so all it REALLY does here is make things a tad faster.
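A small illustration of what the atomic group changes, in case it helps: once (?>...) has matched, the engine will not go back inside it to try a shorter match. That is what prevents runaway backtracking, but it can also make an otherwise valid match fail. The test string is made up for the demonstration.

```perl
use strict;
use warnings;

my $text = 'aaab';

# Ordinary group: a+ first grabs "aaa", then gives one "a" back so "ab" can match.
my $normal = $text =~ /(?:a+)ab/ ? 1 : 0;

# Atomic group: a+ grabs "aaa" and keeps it, so there is no "a" left for "ab".
my $atomic = $text =~ /(?>a+)ab/ ? 1 : 0;

print "normal: $normal, atomic: $atomic\n";   # normal: 1, atomic: 0
```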

So the real magic trick is the | (?1), which I don't understand at all. The description I found says that it "recurses to bracket 1 and tries again", but I have no clue where "bracket 1" is or how to define it.
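For what it's worth, "bracket 1" is simply the first capturing group, counted by opening parenthesis from the left: the same group that fills $1. (?1) re-runs that group's sub-pattern at the current position, which is what lets the pattern descend into nested parentheses. A stripped-down sketch with an invented test string:

```perl
use strict;
use warnings;

my $balanced = qr{
    (                   # capture group 1 starts here and is what (?1) refers to
        \(              # an opening parenthesis
        (?:
            [^()]+      # text that contains no parentheses
          |
            (?1)        # or a complete nested group, matched by re-running group 1
        )*
        \)              # the closing parenthesis that pairs with the opener
    )
}x;

my $text = 'x (a (b) (c (d))) y (e';
my @groups;
while ($text =~ /$balanced/g) {
    push @groups, $1;
}
print "$_\n" for @groups;   # (a (b) (c (d)))
```

The unclosed "(e" at the end is not reported, because no closing parenthesis ever pairs with it.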

Either way, I tried to use this to figure out how to use negative lookahead instead of [^...], but I'm not sure that it's going to work. Simply changing [^()]+ to (?!\(|\))+ gave no results at all:

@more =
$pattern =~ m{(
\Q(?:\E
(?:
(?!
\( |
\)
)+
|
(?1)
)*
\Q)\E
)}xg;


I tried using .* before and after \( and \) in every variation I could think of, but no go.
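The lookahead version most likely fails because (?!...) is zero-width: it only checks that the next character isn't a parenthesis and consumes nothing, so repeating it with + never moves forward. Pairing each lookahead with a . that actually consumes a character gives an equivalent of [^()]+. A sketch, using the compacted string from the earlier output:

```perl
use strict;
use warnings;

my $pattern = '(?:mï0ïr|mï0ï(?:dï0ï)+rï0ï)+(?:a|b)+';

my @more;
while ($pattern =~ m{(
        \(\?:
        (?:
            (?> (?: (?! [()] ) . )+ )   # one or more characters, none of them a parenthesis
          |
            (?1)
        )*
        \)
    )}xsg) {
    push @more, $1;
}
print "$_\n" for @more;
# (?:mï0ïr|mï0ï(?:dï0ï)+rï0ï)
# (?:a|b)
```

For a single-character exclusion like this the negated class is shorter and probably faster; the lookahead form only pays off when the thing to exclude is longer than one character.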

lucy24

5:58 pm on Oct 31, 2021 (gmt 0)



I tried using .* before and after \( and \) in every variation
Well, that’s definitely something to avoid if at all possible, since .* can represent absolutely anything, including the string you’re trying to match.

The description I found says that it "recurses to bracket 1 and tries again", but I have no clue where "bracket 1" is or how to define it.
Can you remember where you found this? It doesn't seem to be on the linked regex dot info page. Lacking context, I would assume “bracket 1” corresponds to whatever \1 (or $1) would be if you were capturing, typically the first open-parenthesis from the left.

This savings will be vital when your alternatives contain repeated tokens (not to mention repeated groups) that lead to catastrophic backtracking.
I like this language. It’s like when Apache docs say “unintended consequences” and you just know it really means the world as we know it will come to a crashing halt.

In the meantime, can we--haha--backtrack? What are some example strings that you’re trying to match or not match? Looking only at the RegEx, I'm getting held up on things like, why isn't
(?:mï0ïr|mï0ï(?:dï0ï)+r)
simply
(?:mï0ï(?:dï0ï)*r)

I'm more likely to make sense of it if I can start from scratch. (Oddly, this is a reply that seems to come up more often in the Apache subforum: “Please go back and explain in English what you're trying to do, and then we can get into the sample rules.”)

csdude55

7:43 pm on Oct 31, 2021 (gmt 0)



I got it to work, I think that the key was putting the recursive part in its own ( ):

@more =
$pattern =~ m{(
\Q(?:\E
(
(?:
(?>[^()]+)
|
(?1)
)*
)
ï0ï\)
)}xg;


It's definitely slower, though! Before using this expression in my profanity filter, 1000 iterations took 1.5442s; with it, I'm at 1.9403s.

Using (?>...) was critical, too. On small tests it worked the same, but with a bigger test it would time out. There was no way to get a speed test on it.

Can you remember where you found this? It doesn't seem to be on the linked regex dot info page. Lacking context, I would assume “bracket 1” corresponds to whatever \1 (or $1) would be if you were capturing, typically the first open-parenthesis from the left.

I took a minute to find it, I'd saved the code in my notes without the link :-( But here it is:

[perldoc.perl.org...]

/
^ # start of line
( # start capture buffer 1
< # match an opening angle bracket
(?: # match one of:
(?> # don't backtrack over the inside of this group
[^<>]+ # one or more non angle brackets
) # end non backtracking group
| # ... or ...
(?1) # recurse to bracket 1 and try it again
)* # 0 or more times.
> # match a closing angle bracket
) # end capture buffer one
$ # end of line
/x


As I looked for the link I stumbled across this, too:

[perldoc.perl.org...]

It describes (?2) the same way that you described it, though, so you appear to be right :-)

Based on that, in my code I thought that I could use (?2) instead of (?1) and save a few ms, but that created an infinite loop. I'm not entirely sure why, but I wanted to note it here for future readers.

What are some example strings that you’re trying to match or not match? Looking only at the RegEx, I'm getting held up on things like, why isn't
(?:mï0ïr|mï0ï(?:dï0ï)+r)
simply
(?:mï0ï(?:dï0ï)*r)

Haha, I did my best to make it simple enough to read but still functional! LOL The actual string I was testing with was:

(?:mï0ïï3ïï0ïï4ïï0ïhï0ïï5ïï0ïr|mï0ïï6ïï0ï(?:dï0ï)+ï5ïï0ïrï0ï)?(?:f|pï0ïhï0ï)(?:ï1ïï0ï)+(?:(?:cï0ï)?k|qï0ï)+(?:ï6ïï0ïï5ï|r|ï4ïï0ïï7ïï0ïrï0ïd|cï0ïn|jï0ïï3ïï0ïï5ï|ï2ïï0ï(?:ï8ï|ï6ïï0ïrï0ï)dï0ïï5ïï0ïn)*

The full script gets super complicated, I'm using ï\d+ï as a placeholder for anything in [...] or (?!...).

Here's my test script for this section in its entirety if you want to copy and paste it anywhere:

$pattern = '(?:mï0ïï3ïï0ïï4ïï0ïhï0ïï5ïï0ïr|mï0ïï6ïï0ï(?:dï0ï)+ï5ïï0ïrï0ï)?(?:f|pï0ïhï0ï)(?:ï1ïï0ï)+(?:(?:cï0ï)?k|qï0ï)+(?:ï6ïï0ïï5ï|r|ï4ïï0ïï7ïï0ïrï0ïd|cï0ïn|jï0ïï3ïï0ïï5ï|ï2ïï0ï(?:ï8ï|ï6ïï0ïrï0ï)dï0ïï5ïï0ïn)*';

$pattern =~ s{(
\Q(?:\E
(
(?:
[^()]+
|
(?1)
)*
)
ï0ï\)
)}
{(?:(?:$2)ï0ï)}x;

print $pattern;


Since it's working right now, though, I'm going to spend a few minutes (give or take) tweaking everything, and eventually I'll create a thread in the Perl section with the complete profanity filter. It's WAY more complicated than I intended, and really a lot slower than I wanted, so it's probably going to be an ongoing work in progress.

[edited by: phranque at 7:56 pm (utc) on Oct 31, 2021]
[edit reason] disable graphic smile faces [/edit]

lucy24

9:00 pm on Oct 31, 2021 (gmt 0)



Holy ###. What do all those ï (diacritic which doesn’t occur in most languages I can think of, and is exceedingly rare in the rest, like “oï” indicating two syllables rather than the usual diphthong) represent in real life? It all makes me think of when OCR encounters an unexpected “ü” and, not knowing we’re not in English any more, bravely renders it as “ii”.

Or are we delving into cuss words I have never encountered and may be too old to learn?

csdude55

9:41 pm on Oct 31, 2021 (gmt 0)



Wait... I spoke too soon. I would have bet money that it was working properly last night, but now it's not :-/

This works as expected:

$pattern = 'ï3ï+(?:ï9ï|ï10ï(?:ï7ï)?ï7ï|hï6ï)?ï8ï|(?:ï7ï)?';

# should return anything that starts with (?: and ends with a matching )?
@more =
$pattern =~ m{(
\Q(?:\E
(?:
(?>[^()]+) |
(?1)
)*
\)\?
)}xsg;

for (@more) {
print "$_\n";
}

# Returns:
# (?:ï9ï|ï10ï(?:ï7ï)?ï7ï|hï6ï)?
# (?:ï7ï)?


But this returns nothing:

$pattern = 'ï3ï+(?:ï9ï|ï10ï(?:ï7ï)?ï7ï|hï6ï)?ï8ï|(?:ï7ï)?';

# should return anything that starts with (?: and ends with a matching ï)?
@more =
$pattern =~ m{(
\Q(?:\E
(?:
(?>[^()]+) |
(?1)
)*
ï\)
)}xsg;

for (@more) {
print "$_\n";
}


I'm expecting it to return the same thing, since the results in the first one all end with ï). I'm at a complete loss as to why one works but the other doesn't :-/


Holy ###. What do all those ï (diacritic which doesn’t occur in most languages I can think of, and is exceedingly rare in the rest, like “oï” indicating two syllables rather than the usual diphthong) represent in real life?

Oh, Lucy... you just have no idea how complicated my script has gotten! LOL It got to a point where I would run one expression to modify the original, but then a second expression would run over the things that I'd added in the original and mess things up.

To get around this, first I convert all ï to i in the original text, then I use ï[0-9]+ï in place of the text that I want to protect from future expressions (with the original stored in a corresponding $hash[[0-9]+]). Then after it's all done, I convert the ï[0-9]+ï back to the original.

The ï0ï (the first one in the array) represents:

(?:\W(?!\s\b)|<.+?>)*

I shove this between each character to match f-o-o, f<br>o<br>o, etc. Since I would later convert all "s" to "[s\$z]" (to match "thi$" the same as "this") I didn't want the expression to change that to (?:\W(?!\[s\$z]\b)|<.+?>)* so I used a safe alternate.
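The protect-and-restore idea described above might look roughly like this. It is a minimal sketch, not the actual filter: protect(), restore() and @saved are invented names, an array stands in for the hash, only [...] classes are protected, and classes containing a nested or escaped ] are not handled. It also assumes the file is UTF-8 without "use utf8".

```perl
use strict;
use warnings;

my @saved;   # originals, indexed by the number inside the placeholder

# Swap every [...] class for a numbered placeholder (hypothetical helper).
sub protect {
    my ($text) = @_;
    $text =~ s/ï/i/g;    # make sure the marker cannot already occur in the text
    $text =~ s{(\[[^\]]*\])}{ push @saved, $1; 'ï' . $#saved . 'ï' }ge;
    return $text;
}

# Put the stored originals back (hypothetical helper).
sub restore {
    my ($text) = @_;
    $text =~ s/ï(\d+)ï/$saved[$1]/g;
    return $text;
}

my $protected = protect('as[s5]');                 # asï0ï
(my $widened = $protected) =~ s/s/[s\\\$z]/g;      # only the bare "s" is rewritten
my $final = restore($widened);

print "$final\n";                                  # a[s\$z][s5]
```

The "s" inside [s5] survives because it is hidden behind ï0ï while the s-to-[s\$z] rewrite runs.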

[edited by: phranque at 5:34 am (utc) on Nov 1, 2021]
[edit reason] disable graphic smile faces [/edit]

csdude55

10:03 pm on Oct 31, 2021 (gmt 0)



In retrospect, I said that this one is working correctly but... not exactly:

$pattern = 'ï3ï+(?:ï9ï|ï10ï(?:ï7ï)?ï7ï|hï6ï)?ï8ï|(?:ï7ï)?';

# should return anything that starts with (?: and ends with a matching )?
@more =
$pattern =~ m{(
\Q(?:\E
(?:
(?>[^()]+) |
(?1)
)*
\)\?
)}xsg;

for (@more) {
print "$_\n";
}

# Returns:
# (?:ï9ï|ï10ï(?:ï7ï)?ï7ï|hï6ï)?
# (?:ï7ï)?


It returns the very last group of (?:ï7ï)?, but NOT the one that's nested inside of (?:ï9ï|ï10ï(?:ï7ï)?ï7ï|hï6ï)?
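The nested group is probably skipped because a /g match resumes where the previous match ended, and the outer group has already consumed the inner one. One way to also collect nested groups is to wrap the whole expression in a lookahead, so that each match is zero-width and the scan restarts one character further along. A sketch under that assumption:

```perl
use strict;
use warnings;

my $pattern = 'ï3ï+(?:ï9ï|ï10ï(?:ï7ï)?ï7ï|hï6ï)?ï8ï|(?:ï7ï)?';

my @more;
while ($pattern =~ m{
        (?=                         # look ahead without consuming anything
            (
                \(\?:
                (?: (?>[^()]+) | (?1) )*
                \)\?
            )
        )
    }xsg) {
    push @more, $1;
}
print "$_\n" for @more;
# (?:ï9ï|ï10ï(?:ï7ï)?ï7ï|hï6ï)?
# (?:ï7ï)?
# (?:ï7ï)?
```

The second line of output is the nested group and the third is the trailing one; they only look identical because the test string repeats the same text.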

Sheesh. Maybe I'm going down the wrong road here...

csdude55

10:38 pm on Oct 31, 2021 (gmt 0)



I'm gonna back up a few feet, please don't waste unnecessary time on this one as I don't think it's the right solution for my problem, after all. I appreciate the help, Lucy!

lucy24

2:30 am on Nov 1, 2021 (gmt 0)



But wait ...
ï3ï+(?:ï9ï|ï10ï
(et cetera). That part would never match, because ï+ has already gobbled up all the ï so there would be nothing left for the ï lookahead.

And this
(?:\W(?!\s\b)
(et cetera). Since \W and \s are by definition non-word characters, the form \s\b would have no meaning.

In some situations you might find it useful to change original - hyphen either to _ lowline (a word character, so things like “###-face” would be read as one word) or to, say, ~ tilde. That way you no longer have to worry about special meanings of - in RegEx brackets.

Final, disheartening thought: The time you spend devising all these tests and making everything foolproof . . . will be easily matched by a small subset of forum users moving heaven and earth to devise methods of bypassing the filters.

don't waste unnecessary time
Time spent playing with Regular Expressions is never wasted ;)

csdude55

3:54 am on Nov 1, 2021 (gmt 0)



That part would never match, because ï+ has already gobbled up all the ï so there would be nothing left for the ï lookahead.

At this point, $pattern is just a string that I'm manipulating. When I use it as a regex later, ï3ï+ will be converted to something like [a@]+.

And this
(?:\W(?!\s\b)
(et cetera). Since \W and \s are by definition non-word characters, the form \s\b would have no meaning.

Makes sense. My original was simply:

(?:\W|<.+?>)*

but then it was catching the last whitespace at the end of the word when I didn't want it to. My (?!\s\b) was an attempt to stop that.

This comes closer:

(?:[^\w\s]|<.+?>)*

but then if someone uses a whitespace to get around it (eg, "f o o") then it doesn't match.

Any suggestions on how to modify it so that it'll match "foo", "f*o*o", "f<br>o<br>o", "f o o", etc, but not match any trailing whitespace or punctuation?

This is the basis of the other thread I created, trying to fix this short expression to match everything like I'm expecting. If I can get it to work properly then I can eliminate another regex AND I won't need to worry about recursion :-)

Final, disheartening thought: The time you spend devising all these tests and making everything foolproof . . . will be easily matched by a small subset of forum users moving heaven and earth to devise methods of bypassing the filters.

Gah, tell me about it! LOL Worse, when they DO slip through, somehow everyone blames ME for it! It's a never ending battle :-/

[edited by: phranque at 5:35 am (utc) on Nov 1, 2021]
[edit reason] disable graphic smile faces [/edit]

csdude55

4:23 am on Nov 1, 2021 (gmt 0)



Time spent playing with Regular Expressions is never wasted ;)

Ya know, I really have learned a lot about regex in the last 3 weeks of working on this script! It is kinda fun, but I'll appreciate it a lot more when it's actually working right! LOL

I'm sure you know this, but you (Lucy) are the resident regex pro around here :-D When I post a regex question there's no doubt that you'll be the first (probably only) one to reply. I remember a while back I asked a regex question, and the response was "just wait for Lucy to get here" LOL

lucy24

6:16 pm on Nov 1, 2021 (gmt 0)



Tee hee.

<.+?>
It might be preferable to say
<[^>]+>
But either way you have to consider people saying, er, &lt; (form I actually use in Disqus-based forums because they’re coded to auto-convert anything in <angle brackets> whether it’s an attested html tag or not).
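To show why the negated class may be the safer choice: a lazy .+? stops at the first > only if the rest of the pattern can match from there. Otherwise it keeps expanding and can run across several tags. The sample string and the trailing "y" are invented for the demonstration.

```perl
use strict;
use warnings;

my $text = '<b>x</b>y';

# Lazy dot: "y" does not follow the first ">", so the dot keeps expanding past it.
my ($lazy)    = $text =~ /(<.+?>y)/;      # <b>x</b>y

# Negated class: the tag can never extend beyond its own ">".
my ($negated) = $text =~ /(<[^>]+>y)/;    # </b>y

print "lazy: $lazy\nnegated: $negated\n";
```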

but not match any trailing whitespace or punctuation
I don’t know if it would work to do it as packages instead. If your set of all possible intervening characters is \q--locution invented at random to represent
([\W_]|<[^<>]+>)
--then you’re looking at
\w(\q\w)+
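A rough sketch of that "packages" idea, where the separator only ever sits between two letters and so can never swallow trailing whitespace or punctuation. spaced_pattern() is a hypothetical helper, and the separator is allowed to repeat zero or more times, which is an assumption rather than part of the suggestion above.

```perl
use strict;
use warnings;

# Zero or more separators: a non-word character, an underscore, or a tag.
my $sep = qr{(?:[\W_]|<[^<>]+>)*};

# Hypothetical helper: put the separator between letters only, never after the last one.
sub spaced_pattern {
    my ($word) = @_;
    my $body = join $sep, map { quotemeta($_) } split //, $word;
    return qr/($body)/i;
}

my $re = spaced_pattern('foo');
my @hits;
for my $text ('foo', 'f*o*o!', 'f<br>o<br>o', 'f o o and more') {
    my ($hit) = $text =~ $re;
    push @hits, $hit;
    print "$text => $hit\n";
}
# foo => foo
# f*o*o! => f*o*o
# f<br>o<br>o => f<br>o<br>o
# f o o and more => f o o
```

This sketch makes no attempt at word boundaries, so it would also fire on something like "if o only".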

How long is your list of Bad Words? I mean the underlying words, not their disguises. I suppose it’s more than George Carlin’s seven (one of which you can now say on ordinary broadcast TV).

I think it may become necessary to strip away any and all <blahblah> first, because otherwise the “blahblah”--which most often will consist of nothing but word characters--will merge into the surrounding word:
foo<i>bar</i>
>>
fooibari
creating a false negative. (Which, of course, is one way to sneak in Language on the present site, except we’re all grownups and would never dream of doing such a thing.) Or else \W has to be replaced with [^\w<>].

csdude55

9:08 pm on Nov 1, 2021 (gmt 0)



How long is your list of Bad Words? I mean the underlying words, not their disguises.

My current list has 44 words, not including, for example:

(?:mother|mudder)...

or
...(?:hole|clown)

I've also included racist terms, vulgar words and phrases, some non-English words and phrases, and a very few other words and phrases that needed to be removed for some other reason (like, when someone tries to libel a specific person and I need to filter that person's name).

The final pattern that I'm building is currently about 10,000 characters long :-O

I think it may become necessary to strip away any and all <blahblah> first, because otherwise the “blahblah”--which most often will consist of nothing but word characters--will merge into the surrounding word:
foo<i>bar</i>
>>
fooibari
creating a false negative.

Do you mean, remove all HTML tags from the string, then delete the (<.+?>) from the pattern? I can do that; I remove [...] and (?!...) from the pattern and plug them back in later using the ï\d+ï placeholders, so I could apply the same principle to the string.

[edited by: phranque at 4:44 am (utc) on Nov 2, 2021]
[edit reason] disable graphic smile faces [/edit]

lucy24

10:29 pm on Nov 1, 2021 (gmt 0)



Assuming the site uses html-style markup with angle < > brackets, that’s what you would strip out. If it uses php/bb-style markup with square [ ] brackets like the present site, you'd strip out those instead. It's the difference between

foo<b>ar
and
foo[b]ar

(and yes, I had to sneak in an extra pair of format tags to keep the site from yapping at me about [ b ] in the second version!)

Pattern:
</?\w+>
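Applied as a pre-processing step, that pattern could be used like this. It is a minimal sketch with an invented sample string; tags that carry attributes, such as an anchor with an href, would not be matched by this simple form.

```perl
use strict;
use warnings;

my $text = 'f<b>o</b><i>o</i> bar';

# Remove simple opening and closing tags before the filter sees the text.
(my $clean = $text) =~ s{</?\w+>}{}g;

print "$clean\n";   # foo bar
```

With the tags gone, the separator in the main pattern no longer needs its <...> alternative.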