More fun with regex, finding the matching ( ) or [ ] in a pattern

Forum Moderators: phranque

csdude55

8:52 pm on Oct 30, 2021 (gmt 0)

I have this string:

$string = '(?:

 # word one
 mī0ī
 r

 # or
 |

 # word two
 mī0ī
 (?:
  dī0ī
 )+
 rī0ī
)+

# word three
(?:
 a |
 b
)+';

and I need to move that last ī0ī so that it'll match whether it's following word one OR word two. Like so:

$string = '(?:
 (?:
  # word one
  mī0ī
  r

  # or
  |

  # word two
  mī0ī
  (?:
   dī0ī
  )+
  r
 )
 ī0ī
)+

# word three
(?:
 a |
 b
)+';

The problem I'm having is finding the ending )+ that matches the first (. This matches the )+ after dī0ī:

\Q(?:\E[^)]+ī0ī\)

and if I go greedy then it'll match the )+ after word three:

\Q(?:\E.+?\)ī0ī\)

Can you suggest a way to match the proper closing parenthesis or bracket?

csdude55

12:15 am on Oct 31, 2021 (gmt 0)

Well, I'm getting close. This is in Perl, but the language doesn't really matter:

@more =
  $pattern =~ m{(
    \Q(?:\E
    (?:
      (?>
        [^()]+
      )
      |
      (?1)
    )*
    \Q)\E
  )}xsg;

This returns:

$more[0] => (?:mī0īr|mī0ī(?:dī0ī)+rī0ī)
$more[1] => (?:a|b)

So it finds the matching ). But in this case I need to only find it when the matching ) is preceded by ī0ī, and this returns nothing:

@more =
  $pattern =~ m{(
    \Q(?:\E
    (?:
      (?>
        [^()]+
      )
      |
      (?1)
    )*
    \Qī0ī)\E
  )}xsg;

Any thoughts on how to modify it to make it match what I'm needing?

csdude55

6:07 am on Oct 31, 2021 (gmt 0)

Sorry for 3 in a row :-O But I've learned some stuff and wanted to update.

I figured out that the magic-looking (?>...) syntax is "atomic grouping":

[regular-expressions.info...]

If I understand correctly, atomic groups will exit after the first match without trying to move on to the second match, so all it REALLY does here is make things a tad faster.
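A minimal sketch of what atomic grouping changes, using a made-up string rather than the filter pattern. Besides speed, an atomic group can also change whether a pattern matches at all, because whatever it has consumed is never handed back to the rest of the pattern:

```perl
use strict;
use warnings;

my $text = 'aaa';

# Ordinary group: a+ first takes all three a's, then gives one back
# so that the final "a" can still match.
my $plain = $text =~ /^(?:a+)a$/ ? 'match' : 'no match';

# Atomic group: once a+ has taken all three a's it keeps them,
# so the final "a" has nothing left to match.
my $atomic = $text =~ /^(?>a+)a$/ ? 'match' : 'no match';

print "plain:  $plain\n";     # plain:  match
print "atomic: $atomic\n";    # atomic: no match
```

The same effect matters whenever something that the atomic group could also match is required right after it.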

So the real magic trick is the | (?1), which I don't understand at all. The description I found says that it "recurses to bracket 1 and tries again", but I have no clue where "bracket 1" is or how to define it.

Either way, I tried to use this to figure out how to use negative lookahead instead of [^...], but I'm not sure that it's going to work. Simply changing [^()]+ to (?!\(|\))+ gave no results at all:

@more =
 $pattern =~ m{(
 \Q(?:\E
 (?:
  (?!
  \( |
  \)
  )+
  |
  (?1)
 )*
 \Q)\E
 )}xg;

I tried using .* before and after \( and \) in every variation I could think of, but no go.
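One likely reason the lookahead version finds nothing: a lookahead is zero-width, so repeating it consumes no characters. A common workaround is to pair the lookahead with a single dot, so each character is checked and then consumed. A small sketch on a made-up string (plain parentheses rather than the (?: groups above):

```perl
use strict;
use warnings;

my $text = 'x(a(b)c)y';

my @groups = $text =~ m{(
    \(
    (?:
        (?> (?: (?![()]) . )+ )   # one character at a time, as long as it is not a paren
        |
        (?1)                      # or a nested pair of parentheses
    )*
    \)
)}xsg;

print "$_\n" for @groups;
# (a(b)c)
```

For a single-character exclusion like this, [^()] does the same job more simply; the lookahead form mainly pays off when the thing to exclude is longer than one character.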

lucy24

5:58 pm on Oct 31, 2021 (gmt 0)

I tried using .* before and after \( and \) in every variation

Well, that's definitely something to avoid if at all possible, since .* can represent absolutely anything, including the string you're trying to match.

The description I found says that it "recurses to bracket 1 and tries again", but I have no clue where "bracket 1" is or how to define it.

Can you remember where you found this? It doesn't seem to be on the linked regex dot info page. Lacking context, I would assume bracket 1 corresponds to whatever \1 (or $1) would be if you were capturing, typically the first open-parenthesis from the left.
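A small sketch of that reading of "bracket 1", on a made-up string: (?1) reruns the pattern of capture group 1 at the current position, and group 1 is simply the first capturing ( from the left of the regex.

```perl
use strict;
use warnings;

my $text = 'x(a(b)c)y(d)';

my @groups = $text =~ m{
    (                   # start of capture group 1
        \(              # a literal opening parenthesis
        (?:
            (?>[^()]+)  # a run of non-parenthesis characters
            |
            (?1)        # or a nested pair: run the pattern of group 1 again here
        )*
        \)              # the matching closing parenthesis
    )                   # end of capture group 1
}xg;

print "$_\n" for @groups;
# (a(b)c)
# (d)
```

The inner (b) is not listed separately because it was consumed as part of the outer match.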

This savings will be vital when your alternatives contain repeated tokens (not to mention repeated groups) that lead to catastrophic backtracking.

I like this language. It's like when Apache docs say "unintended consequences" and you just know it really means "the world as we know it will come to a crashing halt".

In the meantime, can we--haha--backtrack? What are some example strings that you're trying to match or not match? Looking only at the RegEx, I'm getting held up on things like, why isn't
(?:mī0īr|mī0ī(?:dī0ī)+r)
simply
(?:mī0ī(?:dī0ī)*r)

I'm more likely to make sense of it if I can start from scratch. (Oddly, this is a reply that seems to come up more often in the Apache subforum: Please go back and explain in English what you're trying to do, and then we can get into the sample rules.)

csdude55

7:43 pm on Oct 31, 2021 (gmt 0)

I got it to work, I think that the key was putting the recursive part in its own ( ):

@more =
 $pattern =~ m{(
 \Q(?:\E
 (
  (?:
   (?>[^()]+)
   |
   (?1)
  )*
 )
 ī0ī\)
)}xg;

It's definitely slower, though! Before using this expression in my profanity filter, 1000 iterations was 1.5442s; with it, I'm at 1.9403.

Using (?>...) was critical, too. On small tests it worked the same, but with a bigger test it would time out. There was no way to get a speed test on it.

Can you remember where you found this? It doesn't seem to be on the linked regex dot info page. Lacking context, I would assume bracket 1 corresponds to whatever \1 (or $1) would be if you were capturing, typically the first open-parenthesis from the left.

I took a minute to find it, I'd saved the code in my notes without the link :-( But here it is:

[perldoc.perl.org...]

/
 ^      # start of line
 (      # start capture buffer 1
 <     # match an opening angle bracket
 (?:     # match one of:
  (?>    #  don't backtrack over the inside of this group
   [^<>]+  #  one or more non angle brackets
  )    #  end non backtracking group
 |     #  ... or ...
  (?1)   #  recurse to bracket 1 and try it again
 )*     # 0 or more times.
 >     # match a closing angle bracket
 )      # end capture buffer one
 $      # end of line
/x

As I looked for the link I stumbled across this, too:

[perldoc.perl.org...]

It describes (?2) the same that you described it, though, so you appear to be right :-)

Based on that, in my code I thought that I could use (?2) instead of (?1) and save a few ms but that created an infinite loop. I'm not entirely sure why, but I wanted to note it here for future readers.

What are some example strings that you're trying to match or not match? Looking only at the RegEx, I'm getting held up on things like, why isn't
(?:mī0īr|mī0ī(?:dī0ī)+r)
simply
(?:mī0ī(?:dī0ī)*r)

Haha, I did my best to make it simple enough to read but still functional! LOL The actual string I was testing with was:

(?:mī0īī3īī0īī4īī0īhī0īī5īī0īr|mī0īī6īī0ī(?:dī0ī)+ī5īī0īrī0ī)?(?:f|pī0īhī0ī)(?:ī1īī0ī)+(?:(?:cī0ī)?k|qī0ī)+(?:ī6īī0īī5ī|r|ī4īī0īī7īī0īrī0īd|cī0īn|jī0īī3īī0īī5ī|ī2īī0ī(?:ī8ī|ī6īī0īrī0ī)dī0īī5īī0īn)*

The full script gets super complicated, I'm using ī\d+ī as a placeholder for anything in [...] or (?!...).

Here's my test script for this section in its entirety if you want to copy and paste it anywhere:

$pattern = '(?:mī0īī3īī0īī4īī0īhī0īī5īī0īr|mī0īī6īī0ī(?:dī0ī)+ī5īī0īrī0ī)?(?:f|pī0īhī0ī)(?:ī1īī0ī)+(?:(?:cī0ī)?k|qī0ī)+(?:ī6īī0īī5ī|r|ī4īī0īī7īī0īrī0īd|cī0īn|jī0īī3īī0īī5ī|ī2īī0ī(?:ī8ī|ī6īī0īrī0ī)dī0īī5īī0īn)*';

$pattern =~ s{(
 \Q(?:\E
 (
  (?:
   [^()]+
   |
   (?1)
  )*
 )
 ī0ī\)
)}
{(?:(?:$2)ī0ī)}x;

print $pattern;
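An alternative sketch of the same rewrite that keeps the atomic (?>...) group. It is not the code from this thread: the group names body and bal are made up for illustration, and it assumes Perl 5.10 or later. Nested groups are matched by a separate balanced-parentheses rule in a (?(DEFINE)...) block, and the trailing ī0ī is checked with a lookbehind instead of being matched after the atomic group:

```perl
use strict;
use warnings;
use utf8;                               # the source contains literal ī
binmode(STDOUT, ':encoding(UTF-8)');

# Simplified pattern from the first post, comments and whitespace removed.
my $pattern = '(?:mī0īr|mī0ī(?:dī0ī)+rī0ī)+(?:a|b)+';

my $re = qr{
    \(\?:                               # opening "(?:" of the group to rewrite
    (?<body>
        (?: (?>[^()]+) | (?&bal) )*     # text and any balanced nested groups
    )
    (?<=ī0ī) \)                         # closing ")" only if preceded by ī0ī
    (?(DEFINE)
        (?<bal> \( (?: (?>[^()]+) | (?&bal) )* \) )   # any balanced (...)
    )
}x;

$pattern =~ s{$re}{
    my $body = $+{body};
    $body =~ s/ī0ī\z//;                 # drop the trailing ī0ī from the body
    "(?:(?:$body)ī0ī)";                 # and put it back after a new inner group
}e;

print "$pattern\n";
# (?:(?:mī0īr|mī0ī(?:dī0ī)+r)ī0ī)+(?:a|b)+
```

Because the nested groups go through bal rather than through the outer rule, they do not have to end in ī0ī themselves.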

Since it's working right now, though, I'm going to spend a few minutes (give or take) tweaking everything, and eventually I'll create a thread in the Perl section with the entire profanity filter. It's WAY more complicated than I intended, and really a lot slower than I wanted, so it's probably going to be an ongoing work in progress.

[edited by: phranque at 7:56 pm (utc) on Oct 31, 2021]
[edit reason] disable graphic smile faces [/edit]

lucy24

9:00 pm on Oct 31, 2021 (gmt 0)

Holy ###. What do all those ī (diacritic which doesn't occur in most languages I can think of, and is exceedingly rare in the rest, like oī indicating two syllables rather than the usual diphthong) represent in real life? It all makes me think of when OCR encounters an unexpected ü and, not knowing we're not in English any more, bravely renders it as ii.

Or are we delving into cuss words I have never encountered and may be too old to learn?

csdude55

9:41 pm on Oct 31, 2021 (gmt 0)

Wait... I spoke too soon. I would have bet money that it was working properly last night, but now it's not :-/

This works as expected:

$pattern = 'ī3ī+(?:ī9ī|ī10ī(?:ī7ī)?ī7ī|hī6ī)?ī8ī|(?:ī7ī)?';

# should return anything that starts with (?: and ends with a matching )?
@more =
 $pattern =~ m{(
 \Q(?:\E
  (?:
  (?>[^()]+)  |
  (?1)
  )*
 \)\?
 )}xsg;

for (@more) {
 print "$_\n";
}

# Returns:
# (?:ī9ī|ī10ī(?:ī7ī)?ī7ī|hī6ī)?
# (?:ī7ī)?

But this returns nothing:

$pattern = 'ī3ī+(?:ī9ī|ī10ī(?:ī7ī)?ī7ī|hī6ī)?ī8ī|(?:ī7ī)?';

# should return anything that starts with (?: and ends with a matching ī)?
@more =
 $pattern =~ m{(
 \Q(?:\E
  (?:
  (?>[^()]+)  |
  (?1)
  )*
 ī\)
 )}xsg;

for (@more) {
 print "$_\n";
}

I'm expecting it to return the same thing, since the results in the first one all end with ī). I'm at a complete loss as to why one works but the other doesn't :-/
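A probable explanation, offered as a sketch rather than a verified diagnosis of the full script: the atomic (?>[^()]+) swallows the final ī along with everything else before the ), and being atomic it never gives that ī back, so the ī\) that follows has nothing to match. One way around it is to leave the atomic group alone and check for the ī with a lookbehind:

```perl
use strict;
use warnings;
use utf8;                               # the source contains literal ī
binmode(STDOUT, ':encoding(UTF-8)');

my $pattern = 'ī3ī+(?:ī9ī|ī10ī(?:ī7ī)?ī7ī|hī6ī)?ī8ī|(?:ī7ī)?';

my @more = $pattern =~ m{(
    \(\?:                  # literal "(?:"
    (?:
        (?>[^()]+)         # atomic run of non-parens; it swallows the last ī too
        |
        (?1)               # nested group
    )*
    (?<=ī) \)              # look back for ī, then match the closing ")"
)}xg;

print "$_\n" for @more;
# (?:ī9ī|ī10ī(?:ī7ī)?ī7ī|hī6ī)
# (?:ī7ī)
```

One caveat: (?1) reruns all of group 1, lookbehind included, so any nested group also has to end in ī) for the outer group to match.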

Holy ###. What do all those ī (diacritic which doesn't occur in most languages I can think of, and is exceedingly rare in the rest, like oī indicating two syllables rather than the usual diphthong) represent in real life?

Oh, Lucy... you just have no idea how complicated my script has gotten! LOL It got to a point where I would run one expression to modify the original, but then a second expression would run over the things that I'd added in the original and mess things up.

To get around this, first I convert all ī to i in the original text, then I use ī[0-9]+ī in place of the text that I want to protect from future expressions (with the original stored in a corresponding $hash[[0-9]+]). Then after it's all done, I convert the ī[0-9]+ī back to the original.

The ī0ī (the first one in the array) represents:

(?:\W(?!\s\b)|<.+?>)*

I shove this between each character to match f-o-o, f o o, etc. Since I would later convert all "s" to "[s\$z]" (to match "thi$" the same as "this") I didn't want the expression to change that to (?:\W(?!\[s\$z]\b)|<.+?>)* so I used a safe alternate.
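The protect-and-restore idea can be sketched roughly like this. The array name, the sample pattern, and the numbering are made up for illustration (in the real script ī0ī is reserved for the separator above), and it assumes every ī in the input has already been converted to i:

```perl
use strict;
use warnings;
use utf8;                               # the source contains literal ī
binmode(STDOUT, ':encoding(UTF-8)');

my @saved;                              # hypothetical store for protected text
my $pattern = 'f[o0]+(?!bar)b';

# Protect: swap each [...] or (?!...) for a numbered ī placeholder.
$pattern =~ s{ ( \[ [^\]]* \] | \(\?! [^)]* \) ) }
             { push @saved, $1; 'ī' . $#saved . 'ī' }gex;
my $protected = $pattern;               # fī0ī+ī1īb

# A later rewrite that would otherwise also hit the "b" inside (?!bar).
$pattern =~ s/b/[b8]/g;

# Restore: put the saved text back in place of each placeholder.
$pattern =~ s/ī(\d+)ī/$saved[$1]/g;

print "$protected\n$pattern\n";
# fī0ī+ī1īb
# f[o0]+(?!bar)[b8]
```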

[edited by: phranque at 5:34 am (utc) on Nov 1, 2021]
[edit reason] disable graphic smile faces [/edit]

csdude55

10:03 pm on Oct 31, 2021 (gmt 0)

In retrospect, I said that this one is working correctly but... not exactly:

$pattern = 'ī3ī+(?:ī9ī|ī10ī(?:ī7ī)?ī7ī|hī6ī)?ī8ī|(?:ī7ī)?';

# should return anything that starts with (?: and ends with a matching )?
@more =
 $pattern =~ m{(
 \Q(?:\E
  (?:
  (?>[^()]+)  |
  (?1)
  )*
 \)\?
 )}xsg;

for (@more) {
 print "$_\n";
}

# Returns:
# (?:ī9ī|ī10ī(?:ī7ī)?ī7ī|hī6ī)?
# (?:ī7ī)?

It returns the very last group of (?:ī7ī)?, but NOT the one that's nested inside of (?:ī9ī|ī10ī(?:ī7ī)?ī7ī|hī6ī)?
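This is expected behaviour for m//g: each search resumes after the end of the previous match, so a nested group that was consumed as part of the outer match is never reported on its own. A sketch of one common workaround, wrapping the whole thing in a lookahead so that nothing is consumed and every starting position gets tried:

```perl
use strict;
use warnings;
use utf8;                               # the source contains literal ī
binmode(STDOUT, ':encoding(UTF-8)');

my $pattern = 'ī3ī+(?:ī9ī|ī10ī(?:ī7ī)?ī7ī|hī6ī)?ī8ī|(?:ī7ī)?';

my @more = $pattern =~ m{
    (?=                        # zero-width, so the match itself consumes nothing
        (                      # capture group 1, the recursion target
            \(\?:
            (?: (?>[^()]+) | (?1) )*
            \)\?
        )
    )
}xg;

print "$_\n" for @more;
# (?:ī9ī|ī10ī(?:ī7ī)?ī7ī|hī6ī)?
# (?:ī7ī)?
# (?:ī7ī)?
```

The second line is the nested group and the third is the one at the end of the string.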

Sheesh. Maybe I'm going down the wrong road here...

csdude55

10:38 pm on Oct 31, 2021 (gmt 0)

I'm gonna back up a few feet, please don't waste unnecessary time on this one as I don't think it's the right solution for my problem, after all. I appreciate the help, Lucy!

lucy24

2:30 am on Nov 1, 2021 (gmt 0)

But wait ...

ī3ī+(?:ī9ī|ī10ī

(et cetera). That part would never match, because ī+ has already gobbled up all the ī so there would be nothing left for the ī lookahead.

And this

(?:\W(?!\s\b)

(et cetera). Since \W and \s are by definition non-word characters, the form \s\b would have no meaning.

In some situations you might find it useful to change original - hyphen either to _ lowline (a word character, so things like ###-face would be read as one word) or to, say, ~ tilde. That way you no longer have to worry about special meanings of - in RegEx brackets.

Final, disheartening thought: The time you spend devising all these tests and making everything foolproof . . . will be easily matched by a small subset of forum users moving heaven and earth to devise methods of bypassing the filters.

don't waste unnecessary time

Time spent playing with Regular Expressions is never wasted ;)

csdude55

3:54 am on Nov 1, 2021 (gmt 0)

That part would never match, because ī+ has already gobbled up all the ī so there would be nothing left for the ī lookahead.

At this point, $pattern is just a string that I'm manipulating. When I use it as a regex later, ī3ī+ will be converted to something like [a@]+.

And this
(?:\W(?!\s\b)
(et cetera). Since \W and \s are by definition non-word characters, the form \s\b would have no meaning.

Makes sense. My original was simply:

(?:\W|<.+?>)*

but then it was catching the last whitespace at the end of the word when I didn't want it to. My (?!\s\b) was an attempt to stop that.

This comes closer:

(?:[^\w\s]|<.+?>)*

but then if someone uses a whitespace to get around it (eg, "f o o") then it doesn't match.

Any suggestions on how to modify it so that it'll match "foo", "f*o*o", "f o o", etc, but not match any trailing whitespace or punctuation?

This is the basis of the other thread I created, trying to fix this short expression to match everything like I'm expecting. If I can get it to work properly then I can eliminate another regex AND I won't need to worry about recursion :-)

Final, disheartening thought: The time you spend devising all these tests and making everything foolproof . . . will be easily matched by a small subset of forum users moving heaven and earth to devise methods of bypassing the filters.

Gah, tell me about it! LOL Worse, when they DO slip through, somehow everyone blames ME for it! It's a never ending battle :-/

[edited by: phranque at 5:35 am (utc) on Nov 1, 2021]
[edit reason] disable graphic smile faces [/edit]

csdude55

4:23 am on Nov 1, 2021 (gmt 0)

Time spent playing with Regular Expressions is never wasted ;)

Ya know, I really have learned a lot about regex in the last 3 weeks of working on this script! It is kinda fun, but I'll appreciate it a lot more when it's actually working right! LOL

I'm sure you know this, but you (Lucy) are the resident regex pro around here :-D When I post a regex question there's no doubt that you'll be the first (probably only) one to reply. I remember awhile back I asked a regex question, and the response was "just wait for Lucy to get here" LOL

lucy24

6:16 pm on Nov 1, 2021 (gmt 0)

Tee hee.

<.+?>

It might be preferable to say

<[^>]+>

But either way you have to consider people saying, er, &lt; (form I actually use in Disqus-based forums because they're coded to auto-convert anything in <angle brackets> whether it's an attested html tag or not).

but not match any trailing whitespace or punctuation

I don't know if it would work to do it as packages instead. If your set of all possible intervening characters is \q--locution invented at random to represent

([\W_]|<[^<>]+>)

--then you're looking at
\w(\q\w)+
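A rough sketch of the "packages" idea, with assumptions labeled: the word foo and the test strings are placeholders, and the separator is repeated with * (as with the ī0ī separator earlier in the thread) rather than required exactly once. The point is that the separator only ever sits between letters, so nothing after the last letter can be matched:

```perl
use strict;
use warnings;

# Stand-in for \q: a non-word character, underscore, or tag,
# repeated zero or more times between letters.
my $sep  = '(?:[\W_]|<[^<>]+>)*';
my $word = 'foo';                        # placeholder word for illustration

# Put the separator between the letters only, never after the last one.
my $inner = join $sep, map { quotemeta($_) } split //, $word;
my $re    = qr/($inner)/i;

my @found;
for my $text ('foo', 'f*o*o', 'f o o ', 'f<b>o</b>o!') {
    push @found, $1 if $text =~ $re;
}

print "[$_]\n" for @found;
# [foo]
# [f*o*o]
# [f o o]
# [f<b>o</b>o]
```

The trailing space and the trailing ! are left out of the matches because the pattern ends on the last letter.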

How long is your list of Bad Words? I mean the underlying words, not their disguises. I suppose it's more than George Carlin's seven (one of which you can now say on ordinary broadcast TV).

I think it may become necessary to strip away any and all <blahblah> first, because otherwise the blahblah--which most often will consist of nothing but word characters--will merge into the surrounding word:
foo<i>bar</i>
>>
fooibari
creating a false negative. (Which, of course, is one way to sneak in Language on the present site, except we're all grownups and would never dream of doing such a thing.) Or else \W has to be replaced with [^\w<>].

csdude55

9:08 pm on Nov 1, 2021 (gmt 0)

How long is your list of Bad Words? I mean the underlying words, not their disguises.

My current list has 44 words, not including, for example:

(?:mother|mudder)...

or
...(?:hole|clown)

I've also included racist terms, vulgar words and phrases, some non-English words and phrases, and a very few other words and phrases that needed to be removed for some other reason (like, when someone tries to libel a specific person and I need to filter that person's name).

The final pattern that I'm building is currently about 10,000 characters long :-O

I think it may become necessary to strip away any and all <blahblah> first, because otherwise the blahblah--which most often will consist of nothing but word characters--will merge into the surrounding word:
foo<i>bar</i>
>>
fooibari
creating a false negative.

Do you mean, remove all HTML tags from the string, then delete the (<.+?>) from the pattern? I can do that; I remove [...] and (?!...) from the pattern and plug them back in later (using the ī\d+ī), so I could apply the same principle to the string.

[edited by: phranque at 4:44 am (utc) on Nov 2, 2021]
[edit reason] disable graphic smile faces [/edit]

lucy24

10:29 pm on Nov 1, 2021 (gmt 0)

Assuming the site uses html-style markup with angle < > brackets, that's what you would strip out. If it uses php/bb-style markup with square [ ] brackets like the present site, you'd strip out those instead. It's the difference between

foo<b>ar
and
foo[b]ar

(and yes, I had to sneak in an extra pair of format tags to keep the site from yapping at me about [ b ] in the second version!)

Pattern:

</?\w+>
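A minimal sketch of applying that pattern to the text before running the word filter. The sample string is made up, and tags that carry attributes (such as a link with an href) would not be caught by this exact pattern:

```perl
use strict;
use warnings;

my $text = 'foo<i>bar</i> and <b>baz</b>';

# Copy the text, then delete simple opening and closing tags from the copy.
(my $clean = $text) =~ s{</?\w+>}{}g;

print "$clean\n";
# foobar and baz
```

For bracket-style markup the same idea applies with \[/?\w+\] in place of the angle-bracket pattern.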