Counting nested () in a regex

         

csdude55

9:27 pm on Jan 4, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I admit, regex isn't always my wheelhouse :(

In this one, I have a nested round bracket:

s/(\b)(\@ss|a[^\w]*[s\$]([^\w]*[s\$])+)(\b)/$1***$3/gi;

# To break it down for legibility:
(\b)
(
\@ss
|
a[^\w]*
[s\$]
(
[^\w]*[s\$]
)+
)
(\b)
/$1***$3/gi;


So when I'm counting, am I correct that the first (\b) is $1, then (\@ss|a[^\w]*[s\$]([^\w]*[s\$])+) is $2, the last (\b) is $3, and then the nested ([^\w]*[s\$])+ inside of $2 is $4?

lucy24

9:39 pm on Jan 4, 2017 (gmt 0)




Wha, wha, wha--. Why are you capturing \b in the first place? It's an anchor.

The exact content of $1 and $3 (actually $4, see below) may depend on your RegEx engine. I tried it in one text editor just to verify that it wouldn't result in an error message. It yielded empty captures: for example, if your replacement contains “$1”, that part comes out as “” and nothing else changes.

am I correct that

In general, no, but again you'll want to check the behavior of your specific RegEx engine. Captures are ordinarily counted from left to right, looking only at the opening parenthesis, so if you've got
(blahblah(otherstuff))(stillmorestuff)
then
$1 is "blahblahotherstuff"
$2 is "otherstuff"
$3 is "stillmorestuff"
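
A quick way to check the counting rule is to ask the engine for its groups directly. The sketch below uses Python's re module rather than Perl, on the assumption that both number groups by opening parenthesis (worth confirming in your own engine). The test sentence "what an a.s.s move" is made up for illustration.

```python
import re

# Toy pattern from the explanation above: groups are numbered by the
# position of their opening parenthesis, counting left to right.
toy = re.compile(r"(blahblah(otherstuff))(stillmorestuff)")
toy_groups = toy.match("blahblahotherstuffstillmorestuff").groups()

# The pattern from the original question, carried over unchanged.
censor = re.compile(r"(\b)(\@ss|a[^\w]*[s\$]([^\w]*[s\$])+)(\b)", re.IGNORECASE)
sample = "what an a.s.s move"  # hypothetical test sentence
censor_groups = censor.search(sample).groups()

# Group 3 is the nested group, so a replacement built on group 3
# re-inserts the tail of the matched word. Group 4 is the trailing (\b).
with_group_3 = censor.sub(r"\1***\3", sample)
with_group_4 = censor.sub(r"\1***\4", sample)

print(toy_groups)     # ('blahblahotherstuff', 'otherstuff', 'stillmorestuff')
print(censor_groups)  # ('', 'a.s.s', '.s', '')
print(with_group_3)   # what an ***.s move
print(with_group_4)   # what an *** move
```

In this engine the nested group lands in group 3 and the final (\b) in group 4, which is why the original replacement leaves part of the word behind.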

Edit: When I first saw the subject line I thought it would be a question about checking for mismatched parentheses in the test string--something that has occasionally come up when looking at questionable User-Agents in access controls. (The conclusion there was that it can be done, but probably isn't worth the trouble.)

csdude55

10:04 pm on Jan 4, 2017 (gmt 0)




Wha, wha, wha--. Why are you capturing \b in the first place? It's an anchor.


You know, I honestly don't remember what my logic was... that particular regex is at least a few years old. I'm sure there was some logic that made sense to me at the time, but like I said, regex isn't exactly my wheelhouse :/

Thanks for the explanation, though! That makes perfect sense.

keyplyr

11:21 pm on Jan 4, 2017 (gmt 0)




Why are you capturing \b in the first place? It's an anchor
It's a line break.

lucy24

12:57 am on Jan 5, 2017 (gmt 0)




It's a line break.

In general a line break is \n -- or possibly \r or \r\n in certain Mac-related circumstances*. Is there some RegEx dialect where \b means a Unicode Line Separator?

Poring over [regular-expressions.info...] I find one dialect where \b is a literal backspace character (word boundary is \y), though I can't honestly think of anything you'd be less likely to want to match! (Hm. Maybe in obfuscated scripts involving injection attempts by malign robots?)


* Long ago, I was obliged to use \r\n line breaks for putative cross-platform compatibility. Major annoyance, because it meant that the $ anchor didn't work. This may have changed later.
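
The \r\n annoyance in the footnote can be reproduced in at least one current engine. This is a small sketch in Python's re module, where $ in multiline mode matches just before \n and so never sits directly after a word that is followed by \r. Other engines may behave differently, and the sample text is made up.

```python
import re

# Hypothetical two-line text with Windows-style \r\n line endings.
crlf_text = "first line\r\nsecond line\r\n"

# In multiline mode, $ matches just before each \n. The \r sits between
# "line" and that position, so the plain anchor finds nothing.
plain = re.findall(r"line$", crlf_text, re.MULTILINE)

# Allowing an optional \r before the anchor makes both lines match.
tolerant = re.findall(r"line(?=\r?$)", crlf_text, re.MULTILINE)

print(plain)     # []
print(tolerant)  # ['line', 'line']
```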

keyplyr

1:02 am on Jan 5, 2017 (gmt 0)




I use it in jQuery, but that may not be regex per se.

upside

6:01 am on Jan 15, 2017 (gmt 0)

10+ Year Member



To reinforce what lucy24 said, Perl defines the \b metacharacter as matching a word boundary and considers the match to be zero-width. In other words, (\b) will always capture nothing.
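
The zero-width behaviour is easy to confirm. The sketch below is in Python's re module, not Perl, on the assumption that \b is the same kind of zero-width word-boundary assertion in both. The sample strings are made up.

```python
import re

text = "hello world"  # hypothetical sample

# Every place \b matches is a position, not a character, so the
# captured text is always the empty string.
positions = [m.start() for m in re.finditer(r"(\b)", text)]
captures = [m.group(1) for m in re.finditer(r"(\b)", text)]

# Because the captures are empty, wrapping \b in parentheses and putting
# the groups back in the replacement changes nothing.
captured = re.sub(r"(\b)cat(\b)", r"\1dog\2", "a cat sat")
uncaptured = re.sub(r"\bcat\b", "dog", "a cat sat")

# The boundary still does its job as an assertion: no match inside a word.
inside_word = re.sub(r"\bcat\b", "dog", "concat")

print(positions)   # [0, 5, 6, 11]
print(captures)    # ['', '', '', '']
print(captured)    # a dog sat
print(uncaptured)  # a dog sat
print(inside_word) # concat
```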