Yet more fun with regex: getting $2 inside of THIS group

         

csdude55

9:04 pm on Nov 2, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hopefully I'll be taking a break from regex after this one!

I have a pattern that looks like this:

$pattern = <<EOF;
(?:
  (
    b[i!1y]rd |
    c[o0]w
  )?
  food
|
  (
    b[a@*]+d+ |
    c[a@*]ndy
  )?
  apple
)
EOF


And the string:

$str = 'my candyapple';


I want to replace, for example, "birdfood" with "birdcorn", and "candyapple" with "candycorn".

The expression:

s{(\b|^|[\s.,;:'"]|\$)$pattern(\b|[\s.,;:'"]|$)}
{$1$2corn$+}xgi;


(the complicated (\b|^|[\s.,;:'"]|\$) became necessary when dealing with non-alphanumeric characters)

The problem here is that with 'candyapple' the first optional group never participates in the match, so $2 comes back empty and "candy" ends up in $3 instead of $2. But if $str = 'birdfood' then $2 equals "bird".
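To make the numbering visible, here is a minimal sketch that runs the same pattern against both strings and reports what lands in $2 and $3. The helper name groups_2_and_3 is made up for illustration, and the single-quoted heredoc is just a precaution against accidental interpolation.

```perl
use strict;
use warnings;

# Same alternatives as the pattern above; nothing in here is interpolated.
my $pattern = <<'EOF';
(?:
  ( b[i!1y]rd | c[o0]w )?
  food
|
  ( b[a@*]+d+ | c[a@*]ndy )?
  apple
)
EOF

# Hypothetical helper: return the contents of groups 2 and 3 after a match.
sub groups_2_and_3 {
    my ($str) = @_;
    return unless $str =~ /(\b|^|[\s.,;:'"]|\$)$pattern(\b|[\s.,;:'"]|$)/xi;
    return ($2, $3);
}

for my $str ('my birdfood', 'my candyapple') {
    my ($g2, $g3) = groups_2_and_3($str);
    # An optional group that took no part in the match is undef, not "".
    printf "%-14s \$2=%-6s \$3=%s\n", $str, $g2 // 'undef', $g3 // 'undef';
}
```

Groups are numbered by the position of their opening parenthesis, so the second alternative's group is always $3 no matter which branch matched.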

I found that I can sorta/kinda make it work if I name the groups and use %+, like so:

$pattern = <<EOF;
(?:
  (?<PLACEHOLDER>
    b[i!1y]rd |
    c[o0]w
  )?
  food
|
  (?<PLACEHOLDER>
    b[a@*]+d+ |
    c[a@*]ndy
  )?
  apple
)
EOF

$_ = 'my candyapple';

s{(\b|^|[\s.,;:'"]|\$)$pattern(\b|[\s.,;:'"]|$)}
{$1$+{PLACEHOLDER}corn$+}xgi;


So I know it "can" be done, the data is stored somewhere... I just can't find where.

The docs imply that I can use $+[2] (or maybe $-[2]), but those are elements of the @+ and @- arrays, which hold the end and start offsets of each group rather than the matched text:

[perldoc.perl.org...]
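As a small sketch of what those two arrays hold: $-[2] is where group 2 starts and $+[2] is where it ends, so the text can be rebuilt with substr. The simplified pattern below is only for illustration, not the real one from this script.

```perl
use strict;
use warnings;

my $str = 'my candyapple';
my ($start, $end, $from_offsets);

if ($str =~ /(bird)?(candy)apple/) {
    # @- holds start offsets and @+ holds end offsets, indexed by group number.
    ($start, $end) = ($-[2], $+[2]);
    # Rebuilding the capture from its offsets gives the same text as $2.
    $from_offsets = substr($str, $start, $end - $start);
    print "group 2 spans $start..$end: $from_offsets\n";
}
```

So the offsets are useful, but they are still tied to a fixed group number and do not get around the $2 versus $3 problem.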

Any other suggestions on how to get what I need without going back and naming all of the groups with the same name?
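One construct that might fit here, though nobody in the thread has mentioned it, is Perl's branch reset group (?|...), available since 5.10. Every alternative inside it starts numbering its groups from the same point, so the optional prefix is $2 in both branches and the trailing boundary group becomes $3. What follows is only a sketch under that assumption: the to_corn name is made up, and /e with // is used so an unmatched optional group does not raise a warning.

```perl
use strict;
use warnings;

# (?| ... ) instead of (?: ... ): each branch reuses the same group numbers.
my $pattern = <<'EOF';
(?|
  ( b[i!1y]rd | c[o0]w )?
  food
|
  ( b[a@*]+d+ | c[a@*]ndy )?
  apple
)
EOF

# Hypothetical helper wrapping the substitution from the post.
sub to_corn {
    my ($text) = @_;
    # $1 = leading boundary, $2 = optional prefix (either branch), $3 = trailing boundary.
    $text =~ s{(\b|^|[\s.,;:'"]|\$)$pattern(\b|[\s.,;:'"]|$)}
              {$1 . ($2 // '') . 'corn' . $3}xgie;
    return $text;
}

print to_corn('my candyapple'), "\n";
print to_corn('my birdfood'), "\n";
```

The trade-off is that the groups no longer need names, but every branch has to put its prefix group in the same position for the shared number to mean the same thing.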

lucy24

10:11 pm on Nov 2, 2021 (gmt 0)



> Hopefully I'll be taking a break from regex after this one!
Yah, when I saw the third regex-related post in my Unread list, my first thought was OK, now you’re just trolling :)

> (\b|^|[\s.,;:'"]|\$)
> (\b|[\s.,;:'"]|$)
If I'm understanding it right ... is the issue that some of the pipe-delimited items are capturable strings while the others are anchors? But I think I'm missing something anyway, because in each case \b would subsume everything else in the list. (It also unfortunately includes hyphens, so “foo-bar” would be perceived as two words whether you want it to or not.)

Why can’t the dollar sign \$ be included with the punctuation marks?

Seems like you’d instead want something like
([.,:;“”]*\bblahblah)
if you want to include the punctuation in the capture, else
[.,:;“”]*\b(blahblah)
where blahblah is the pattern, assuming it begins with a \w character. Note position of the \b which has to be immediately adjacent to the word.

:: vague mental association with my frequently-used pattern
(\w[\p{Alpha}’-]*)((?:</i>)?\p{Punct}*) ?\[\*\* ?(?:error|typo) for ([^\]]+)\]
used in constructing errata lists, where any punctuation adjoining a word needs to be preserved separately ::

:: further detour to regex-info site ::

If I’m reading it right, Perl actually supports both of the forms \p{Punct} and [[:punct:]] but it can’t do && intersections, more’s the pity.
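One caveat worth checking before swapping one form for the other: the two forms are both accepted, but they do not match the same ASCII characters. A quick sketch that lists the printable ASCII characters matched by [[:punct:]] but not by \p{Punct}:

```perl
use strict;
use warnings;

# Printable ASCII, excluding the space character.
my @ascii = map { chr } 0x21 .. 0x7E;

# Characters the POSIX class accepts but the Unicode Punctuation property rejects.
my @posix_only = grep { /[[:punct:]]/ && !/\p{Punct}/ } @ascii;
my $posix_only = join '', @posix_only;

print "only in [[:punct:]]: $posix_only\n";
```

The leftovers are symbols rather than punctuation in Unicode terms, and the dollar sign is one of them, so \p{Punct} alone would not cover the \$ case discussed above.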

csdude55

10:37 pm on Nov 2, 2021 (gmt 0)



> Yah, when I saw the third regex-related post in my Unread list, my first thought was OK, now you’re just trolling :)

Haha, no, I'm just THIS close to being burned out! LOL My rule of thumb is to work on it for at least a day before asking for help... and I usually prefer to sleep on it, because sometimes a fresh mind will see something I was missing. This script has gotten SUPER complicated, though.

> If I'm understanding it right ... is the issue that some of the pipe-delimited items are capturable strings while the others are anchors? But I think I'm missing something anyway, because in each case \b would subsume everything else in the list. (It also unfortunately includes hyphens, so “foo-bar” would be perceived as two words whether you want it to or not.)

I mean, that's a separate issue, but... yeah :-) It's not really related to the main question of the thread, just another thing I'm dealing with.

When I was testing with non-alphanumeric strings (like "#$$"), \b wasn't recognizing the boundaries at all, which is why I plugged in the alternates. My theory here was, "try \b first, but if that doesn't work then try these next". Am I right on that?
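A short sketch of why that happens: \b only matches between a \w character and a \W character (or a string edge next to a \w character), so a token made entirely of punctuation has no boundary next to whitespace or at the ends of the string. Perl tries alternatives left to right, so the "try \b first, then the others" reading is right. The sub names here are made up for the demonstration.

```perl
use strict;
use warnings;

# quotemeta turns the literal token into \#\$\$ so it is safe inside a regex.
my $token = quotemeta '#$$';

# \b on both sides only.
sub plain_b {
    my ($s) = @_;
    return $s =~ /\b$token\b/ ? 1 : 0;
}

# \b first, then the fallbacks from the post.
sub with_alts {
    my ($s) = @_;
    return $s =~ /(?:\b|^|[\s.,;:'"])$token(?:\b|[\s.,;:'"]|$)/ ? 1 : 0;
}

for my $s ('#$$', 'a #$$ b', 'a#$$b') {
    printf "%-8s plain_b=%d with_alts=%d\n", $s, plain_b($s), with_alts($s);
}
```

Only 'a#$$b' satisfies the plain \b version, because that is the only case where the token touches word characters.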

> Why can’t the dollar sign \$ be included with the punctuation marks?

Cause I coded it at like 3am... LOL

> If I’m reading it right, Perl actually supports both of the forms \p{Punct} and [[:punct:]]

Awesome, thanks! I had to look it up... [[:punct:]] matches:
[-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]

So I still had to do this, but it's both more accurate and shorter:
(\b|^|[[:punct:]]|[\s\$])
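Since the list above already contains the dollar sign, the extra \$ looks redundant, and the two bracketed alternatives could probably be merged into one class. A sketch comparing the posted form with a merged form; the merged version is an assumption, and lead_capture is a made-up helper:

```perl
use strict;
use warnings;

my $lead_posted = qr/(\b|^|[[:punct:]]|[\s\$])/;   # as posted above
my $lead_merged = qr/(\b|^|[[:punct:]\s])/;        # assumed equivalent, one class

# Hypothetical helper: show what the leading group captured before "food".
sub lead_capture {
    my ($lead, $text) = @_;
    return $text =~ /${lead}food/ ? "[$1]" : 'no match';
}

for my $text ('food', '$food', ' food', '.food', 'xfood') {
    printf "%-6s posted=%-8s merged=%s\n",
        $text, lead_capture($lead_posted, $text), lead_capture($lead_merged, $text);
}
```

On these samples both versions capture the same thing, which suggests the shorter class is a safe swap, though it is worth rerunning against the real data.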