Named reference as a replacement in a second regex - Perl Server Side CGI Scripting forum at WebmasterWorld - WebmasterWorld

Named reference as a replacement in a second regex

csdude55

9:02 pm on Dec 1, 2022 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

This is for the sake of my own education; I'm not necessarily trying to fix a problem.

I currently have this:

$_ = q{<a href='https://www.lorem.com'>https://www.ipsum.com</a>};

while (m#(<a[^>]* href=(["']).*?\2[^>]*>(.*?)</a>)#gsi) {
 $patt = $1;
 $repl = $3;

 if ($repl =~ /http/i || $patt =~ /rel=["']nofollow/i) {
  s#\Q$patt\E#$repl#gsi;
 }
}

I thought that I could smooth it out a little with a named reference of <repl>, like so:

while (m#(<a[^>]* href=(["']).*?\2[^>]*>(?<repl>.*?)</a>)#gsi) {
 $patt = $1;

 if ($+{repl} =~ /http/i || $patt =~ /rel=["']nofollow/i) {
  s#\Q$patt\E#$+{repl}#gsi;
 }
}

But the replacement doesn't work, and I bet that a lot of you immediately see why!

In s#\Q$patt\E#$+{repl}#gsi;, it's looking for <repl> in \<a\ href\=\'https\:\/\/www\.lorem\.com\'\>https\:\/\/www\.ipsum\.com\<\/a\>, which obviously doesn't exist. So I think that $+{repl} is overwritten by either null, false, undefined, or ''.

I'm pretty sure that the answer to this is "no", but is there a magic trick to make $+{repl} retain its value unless explicitly overwritten? Other than setting it to another variable, I mean?
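
For anyone following along, here is a minimal sketch of the behaviour described above. It assumes Perl 5.10 or later (named captures and the // operator); the variable names $html, $broken and $fixed are made up for illustration. The replacement side of s### is evaluated after its own pattern has matched, so by then %+ describes that match, which has no group called repl.

```perl
use strict;
use warnings;

my $html = q{<a href='https://www.lorem.com'>https://www.ipsum.com</a>};

# First match: group 1 is the whole tag, the named group is the link text.
$html =~ m#(<a[^>]* href=(["']).*?\2[^>]*>(?<repl>.*?)</a>)#si
    or die "no link found";
my $patt = $1;
my $repl = $+{repl};    # copy while %+ still describes the first match

# The replacement side runs after the s### pattern has matched, so %+ now
# describes that match, which has no group named "repl".
(my $broken = $html) =~ s#\Q$patt\E#$+{repl} // ''#e;

# The copied scalar is unaffected by later matches.
(my $fixed = $html) =~ s#\Q$patt\E#$repl#;

print "with \$+{repl}: '$broken'\n";
print "with \$repl:    '$fixed'\n";
```

As far as I know there is no switch that makes %+ "sticky": it always reflects the last successful match in the current scope, so the choices are to copy the value first or to do the whole job in one substitution whose own pattern contains the named group.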

lucy24

5:48 pm on Dec 2, 2022 (gmt 0)

So I think that $+{repl} is overwritten by either null, false, undefined, or ''.

But, but, splutter, on its first occurrence, <repl> hasn't had time to get overwritten by anything. Is there a preceding line where it is seeded with a starting value?

Regardless of circumstances, a non-final .*? can be iffy. That doesn't affect the current question, but it's something to look at further down the line.

csdude55

6:10 pm on Dec 2, 2022 (gmt 0)

But, but, splutter, on its first occurrence, <repl> hasn't had time to get overwritten by anything. Is there a preceding line where it is seeded with a starting value?

No, but if there's no <repl> then the while would fail, wouldn't it?

If I print $+{repl}; just before the s#\Q... line then it prints what I'm expecting. But then the replace returns blank as if $+{repl} is empty.

Regardless of circumstances, a non-final .*? can be iffy.

Wouldn't the ? make it stop at the first </a>? And if there's no </a> then the while would fail. Or am I totally wrong on that, too?
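
A quick way to check this is to run the lazy and greedy forms side by side. This is only a sketch with a made-up two-link string ($text, $lazy and $greedy are illustrative names, not from the script above):

```perl
use strict;
use warnings;

my $text = q{<a href='/one'>one</a> and <a href='/two'>two</a>};

# Lazy: .*? grows one character at a time and stops at the first </a>.
my ($lazy) = $text =~ m#<a[^>]*>(.*?)</a>#s;

# Greedy: .* grabs everything, then backs off only as far as the LAST </a>.
my ($greedy) = $text =~ m#<a[^>]*>(.*)</a>#s;

print "lazy:   $lazy\n";
print "greedy: $greedy\n";
```

Here the lazy version captures just "one", while the greedy version runs on through the second link.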

lucy24

6:39 pm on Dec 2, 2022 (gmt 0)

No, but if there's no <repl> then the while would fail, wouldn't it?

And wouldn't that mean that the loop never executes in the first place, even once, because the condition fails at the outset?

Wouldn't the ? make it stop at the first </a>?

Probably, but if you're matching a URL of some kind, it might be safer to constrain the pattern to characters that will actually occur, like:

(["'])[\w/.]+\2

adding ? and & to the group if necessary.

Here I was looking at the first .*? because I hadn't even noticed the second one. That, too, would benefit from being more closely constrained. I'd also overlooked the initial [^>]* which is more problematic as it really would affect execution speed: the RegEx captures everything up to the > and then has to backtrack: "Oh, whoops, I was supposed to pick up these other patterns along the way". Think about what might actually occur in this location ('class="classname" ' maybe?) and use it.
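
To illustrate why a non-final .*? can be iffy, here is a small sketch. The input string and the simplified patterns are invented for the example; the character class is the [\w/.]+ suggested above. A lazy quantifier stops at the first place where the rest of the pattern can match, which is not always the first closing quote.

```perl
use strict;
use warnings;

my $text = q{<a href='/one'>x</a> <a href='/two' target='_new'>y</a>};

# Lazy dot: the first quote is not followed by " target=", so .*? keeps
# growing until it finds a quote that is, crossing into the second tag.
my ($lazy) = $text =~ m#<a href='(.*?)' target=#s;

# Constrained class: it cannot cross a quote or a >, so the first tag
# simply fails and the match starts again at the second one.
my ($class) = $text =~ m#<a href='([\w/.]+)' target=#s;

print "lazy:  $lazy\n";
print "class: $class\n";
```

With this input the lazy capture spills across both tags, while the constrained one picks up only /two.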

csdude55

1:17 am on Dec 3, 2022 (gmt 0)

And wouldn't that mean that the loop never executes in the first place, even once, because the condition fails at the outset?

I'm lost.

In theory, there would never be an <a href without a closing </a>, so this shouldn't be an issue. If one did exist then the entire string would become a link, which would be a whole different type of problem! LOL

In practice, elsewhere in the script I convert

[ipsum.com...]

to:

<a href='https://www.ipsum.com' target='_new'>https://www.ipsum.com</a>

Then if the user modifies their post, I remove the tag and leave the contents before converting it again. If I don't then I end up with something like:

<a href='<a href='https://www.ipsum.com' target='_new'>https://www.ipsum.com</a>'><a href='https://www.ipsum.com' target='_new'>https://www.ipsum.com</a></a>

It's also not uncommon for a user to copy text from another site and paste it to my contenteditable, so I need to be prepared to strip their tags, too. These tags can include style, class, onWhatever, data-whatever, and it's not too uncommon to see made-up attributes! So it's easier to just strip the whole tag.

So this:

<a[^>]* href=(["']).*?\2[^>]*>(.*?)</a>

is supposed to catch anything that:

1. starts with <a

2. is optionally followed by anything

3. followed by a whitespace, then href=

4. followed by either a " or '

5. followed by anything until it gets to the matching " or '; this should be a URL, so :, /, ?, =, &, %, ;, and I guess any other punctuation that could be in a query string. And since it could come from another site, there's no guarantee that it's validated

6. followed by anything until it gets to the first >, which should mark the end of the tag

Then it should match anything until it gets to the first </a>, which should close the opening tag.

And it can forget everything except for that last match.
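
The steps above can be written out with the /x modifier so that each piece carries its own comment. This is a sketch under a few assumptions, not a drop-in replacement: Perl 5.10 or later, \s in place of the literal space (plain spaces are ignored under /x), a named group q for the quote so the backreference survives being embedded elsewhere, and hypothetical names $link_re, $text and $stripped. The condition is the same http / nofollow test from the first post.

```perl
use strict;
use warnings;

my $link_re = qr{
    <a                # 1. starts with <a
    [^>]*             # 2. optionally followed by anything inside the tag
    \s href=          # 3. a whitespace character, then href=
    (?<q>["'])        # 4. the opening quote, remembered as q
    .*?               # 5. anything up to ...
    \k<q>             #    ... the matching quote
    [^>]*             # 6. anything up to the first > ...
    >                 #    ... which ends the opening tag
    (?<repl>.*?)      # the link text, up to the first ...
    </a>              # ... closing tag
}xsi;

my $text = q{<a href='https://www.ipsum.com' target='_new'>https://www.ipsum.com</a>}
         . q{ and <a href='/about'>About us</a>};

# One pass: %+ in the replacement belongs to this very match, so the
# named group is available. Copy $& and $+{repl} before testing them,
# because the tests below are regex matches too.
(my $stripped = $text) =~ s{$link_re}{
    my ($tag, $inner) = ($&, $+{repl});
    ($inner =~ /http/i || $tag =~ /rel=["']nofollow/i) ? $inner : $tag
}ge;

print "$stripped\n";
```

Doing it in a single s///ge also sidesteps changing $_ while a while (m//g) loop is still scanning it; modifying the string resets the match position, so the loop would start over from the beginning after every replacement.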