Regex to convert URLs to links

         

csdude55

5:03 am on May 31, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've been using this code for several years, but I've had to make patch after patch to keep up with random glitches. I'm hoping maybe a fresh set of eyes can help me see where it could be improved?

# Usage: $text = links($text);

sub links {
require URI::Find;
URI::Find->import();

local($text) = @_;

## Remove existing link code

# Patch # 1, </a>&nbsp; wasn't being caught so I convert the &nbsp; to a whitespace
$text =~ s#</a>&nbsp;#</a> #g;

while ($text =~ m#(<a[^>]* href=(["']).*?\2[^>]*?>(.*?)</a>)#gi) {
$pattern = $1;
$repl = $3;
$pattern = quotemeta($pattern);

if ($repl =~ /http/i || $pattern =~ /rel=["']nofollow/i) {
$text =~ s#$pattern#$repl#gsi;
}
}

## Convert www to http://www

# Patch # 2, trying to auto-correct www.example.com to http://www.example.com
# one problem, there's one site that uses a link like
# http://www.example.com/foo/www.example.com/bar.pdf
# which becomes
# http://www.example.com/foo/http://www.example.com/bar.pdf
# so the link is broken

# lookbehind is fixed-length, so I have to use 2 lines to get both http and https
$text =~ s#(?<!http://)www\.([a-z])#http://www\.$1#gi;
$text =~ s#(?<!https://)www\.([a-z])#https://www\.$1#gi;

# maybe use a \b word boundary?
# s#\b(?<!https://)www\.([a-z])#https://www\.$1#gi;

## Create Links
$finder = URI::Find -> new(
sub {
($uri, $orig_uri) = @_;

$uri =~ s/&nbsp;*$//;
$orig_uri =~ s/&nbsp;*$//;

# Patch # 3, ignore images from my site so I don't end up with
# <img src="<a href='blah' target='_new'>blah</a>">
if ($orig_uri =~ /$home(.*?)\.(jpg|jpeg|png|gif|bmp)$/i &&
$orig_uri !~ /cache/i) { return $orig_uri; }

else {
# Remove utm_, then trailing ? or &
$uri =~ s#utm_\w+=[^&]+(&(amp;)*)*##gi;
$uri =~ s#[?&]$##;

$orig_uri =~ s#utm_\w+=[^&]+(&(amp;)*)*##gi;
$orig_uri =~ s#[?&]$##;

# turn long links into something like https://www.example.com/.../bar/
if (length($orig_uri) > 40) {
$orig_uri = substr($orig_uri, 0, 27) . '...' . substr($orig_uri, -10);
}

return "<a href='$uri' target='_blank'>$orig_uri</a>";
}
}
);
$finder -> find(\$text);

# Patch # 4, fix nested links
$text =~ s#<a([^>]*?) title=(["'])*.*?\2 ([^>]*?)>#<a$1 $3>#gi;
$text =~ s#<a([^>]*?) href=(["']*)<a href=["']*(.*?)["']* target=["']*_new["']*>.*?</a>["']*([^>]*?)>(.*?)</a>#<a href=$2$3$2$4$1>$5</a>#gi;

# Patch # 5, fix images
# maybe modify # 3 and remove $home(.*?) ? I'm not sure why they're separated
$text =~ s#<img([^>]*)src=["']*\s*<a[^>]*[ ]href=["']*\s*([^'"]*)["']*[^>]*>([^<]*)</a>['"]*([^>]*)>#<img src="$2"$1$4>#gi;

# Patch # 6, I occasionally see https://http// and I'm not sure why
$text =~ s#https://http:*//#https://#gi;

return $text;
}


Any thoughts on how I could improve those 6 patches, or any other ways I could make this simpler / faster / better?
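
One observation on Patch # 2, offered as a hand trace rather than a diagnosis of the live site: run back to back exactly as posted, the two www substitutions do not seem to cancel each other out. After the first line adds http://, the second line's lookbehind only checks for https://, so it fires on the same www and inserts a second scheme. The short script below reuses the two lines verbatim on a sample string (the sample input is an assumption), and may be related to the doubled schemes that Patch # 6 cleans up.

```perl
use strict;
use warnings;

# Sample input (hypothetical); the two substitutions are copied from Patch # 2.
my $text = 'www.example.com';

# First pass: no "http://" in front of www, so a scheme is added.
$text =~ s#(?<!http://)www\.([a-z])#http://www\.$1#gi;

# Second pass: the text in front of www is now "http://", which is not
# "https://", so the negative lookbehind succeeds and a second scheme goes in.
$text =~ s#(?<!https://)www\.([a-z])#https://www\.$1#gi;

print "$text\n";    # prints: http://https://www.example.com
```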

lucy24

6:06 am on May 31, 2018 (gmt 0)




I'm reminded of a bit of boilerplate that we used to have to drag out fairly often in the Apache subforum, about how you should start by explaining in English what you're trying to do, rather than jump directly into the code. What's the input you are working with, and what's the desired output?

lookbehind is fixed-length, so I have to use 2 lines to get both http and https
Is this really necessary? How often in actual practice will ://www\. be preceded by anything other than http or https? In any case can't you use pipes? The RegEx engine I spend most time with will happily accept (?<=http://|https://). Sometimes it's simpler just to capture: a simple (https?://) will save you several bytes.
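
A minimal sketch of the "just capture it" idea in Perl, assuming the goal is only to add http:// where no scheme is present. The sub name add_scheme and the one-character lookbehind guard are illustrative choices, not part of the original links() code.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix bare "www." hosts with http://, and leave
# anything that already has http:// or https:// untouched.
sub add_scheme {
    my ($text) = @_;
    # (?<![\w/.-])   the match may not start inside a longer word, path or URL
    # (https?://)?   optional scheme; $1 stays undefined when it is absent
    # (www\.)        the www. prefix as typed, kept in $2
    $text =~ s{(?<![\w/.-])(https?://)?\b(www\.)(?=[a-z])}
              {($1 // 'http://') . $2}gie;
    return $text;
}

print add_scheme('see www.example.com and https://www.example.org'), "\n";
```

Because the scheme is captured rather than tested with a lookbehind, one substitution covers both http and https, and a www that sits inside a path (as in the /foo/www.example.com/bar.pdf case from Patch # 2) is skipped by the guard.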

csdude55

12:18 am on Jun 1, 2018 (gmt 0)




Grrr, I spent 20 minutes typing up a reply, and somehow closed the tab and lost it all >:-( Didn't it used to pop up with an alert so you wouldn't do that?

Anyway...

I'm reminded of a bit of boilerplate that we used to have to drag out fairly often in the Apache subforum, about how you should start by explaining in English what you're trying to do, rather than jump directly into the code. What's the input you are working with, and what's the desired output?

Haha, Lucy, that's what I get for posting at 1am... I thought it WAS in plain English! LOL

I have user-submitted text (saved as $text), and I'm converting apparent links within it to <a href=...>...</a>. So this:

Have you been to www.example.com?

becomes:

Have you been to <a href='http://www.example.com' target='_blank'>www.example.com</a>?

In addition, if the link is to an image on my site then I convert it to <img src=...> instead of the <a href...>, but if it's linking off-site then I stick with the <a href...>

I wrote the original code sometime in the early 2000s, and over the years I've had glitches pop up here and there... which is why I made patches 4-6; it was easier to fix the problem after the fact instead of figuring out what caused it in the first place. But now I'm rebuilding everything, so this is the time to try to fix the source instead of patching it :-)

Is this really necessary? How often in actual practice will ://www\. be preceded by anything other than http or https? In any case can't you use pipes? The RegEx engine I spend most time with will happily accept (?<=http://|https://). Sometimes it's simpler just to capture: a simple (https?://) will save you several bytes.

This section is where I add the protocol to www.example.com. I use the lookbehind to see if it doesn't begin with http:// or [,...] and if not then I'll plug in [....]

I remember trying this:

(?<!https*://)


I don't remember exactly what happened, it's just in my notes that "it didn't work". I remember researching and finding where someone had said that lookbehind was fixed-length, though, so you can't use the * like that.

I don't recall whether I tried using (?<!http://|https://), though, that might be a good fix for that :-)

In your second example, what does the ? in (https?://) do? I've used ?: to make an inner set of parentheses non-capturing, but I'm not following the purpose in this one.

lucy24

1:45 am on Jun 1, 2018 (gmt 0)




I don't remember exactly what happened, it's just in my notes that "it didn't work".
Sometimes we would then read you the riot act for simply saying “It doesn’t work” without further explanation ;) but here I am fully prepared to believe it, because lookbehinds have to be of fixed length. I'm not sure if this is universally true of all RegEx engines, but it applies to all that I've used. (We won't talk about javascript, which doesn't support lookbehinds at all. Ugh.)

what does the ? in (https?://) do?
It makes the s optional. It's what you would have said in the lookbehind if you were allowed to do so, darn it. You may have been led astray by the ?: sequence, since (?:blahblah) would have special meaning, but in all other environments a colon is just a colon--here, the literal colon in the full URL.
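
A small Perl illustration of the two different jobs the question mark does here; the sample URLs and variable names are made up for the example.

```perl
use strict;
use warnings;

# "s?" is a quantifier: zero or one "s", so one pattern covers both schemes.
my @schemes;
for my $url ('http://example.com', 'https://example.com', 'ftp://example.com') {
    push @schemes, $url =~ m{^(https?)://} ? $1 : 'no match';
}

# "(?:...)" is a group that does not capture, so the first capture here is
# the host name after an optional "www.", not the "www." itself.
my ($host) = 'http://www.example.com/' =~ m{^https?://(?:www\.)?([^/]+)};

print "@schemes\n";    # prints: http https no match
print "$host\n";       # prints: example.com
```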

So let's reconstruct. In the beginning you had the simple matter of meeting
http://www.blahblah in posts, so you dutifully converted
(https?://\S+)
into
<a href = "$1" target = "_blank">$1</a>

And then you realized that some of those posts happened to have sentence-final punctuation immediately afterward, leading to bad links, so it became
(https?://\S+\w)
>>
<a href = "$1" target = "_blank">$1</a>

And then it turned out some people weren't saying the http:// part, which is where the lookbehind comes in, because thanks to that pesky SSL you can no longer postulate http, so now for the others you have to go through a second loop of
(?<!://)(www\.\S+\w)
>>
<a href = "http://$1" target = "_blank">$1</a>
(and if it's really an https site it will just jolly well have to redirect, because you can't be expected to do everything) ... except for the sitenames that don't start in www, which is another loop unless you decide, reasonably enough, that there's a limit to auto-linking and you really can't be bothered ...
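
The steps so far could be folded into a single pass, roughly as sketched below in Perl. This is an illustration under assumptions, not the original links() sub: autolink is a hypothetical name, the input is taken to be plain text with no existing HTML, and nothing is HTML-escaped.

```perl
use strict;
use warnings;

# Hypothetical helper: link explicit http(s) URLs and bare www. hosts in one pass.
sub autolink {
    my ($text) = @_;
    $text =~ s{
        (?<![\w/.-])            # do not start inside a longer word or URL
        (                       # the URL exactly as typed
            https?://\S+\w      #   explicit scheme, must end in a word character
          | www\.\S+\w          #   bare www. host, same ending rule
        )
    }{
        my $shown = $1;
        # Bare hosts get http:// and are left to redirect if the site is https.
        my $href  = $shown =~ m{^https?://}i ? $shown : "http://$shown";
        qq{<a href="$href" target="_blank">$shown</a>};
    }gxie;
    return $text;
}

print autolink('Have you been to www.example.com?'), "\n";
```

Doing both cases in one substitution means the second pattern never gets a chance to re-match text inside markup produced by the first, which is one way nested links can arise with separate loops.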

And THEN you remembered about links to images on your site, meaning that all of the above have to be preceded by
(https?://)?((?:www\.)?example\.com/)([\w/-]+\.(jpg|gif|png))
>>
<img src = "/$3">

which would then be followed by code for ordinary links to pages on your own site, again stripped down to leading / and .... Oh, lord, we haven't even got through all your patches yet, have we. Which ones did I overlook?
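
As a rough Perl sketch of the image step, assuming example.com stands in for the site's own host and that only the extensions listed are wanted. The names $own_image and inline_own_images are hypothetical.

```perl
use strict;
use warnings;

# Assumption: example.com is a placeholder for the site's own host name.
my $own_image = qr{
    (?<![\w/.-])                        # not in the middle of another host or path
    (?:https?://)? (?:www\.)? example\.com
    ( / [\w/-]+ \. (?:jpe?g|gif|png) )  # capture 1: site-relative image path
    \b
}xi;

# Hypothetical helper: turn links to the site's own images into <img> tags.
sub inline_own_images {
    my ($text) = @_;
    $text =~ s{$own_image}{<img src="$1">}g;
    return $text;
}

print inline_own_images('look: http://www.example.com/pics/cat.jpg'), "\n";
```

Running this before any generic link conversion means the image URL is already gone from the text by the time links are created, so no <a> tag can end up inside the src attribute.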

We begin to understand why it's so tempting to use prefab forum/discussion software.

NickMNS

2:58 am on Jun 1, 2018 (gmt 0)




I just finished writing a bunch of regex. It is super annoying; it's like coding in Egyptian hieroglyphs. I found this site [regex101.com...], which is an online regex IDE, so you can test your code as you go. I found it super helpful. There are also explanations describing in English what the typed command does.