Regex to convert URLs to links

I've been using this code for several years, but I've had to add patch after patch to keep up with random glitches. I'm hoping a fresh set of eyes can help me see where it could be improved.

# Usage: $text = links($text);

sub links {
 require URI::Find;
 URI::Find->import();

 my ($text) = @_;

 ## Remove existing link code 

 # Patch # 1, </a>&nbsp; wasn't being caught so I convert the &nbsp; to a whitespace
 $text =~ s#</a>&nbsp;#</a> #g;

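 while ($text =~ m#(<a[^>]* href=(["']).*?\2[^>]*?>(.*?)</a>)#gi) {
  my $link = $1;
  my $repl = $3;
  my $pattern = quotemeta($link);

  # Test the raw link here: quotemeta() escapes = and quotes,
  # so the nofollow check would never match $pattern
  if ($repl =~ /http/i || $link =~ /rel=["']nofollow/i) {
   $text =~ s#$pattern#$repl#gsi;
  }
 }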

 ## Convert www to http://www

 # Patch # 2, trying to auto-correct www.example.com to http://www.example.com
 # one problem, there's one site that uses a link like
 # http://www.example.com/foo/www.example.com/bar.pdf
 # which becomes
 # http://www.example.com/foo/http://www.example.com/bar.pdf
 # so the link is broken

 # lookbehind is fixed-length, so I have to use 2 lines to get both http and https
 $text =~ s#(?<!http://)www\.([a-z])#http://www\.$1#gi;
 $text =~ s#(?<!https://)www\.([a-z])#https://www\.$1#gi;

 # maybe use a \b word boundary?
 # s#\b(?<!https://)www\.([a-z])#https://www\.$1#gi;

 ## Create Links 
 my $finder = URI::Find->new(
  sub {
   my ($uri, $orig_uri) = @_;

   # Strip any run of trailing &nbsp; entities
   $uri =~ s/(?:&nbsp;)+$//;
   $orig_uri =~ s/(?:&nbsp;)+$//;

   # Patch # 3, ignore images from my site so I don't end up with 
   # <img src="<a href='blah' target='_new'>blah</a>">
   if ($orig_uri =~ /$home(.*?)\.(jpg|jpeg|png|gif|bmp)$/i &&
     $orig_uri !~ /cache/i) { return $orig_uri; }

   else {
    # Remove utm_, then trailing ? or &
    $uri =~ s#utm_\w+=[^&]+(&(amp;)*)*##gi;
    $uri =~ s#[?&]$##;

    $orig_uri =~ s#utm_\w+=[^&]+(&(amp;)*)*##gi;
    $orig_uri =~ s#[?&]$##;

    # turn long links into something like https://www.example.com/.../bar/
    if (length($orig_uri) > 40) {
     $orig_uri = substr($orig_uri, 0, 27) . '...' . substr($orig_uri, -10);
    }

    return "<a href='$uri' target='_blank'>$orig_uri</a>";
   }
  }
 );
 $finder -> find(\$text);

 # Patch # 4, fix nested links
 $text =~ s#<a([^>]*?) title=(["'])*.*?\2 ([^>]*?)>#<a$1 $3>#gi;
 $text =~ s#<a([^>]*?) href=(["']*)<a href=["']*(.*?)["']* target=["']*_(?:new|blank)["']*>.*?</a>["']*([^>]*?)>(.*?)</a>#<a href=$2$3$2$4$1>$5</a>#gi;

 # Patch # 5, fix images
 # maybe modify # 3 and remove $home(.*?) ? I'm not sure why they're separated
 $text =~ s#<img([^>]*)src=["']*\s*<a[^>]*[ ]href=["']*\s*([^'"]*)["']*[^>]*>([^<]*)</a>['"]*([^>]*)>#<img src="$2"$1$4>#gi;

 # Patch # 6, I occasionally see https://http// and I'm not sure why
 $text =~ s#https://http:*//#https://#gi;

 return $text;
}
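
For the www step, a single fixed-length lookbehind may be enough. Both http:// and https:// put a slash directly before www, so rejecting a preceding slash covers both schemes in one substitution, and it also leaves alone a www. that sits inside a path (the bar.pdf case in Patch # 2). It may also account for the doubled scheme that Patch # 6 cleans up: in https://www.example.com the seven characters before www are ttps://, so the (?<!http://) test passes and http:// gets inserted a second time. This is only a minimal sketch; add_scheme is a hypothetical helper name, and defaulting to http:// rather than https:// is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix bare "www." hosts with a scheme.
sub add_scheme {
    my ($text) = @_;

    # One-character negative lookbehind: skip any "www." that directly
    # follows a slash (it already has a scheme, or it sits in a path),
    # a dot, a hyphen, an @, or a word character (part of a longer name).
    $text =~ s{(?<![\w/.\@-])www\.(?=[a-z0-9])}{http://www.}gi;

    return $text;
}

print add_scheme("Visit www.example.com or https://www.example.com/a"), "\n";
```

Only the bare www.example.com in that sample line should gain a scheme; the URL that already starts with https:// is left as it is.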

Any thoughts on how I could improve those 6 patches, or any other ways I could make this simpler / faster / better?
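
The cleanup inside the URI::Find callback runs the same two substitutions on both $uri and $orig_uri, so it could probably be factored into small helpers. A rough sketch under stated assumptions: strip_utm and shorten are hypothetical names, the lookbehind for ?, & or ; is an added guard so that a parameter such as my_utm_x= is not touched, and the 40 / 27 / 10 lengths are copied from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: drop utm_* parameters, then any dangling ? or &.
sub strip_utm {
    my ($url) = @_;

    # Match only a parameter that starts right after ?, & or the ; of &amp;
    # and remove it together with the separator that follows it, if any.
    $url =~ s{(?<=[?&;])utm_\w+=[^&]*(?:&(?:amp;)?)?}{}gi;

    # Remove a separator left hanging at the end of the URL.
    $url =~ s{(?:[?&]|&amp;)$}{};

    return $url;
}

# Hypothetical helper: shorten the visible link text of long URLs.
sub shorten {
    my ($label, $max) = @_;
    $max //= 40;    # default limit, same as the original check
    return $label if length($label) <= $max;
    return substr($label, 0, 27) . '...' . substr($label, -10);
}

print strip_utm('http://www.example.com/?id=5&utm_source=feed'), "\n";
```

In the callback, $uri would go through strip_utm only, while $orig_uri would go through strip_utm and then shorten, which keeps the href intact and trims just the visible text.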
