Regex Help Needed

Matching <a href=>link</a>

Birdman

11:23 am on Mar 7, 2007 (gmt 0)

Hello,

Does anyone have a regex handy to match whole anchors?

I am trying to create a script that parses text and I need to match all <a href>link</a> tags.

The main problem is that the links are going to be in different formats and have other attributes.

Examples:

<a href='htp://site.com' target='_blank'>link</a>
<a target='_blank' href='htp://site.com' >link</a>
<a href="htp://site.com" target="_blank">link</a>
etc...

I have searched high and low for this but have yet to find what I need.

Thanks in advance!

phranque

11:48 am on Mar 7, 2007 (gmt 0)

this might work:

<[aA] .*<\/[aA]>

might not cover all weirdness and doesn't help when anchor tags are split by newline...

adb64

12:26 pm on Mar 7, 2007 (gmt 0)

Is the solution provided by phranque enough, or do you also need the information for each tag like the link text and all possible attributes?

phranque

1:37 pm on Mar 7, 2007 (gmt 0)

<[aA] .*<\/[aA]>

i'll break it down for you:
< - the pattern starts with a '<'
[aA] - followed by upper or lower case 'a', then a space
.* - followed by zero or more of any character (except a newline)
< - followed by a '<'
\/ - followed by a '/' (escaped by the '\')
[aA] - followed by upper or lower case 'a'
> - followed by a '>'

in short, "<a (some stuff)</a>"

i just now realized while reviewing this that the ambiguous, greedy and promiscuous nature of using ".*" means if you have two anchors on the same line it will include both sets in one pattern match.
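To make the greedy behaviour concrete, here is a minimal sketch in Python rather than Perl (the pattern itself is unchanged, and Python's `re` treats it the same way as far as I know). The input line and the variable names are made up for illustration.

```python
import re

# phranque's first pattern, unchanged
GREEDY = re.compile(r"<[aA] .*<\/[aA]>")

# Hypothetical one-line input containing two anchors
line = "<a href='one.html'>one</a> and <a href='two.html'>two</a>"

greedy_matches = GREEDY.findall(line)

# ".*" runs to the end of the line and backtracks only as far as the
# LAST "</a>", so both anchors come back as a single match.
print(len(greedy_matches))        # 1
print(greedy_matches[0] == line)  # True
```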

if you have no tags in your anchor text, you could fix this by using:

<[aA] [^<]*<\/[aA]>

the difference is:
[^<]* - this means zero or more characters that are not a '<'
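A small Python sketch of the difference, under the same assumption that the anchor text contains no tags. The sample strings are hypothetical. The lazy `.*?` variant at the end is not from the thread; it is one common alternative when the anchor text may contain other tags.

```python
import re

# Negated-class version: stop at the first '<' after the opening tag
NEGATED = re.compile(r"<[aA] [^<]*<\/[aA]>")

line = "<a href='one.html'>one</a> and <a href='two.html'>two</a>"
separate = NEGATED.findall(line)
print(separate)
# ["<a href='one.html'>one</a>", "<a href='two.html'>two</a>"]

# Limitation: a tag inside the anchor text stops [^<]* too early,
# so the anchor is not matched at all.
nested = "<a href='x.html'><b>bold</b></a>"
print(NEGATED.search(nested))  # None

# Alternative (an assumption, not phranque's pattern): a lazy quantifier
LAZY = re.compile(r"<[aA] .*?<\/[aA]>")
print(LAZY.findall(nested))    # ["<a href='x.html'><b>bold</b></a>"]
```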

if you want to use this in a loop on found patterns you could try something like this:

while ($yourstring =~ m/(<\s*a\s[^<]*<\s*\/a\s*>)/igs) {
    print "$1\n";    # do something with $1, the matched anchor
}

this pattern should let you get sloppy with case and blank/tab usage.
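For anyone following along in another language, the loop might look roughly like this in Python, using a pattern with a single leading `<`. The sample text is invented to exercise mixed case, tabs and a newline inside the tag.

```python
import re

# Tolerant of case and of blanks/tabs around the tag name and the slash
ANCHOR = re.compile(r"<\s*a\s[^<]*<\s*\/a\s*>", re.IGNORECASE | re.DOTALL)

# Hypothetical input: upper case, a tab, and an anchor split by a newline
text = "<A HREF='one.html'>one</A>\n< a\thref='two.html'\n>two< /a >"

# finditer() plays the role of the while (... =~ m/.../g) loop
found = [m.group(0) for m in ANCHOR.finditer(text)]

for anchor in found:
    print(repr(anchor))

print(len(found))  # 2
```

Note that the split anchor matches because `[^<]` accepts newlines on its own. The `s` (DOTALL) flag only changes what `.` matches, and this pattern has no `.` in it.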

Birdman

5:08 pm on Mar 7, 2007 (gmt 0)

Thanks for the replies!

I think I left some info out of my first post.

I need to backreference the URL too.

Here's what I'm doing:

It's actually a PHP script using preg_replace_callback() but I posted here since it's a regex question.

I am parsing RSS text files and removing links, but only after comparing them to an "allowed URL" list.

Quick question: Would I be better suited to just use HTML::TokeParser?
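The callback idea described above can be sketched as follows. This is Python's `re.sub` with a function, which I believe is the closest analogue to PHP's preg_replace_callback(); it is not the actual script. The allow-list contents, the pattern, the helper names, and the choice to keep the link text when a link is removed are all assumptions.

```python
import re
from urllib.parse import urlparse

# Hypothetical allow-list of host names
ALLOWED_HOSTS = {"site.com"}

# Group 1: the opening quote, group 2: the URL, group 3: the link text
ANCHOR = re.compile(
    r"""<a\s[^>]*?href\s*=\s*(["'])(.*?)\1[^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)

def filter_link(match):
    """Keep the anchor if its host is allowed, else keep only its text."""
    url, link_text = match.group(2), match.group(3)
    host = urlparse(url).hostname or ""
    if host in ALLOWED_HOSTS:
        return match.group(0)
    return link_text

def strip_disallowed_links(text):
    return ANCHOR.sub(filter_link, text)

sample = ('<a href="http://site.com/a" target="_blank">ok</a> '
          "<a target='_blank' href='http://other.example/b'>nope</a>")
cleaned = strip_disallowed_links(sample)
print(cleaned)
# <a href="http://site.com/a" target="_blank">ok</a> nope
```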

phranque

1:38 am on Mar 8, 2007 (gmt 0)

I need to backreference the URL too.

not sure if php uses any different regexp syntax, but
this might do it:

while ($yourstring =~ m/(<\s*a\s[^>]*href\s*=\s*['"]([^'"]*)['"][^<]*<\s*\/a\s*>)/igs) {
    my ($anchor, $url) = ($1, $2);
    # $anchor holds the whole anchor tag; $url holds only the href value
}

lexipixel

1:43 am on Mar 8, 2007 (gmt 0)

This might help with HTML::TokeParser
It's a snip from a UA I wrote to extract just the base URLs (e.g. www.domain.tld from <a href="http://www.domain.tld/somedir/subdir/pagename.xyz?s=x" param="data" etc="something" BLAH BLAH >).

$html = "(contents of an entire HTML file)";

use HTML::TokeParser;

$p = HTML::TokeParser->new(\$html);

while (my $token = $p->get_token()) {
    # for a start tag, [1] is the tag name and [4] is the original tag text
    my $tokenType = $token->[1] || "";
    my $tokenText = lc($token->[4] || "");

    if (lc($tokenType) eq "a") {
        if (index($tokenText, "http://") != -1) {
            # strip everything around the URL, leaving host/path first
            $tokenText =~ s/<a//g;
            $tokenText =~ s/href=//g;
            $tokenText =~ s/\"//g;
            $tokenText =~ s|http://||g;
            $tokenText =~ s/\>//g;
            @split_text = split (' ',$tokenText);
            @split_url = split ('/',$split_text[0]);
            #
            # compare $split_url[0] to your allowed list
            # and do what you like with the data.
            #
        }
    }
}

NOTE: check the WebmasterWorld pipe replacement...

BTW - if the files you want to parse are online (not on your local machine), simply marry up LWP::UserAgent to HTML::TokeParser
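For comparison, the same tokenizer-based idea can be sketched with Python's standard-library `html.parser`. This is not a port of the Perl snippet above: it reads the parsed href attribute instead of trimming the tag text. The class name and sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkHostCollector(HTMLParser):
    """Collect the host name of every absolute <a href> in a document."""

    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                host = urlparse(value).hostname
                if host:  # relative links have no host
                    self.hosts.append(host)

html = ('<p><A HREF="http://www.domain.tld/somedir/subdir/pagename.xyz?s=x" '
        'param="data">x</A> <a href=\'/relative\'>y</a></p>')

collector = LinkHostCollector()
collector.feed(html)
collector.close()
hosts = collector.hosts
print(hosts)  # ['www.domain.tld']
```

Each entry in `hosts` could then be compared to an allowed list, much as `$split_url[0]` is in the Perl version.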

Birdman

9:12 pm on Mar 9, 2007 (gmt 0)

Thanks for the replies everyone. Sorry I haven't posted back yet, been super busy.

Well, I ended up sticking with the regex, rather than HTML::TokeParser.

Here is what I came up with:

'/<a[^>]*\shref=["\'][^"\'](.*)["\']*\s?>.*?<\/a>/si'

Pretty lengthy, eh? It seems to work well though. If anyone is looking, I'd appreciate any tips on shortening it if it can be done.

Thanks again!
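One thing worth checking before shortening the pattern is what its capture group actually returns. The Python sketch below runs the pattern exactly as posted (minus the PHP delimiters) on a made-up anchor, then tries a tighter variant. The variant is only a suggestion and assumes the URL is always quoted with matching quotes.

```python
import re

# The pattern as posted, with the /si flags expressed as re.S | re.I
FINAL = re.compile(
    r"""<a[^>]*\shref=["\'][^"\'](.*)["\']*\s?>.*?<\/a>""", re.S | re.I
)

# Hypothetical sample anchor
sample = '<a href="http://site.com" target="_blank">link</a>'

captured = FINAL.search(sample).group(1)
print(captured)
# ttp://site.com" target="_blank"
# [^"\'] swallows the first character of the URL, and the greedy (.*)
# runs on through the remaining attributes.

# Suggested variant (an assumption): remember the opening quote in
# group 1, capture lazily up to the matching quote in group 2.
TIGHTER = re.compile(r"""<a\s[^>]*?href=(["'])(.*?)\1[^>]*>.*?</a>""", re.S | re.I)

url = TIGHTER.search(sample).group(2)
print(url)  # http://site.com
```

In the variant the URL moves to group 2, so a callback would need to read that group instead of group 1.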