Thanks for bearing with me. Trouble is, I need ALL the data between the <a> and </a> tags to be extracted, not just the address. For me, why feeding
<a href="/reviews/hardware/e1405.ars">Dell Inspiron e1405 laptop</a>
would give
(a, href, "/reviews/hardware/e1405.ars")
is a real stumper. Where is the title? So, I figure I'm either using the wrong class or I have an option disabled.
Is there any way to get it to give me the raw <a> </a> block?
If I give some utility a raw block, can I get out all of its features, no matter how arbitrary?
Another example:
<a href="../index.html"><img width="425" height="50" border="0"
src="../i/perldoc_banner.gif" alt="Welcome to Perldoc.com"></a>
gives
(a, href, "../index.html")
(img, src, "../i/perldoc_banner.gif")
...2 links? And nothing for the 'alt' attribute? That's way off, for me anyway. I don't see how to programmatically guarantee, just looking at those two links, that they are in fact the same link as seen in a web browser. I don't want to just arbitrarily associate a link with the nearest image, though I absolutely want to account for image links.
Is there a way to just get a dictionary (key-value pairs) of everything between the A tags?
Anyway, thanks a lot for reading all this. I'm hoping I'm just missing something, since I feel "almost there". It would be really sweet to leverage all this latent perlness and move on, without...well, you know...reinventing the...something... :)
I'm not a Perl person, so you may have to adapt this a little, but here goes.
The following piece of PHP seems to do what I think you want:
$pattern = "/a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>/";
preg_match_all($pattern, file_get_contents("http://www.google.com"), $matches);
var_dump($matches[1],$matches[2]);
Andrew
<a href=http://images.google.com/imghp?hl=en><img src=/intl/en_ALL/images/images_res.gif width=150 height=58 alt="Go to Google Image Search Home" border=0 vspace=12></a>
I am really hoping this tool already exists, because it is a perfect illustration of the kind of hair-splitting I really don't want to be involved in. Every time I come up with a possible rule for my string scanner I find an exception to that rule. Why can't HTML just be easy to dissect?
I understand it can be annoying to try and find ways of adapting to other people's 'eccentric' use of HTML, but as far as I can see, the only way of stopping the following RegEx from finding every a/href in a web page is to code the page so badly that no browser could parse it.
$pattern = "/<a[\s]+[^>]*?href[\s]?\=[\s\"\']*([\w:?=@&\/#._;-]+)[\s\"\']*.*?\>([^<]+|.*?)?<\/a>/";
preg_match_all($pattern, file_get_contents("http://www.google.com"), $matches);
echo htmlentities(print_r($matches,true));
I also found this link that may be useful to you. [webmasterworld.com ]
Andrew
<a href=http://images.google.com/imghp?hl=en><img src=/intl/en_ALL/images/images_res.gif width=150 height=58 alt="Go to Google Image Search Home" border=0 vspace=12>Image Home</a> into
(a,
(href, http://images.google.com/imghp?hl=en),
(img, (src, /intl/en_ALL/images/images_res.gif), (width, 150), (height, 58), (alt, "Go to Google Image Search Home"), (border, 0), (vspace, 12)),
Image Home) ?
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtractor;
use LWP::Simple qw(get);

# get a page to test
my $page = shift || "search.cpan.org/recent";
my $html = get("http://$page");
die "Could not fetch http://$page\n" unless defined $html;

# set up the parser
my $LX = HTML::LinkExtractor->new();
$LX->strip(1);    # just anchor text, not entire tag
$LX->parse(\$html);

# print anchor text and href
for my $Link (@{ $LX->links }) {
    my $tag = $$Link{tag};

    # only regular links
    next unless $tag eq 'a';
    my $href = $$Link{href};
    my $text = $$Link{_TEXT};
    print $text, " -> ", $href, "\n";
}
undef $LX;
In response to your question, no, my script will turn:
<a href=http://images.google.com/imghp?hl=en><img src=/intl/en_ALL/images/images_res.gif width=150 height=58 alt="Go to Google Image Search Home" border=0 vspace=12>Image Home</a> into
Array
(
    [0] => Array
        (
            [0] => <a href=http://images.google.com/imghp?hl=en><img src=/intl/en_ALL/images/images_res.gif width=150 height=58 alt="Go to Google Image Search Home" border=0 vspace=12>Image Home</a>
        )

    [1] => Array
        (
            [0] => [images.google.com...]
        )

    [2] => Array
        (
            [0] => <img src=/intl/en_ALL/images/images_res.gif width=150 height=58 alt="Go to Google Image Search Home" border=0 vspace=12>Image Home
        )

)
The first array contains the full match, the second the url and the third the string from in between the anchor tags.
Hopefully Mechanize will do what you need it to. I could continue to develop the RegEx, but the longer it gets the slower it gets, and you may find a Perl extension faster.
Andrew
Thanks again but I needed the attributes from within the tags. Getting the tags is the easy part.
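For the "attributes from within the tags" part, a token-level parser may be a closer fit than a regex or a link-only extractor. Here is a minimal sketch, assuming HTML::TokeParser (part of the HTML-Parser distribution on CPAN) is installed. The helper name extract_links and the hash keys attr, text and children are made up for illustration and do not come from any module. It walks every token between <a> and </a>, so an <img> inside a link stays attached to that link instead of being reported as a separate one.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of any module): returns one hash ref per
# <a>...</a> block, holding the anchor's own attributes, the text found
# between the tags, and the attributes of every tag nested inside it.
sub extract_links {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot set up parser\n";
    my @links;

    # get_tag('a') skips ahead to the next <a ...> start tag and returns
    # [ $tag, \%attr, \@attrseq, $raw_text ]
    while (my $anchor = $p->get_tag('a')) {
        my %link = (attr => $anchor->[1], text => '', children => []);

        # consume tokens until the matching </a> (or end of document)
        while (my $t = $p->get_token) {
            my $type = $t->[0];
            last if $type eq 'E' && $t->[1] eq 'a';
            if ($type eq 'S') {
                # nested start tag such as <img>: keep its name and attributes
                push @{ $link{children} }, { tag => $t->[1], attr => $t->[2] };
            }
            elsif ($type eq 'T') {
                # plain text between the tags
                $link{text} .= $t->[1];
            }
        }
        push @links, \%link;
    }
    return @links;
}

# the image link from earlier in this thread
my $sample = '<a href=http://images.google.com/imghp?hl=en>'
  . '<img src=/intl/en_ALL/images/images_res.gif width=150 height=58 '
  . 'alt="Go to Google Image Search Home" border=0 vspace=12>Image Home</a>';

for my $link (extract_links($sample)) {
    print "href: $link->{attr}{href}\n";
    print "text: $link->{text}\n";
    for my $child (@{ $link->{children} }) {
        my $attr = $child->{attr};
        print "  <$child->{tag}> ",
          join(', ', map { "$_=$attr->{$_}" } sort keys %$attr), "\n";
    }
}
```

Each returned hash is effectively the dictionary of everything between the A tags, so the alt text of an image link travels with its href. This is only a sketch: an <a> that is never closed will swallow tokens up to the end of the document, so badly broken pages would need extra handling.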