Then I want a regex to modify every single link (<a href=...>) in it.
The result should look like
<a href="www.mysite.com/mining.cgi?http://www.link.com"
Image links should stay the same if they are already complete links. If one is relative, like img src="bg.jpg", then the whole site address should be prepended to it, like src = [somesite.com...]
I need the regex.
$string =~ s/href="([^"]+)"/work_on_link($1)/egis;
$string =~ s/src="([^"]+)"/work_on_image($1)/egis;
sub work_on_image {
my $url = shift;
if($url !~ m!^https?://!)
{
$url = 'http://www.example.com/images/' . $url;
}
return 'src="' . $url . '"';
}
sub work_on_link {
my $url = shift;
if($url =~ m!^https?://!)
{
$url = 'http://www.example.com/mining.cgi?' . $url;
}
return 'href="' . $url . '"';
}
You might want to check whether you can use regexps easily (e.g. that there are not too many cases you have to check so as not to mess things up, <link href=""> etc.); otherwise you'd have to use a tag parser and iterate through the document tree.
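To illustrate the <link href=""> concern, here is a minimal sketch that only touches href when it sits inside an <a ...> tag. The sub name rewrite_anchor_hrefs and the $PREFIX value are made up for the example, and it assumes double-quoted attribute values only.

```perl
use strict;
use warnings;

# Hypothetical proxy prefix, only for illustration.
my $PREFIX = 'http://www.example.com/mining.cgi?';

# Rewrite href only when it appears inside an <a ...> tag, so that
# <link href="..."> and other tags are left untouched.
# Assumption: attribute values are double-quoted.
sub rewrite_anchor_hrefs {
    my ($html) = @_;
    $html =~ s{(<a\b[^>]*?\bhref\s*=\s*")([^"]+)(")}{$1$PREFIX$2$3}gi;
    return $html;
}

my $in  = '<link href="style.css"><a class="x" href="http://www.link.com">go</a>';
my $out = rewrite_anchor_hrefs($in);
print "$out\n";
```

Once you need more special cases than this, a real tag parser is probably the safer route.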
instead of
href="http://......";
if it does not have double quotes around it..
Also if url is like
href=text.html
href="text.html" or like
href="/text.html" or like
href="../text.html" or like
href="./text.html";
Same with the src attribute too!
I have only gone through these cases, but there may be several other forms in which the href is written.
I'm sure there is some simple regular expression that deals with every href case.
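For the quoting variants listed above, one possible pattern is an alternation with one branch per style (double quotes, single quotes, no quotes). This is only a sketch: extract_urls is a hypothetical helper name, and it just collects the values rather than rewriting them.

```perl
use strict;
use warnings;

# Return every href/src value found in $html, whatever the quoting style.
sub extract_urls {
    my ($html) = @_;
    my @found;
    while ($html =~ m{\b(?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))}gi) {
        # Exactly one of the three capture groups is defined per match.
        my ($url) = grep { defined } ($1, $2, $3);
        push @found, $url;
    }
    return @found;
}

my @urls = extract_urls(
    q{<a href=text.html>a</a> <a href="/text.html">b</a> }
  . q{<a href='../text.html'>c</a> <img src = "./bg.jpg">}
);
print "$_\n" for @urls;
```

Whether the value is relative (text.html, /text.html, ../text.html) is a separate question from how it is quoted, so it is easier to deal with that after the value has been captured.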
I have the code here.
In this code
$url1 = $FORM{'URL'};
example : [yahoo.com...]
$html =~ s/href="([^"]+)"/work_on_link($1)/egis;
$html =~ s/src\s*=\s*"([^"]+)"/work_on_image($1)/egis;
sub work_on_image {
my $url = shift;
if($url !~ m!^https?://!)
{
$url =~ s!^/!!; # strip a leading slash so the join below does not produce "//"
$url = $url1 .'/'. $url;
}
return 'src="' . $url . '"';
}
sub work_on_link {
my $url = shift;
if($url =~ m!^https?://!)
{
$url = 'http://www.someurl.cgi?' . $url;
}
elsif($url =~ m!^(/)?://!)
{
$url = 'http://www.someurl.cgi?' . $url1 . $url;
}
else
{
$url = 'http://www.someurl.cgi?' . $url1 . $url;
}
return 'href="' . $url . '"';
}
#!/usr/bin/perl -w
use strict;
use URI;

my $string = join("", <DATA>);
print $string;

my $baseurl = 'http://www.example.com/adirectory/';

# Match href= or src= with double quotes, single quotes or no quotes at all.
$string =~ s/(href|src)=(?:"|'|)([^"'>\s]+)(?:"|'|)/work_on_link($1, $2, $baseurl)/egis;
print $string;

sub work_on_link {
my $context = shift;
my $url = shift;
print $url . "\n"; # debug output
my $baseurl = shift;
my $u1 = URI->new($baseurl);
my $u2 = URI->new($url);
my $u3 = $u2->abs($u1); # resolve relative URLs against the base
return $context . '="http://www.example.com/rewrite.cgi?' . $u3 . '"';
}

__DATA__
<a href="http://www.example.com">this is a link</a>
<a href="/mydir/">this is a link</a>
<a href="./file.htm">this is a link</a>
<a href="../anotherfile.htm">this is a link</a>
<a href=http://www.example.com>this is a link</a>
<a href=/mydir/>this is a link</a>
<a href=./file.htm>this is a link</a>
<a href=../anotherfile.htm>this is a link</a>
<img src="http://www.example.com">
<img src="/mydir/">
<img src="./file.gif">
<img src="../anotherfile.gif">
<img src=http://www.example.com>
<img src=/mydir/>
<img src=./file.gif>
<img src=../anotherfile.gif>
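To see what the abs() call does on its own, here is a small sketch that resolves the four URL forms from the test data against the same base URL. It needs the URI module from CPAN; the %abs hash is just a name chosen for this example.

```perl
use strict;
use warnings;
use URI;

my $base = URI->new('http://www.example.com/adirectory/');

# Resolve each form of URL against the base and remember the result.
my %abs;
for my $rel ('http://www.example.com', '/mydir/', './file.htm', '../anotherfile.htm') {
    $abs{$rel} = URI->new($rel)->abs($base)->as_string;
    print "$rel => $abs{$rel}\n";
}
```

An already absolute URL comes back unchanged, which is why a single callback can handle both cases.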
Here is what I am doing: I am downloading the page source using
system ( "/usr/bin/lynx -source '$QUERY' > link.html");
Then I open the HTML file and use a regex to change and display the web page.
open ("fh", "<link.html");
sysread("fh", my $html, 100000);
close("fh");
.......
.......
......
print($html);
How can I implement the code you provided? (I am really not very familiar with Perl.)
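#!/usr/bin/perl -w
use strict;
use URI;

my $QUERY = 'http://www.example.com/';
system ( "/usr/bin/lynx -source '$QUERY' > link.html");

# Read the downloaded page back in (at most 100000 bytes, as in your code).
open (my $fh, '<', 'link.html') or die "Cannot open link.html: $!";
sysread($fh, my $html, 100000);
close($fh);

my $baseurl = $QUERY;
# Match href= or src= with double quotes, single quotes or no quotes at all.
$html =~ s/(href|src)=(?:"|'|)([^"'>\s]+)(?:"|'|)/work_on_link($1, $2, $baseurl)/egis;
print $html;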
sub work_on_link {
my $context = shift;
my $url = shift;
print $url . "\n";
my $baseurl = shift;
my $u1 = URI->new($baseurl);
my $u2 = URI->new($url);
my $u3 = $u2->abs($u1);
return $context . '="http://www.example.com/rewrite.cgi?' . $u3 . '"';
}
should get you started. happy new year.
if there is a href like href="home.html"
it changes it to
[myproxy.com...]
but if there is a href like href="#home"
it changes it to
[myproxy.com...]
and from there on I start getting problems with the links.
I need to change how hrefs with # are handled in order to make my proxy work correctly.
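One way to deal with the # case is to leave same-page anchors alone and only send everything else through the proxy. A minimal sketch, assuming double-quoted attributes and a made-up proxy address in $PROXY; the URI->abs step from the earlier script is left out here to keep it short.

```perl
use strict;
use warnings;

# Hypothetical proxy address, standing in for the real one.
my $PROXY = 'http://www.example.com/rewrite.cgi?';

sub work_on_link {
    my ($context, $url) = @_;
    # Same-page anchors (href="#home") need no round trip through the proxy.
    return qq{$context="$url"} if $url =~ /^#/;
    return qq{$context="$PROXY$url"};
}

my $html = '<a href="#home">top</a> <a href="http://www.example.com/home.html">home</a>';
$html =~ s/(href|src)\s*=\s*"([^"]+)"/work_on_link($1, $2)/egi;
print "$html\n";
```

The same early return could be extended to other values that should not go through the proxy, if you run into them.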
It's not the JavaScript actually, it's the Java applet. But that's nothing to worry about. If it doesn't work, that will be fine, but I need to make the proxy work for normal HTML links.
Thanks for the help, dude! This forum has really helped me a lot!