I have a Perl script that uses regular expressions to parse a webpage and return results from the page. There are 20 items that I am parsing for. All was going fine except that one item was not playing nice: it was showing the value for the very first item that I was searching for. Eventually I realized that the items on the page had changed order, and that was the problem. After fixing the order problem, all 20 values showed up correctly.
Here is the problem. This page will change the order of the items in my list on a regular basis. I need a way to keep using just one script and returning all of the values without any extra cruft messages. Here is a sample of the script I am using for weather graphing. (Graphing is a side item; the script just returns values, just like the script I am having trouble with.)
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
my $httpaddr = "http://www.aws.com/aws_2001/asp/obsForecast.asp?id=WISHT";
my %data;
my %trash;
my $content = LWP::Simple::get($httpaddr) or die "Couldn't get it!";
# regex in html source order
if ($content =~ /(<b>Temperature<\/b>)/g) { $trash{a} = $1; }
if ($content =~ /<b>(-?\d+\.\d+)<\/b>/g) { $data{Temp} = $1; }
if ($content =~ /(<b>Humidity<\/b>)/g) { $trash{a} = $1; }
if ($content =~ /<b>(\d+\.\d+)<\/b>/g) { $data{Humidity} = $1; }
if ($content =~ /(<b>Wind<\/b>)/g) { $trash{a} = $1; }
if ($content =~ /(\d+\.\d+)<\/b>/g) { $data{Wind} = $1; }
if ($content =~ /(<b>Daily Rain<\/b>)/g) { $trash{a} = $1; }
if ($content =~ /<b>(\d+\.\d+)<\/b>/g) { $data{Rain} = $1; }
if ($content =~ /(<b>Pressure<\/b>)/g) { $trash{a} = $1; }
if ($content =~ /<b>(\d+\.\d+)<\/b>/g) { $data{Pressure} = $1; }
if ($content =~ /(HEAT INDEX|WIND CHILL)/g) { $trash{a} = $1; }
if ($content =~ /(\d+\.\d+)/g) { $data{HeatIndex} = $1; }
if ($content =~ /(DEW POINT:)/g) { $trash{a} = $1; }
if ($content =~ /(\d+\.\d+)/g) { $data{DewPoint} = $1; }
for (keys %data) {
printf "%s:%s ", $_, $data{$_};
}
print "\n";
In this weather sample, all of these values stay in the same order all the time; only the values returned change as the weather changes.
i have tried adding a
my %data;
my %trash;
my $content = LWP::Simple::get($httpaddr) or die "Couldn't get it!";
any help is greatly appreciated. thanks!
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $httpaddr = "http://www.example.com";
my %data;
my %trash;
my $content = get($httpaddr) or die "Couldn't get it!";

$content =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
$content =~ s/\s+/ /gs;
$content =~ s/&[a-zA-Z]{3,4};//gs;

if ($content =~ /(Temperature).+?(-?\d+\.\d+)/) {
    $trash{a} = $1;
    $data{Temp} = $2;
}
if ($content =~ /(Humidity).+?(\d+\.\d+)/) {
    $trash{a} = $1;
    $data{Humidity} = $2;
}

if ($content =~ /(Wind).+?(\d+\.\d+)/) {
    $trash{a} = $1;
    $data{Wind} = $2;
}

if ($content =~ /(Daily Rain).+?(\d+\.\d+)/) {
    $trash{a} = $1;
    $data{Rain} = $2;
}

if ($content =~ /(Pressure).+?(\d+\.\d+)/) {
    $trash{a} = $1;
    $data{Pressure} = $2;
}

if ($content =~ /(HEAT INDEX|WIND CHILL).+?(\d+\.\d+)/) {
    $trash{a} = $1;
    $data{HeatIndex} = $2;
}

if ($content =~ /(DEW POINT:).+?(\d+\.\d+)/) {
    $trash{a} = $1;
    $data{DewPoint} = $2;
}

for (keys %data) {
    printf "%s:%s ", $_, $data{$_};
}
print "\n";
although I don't understand what %trash is being used for. The regexp for stripping HTML code is crude but seems to work OK in this case. When I run it against the URL you posted I get something like this printed out:
DewPoint:30.0 Humidity:40.5 Temp:53.4 Wind:2.2 Pressure:30.03 Rain:0.00 HeatIndex:53.4
If there is no way to do what I'm wanting to do, just say it so that I can move along with friggen 20 separate scripts that pipe to a file and then get read to complete the primary goal. However, that is not my preferred method, as it is sloppy and rigged.
[edited by: Diceman at 10:11 pm (utc) on Oct. 16, 2006]
Temperature 53.0
As long as those two bits of data are in sequence on the page, the order in which you parse the page will not matter, since each regexp searches through the entire document/variable until it finds the first correct match.
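To illustrate that point, here is a minimal sketch of the label-then-value approach driven by a lookup table. The sample string, the `%label_for` table and the `extract_values` helper are assumptions made up for this example; they are not taken from the real page. Each pattern is applied without `/g`, so every search starts at the beginning of the text no matter where the previous one ended.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text standing in for the tag-stripped page.
my $content = 'Humidity 40.5 % Temperature 53.4 F DEW POINT: 30.0 F Wind 2.2 mph';

# Label pattern for each value; the key names follow the scripts in this thread.
my %label_for = (
    Temp     => qr/Temperature/,
    Humidity => qr/Humidity/,
    Wind     => qr/Wind/,
    DewPoint => qr/DEW POINT:/,
);

# Return a hash of every value whose label is found, whatever the page order.
sub extract_values {
    my ($text, $labels) = @_;
    my %found;
    for my $name (keys %$labels) {
        my $label = $labels->{$name};
        # No /g modifier, so the search always starts at the beginning of $text.
        if ($text =~ /$label.+?(-?\d+\.\d+)/s) {
            $found{$name} = $1;
        }
    }
    return %found;
}

my %data = extract_values($content, \%label_for);
printf "%s:%s ", $_, $data{$_} for sort keys %data;
print "\n";
```

A label that is missing from the page simply leaves its key out of the result, so the caller can decide how to report it. Adding the other items should only need one more table entry each.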
Are you saying the numbers would have to be right next to the text with no other tags in between? On most pages, this is not the case.
No. The code I posted removes all the html tags. Or at least it tries to. You are hopefully left with one string. That one string is parsed for the matching patterns. There could still be something left between the two related bits of data though. Spaces or just text for example. That has to be taken into account. Try this:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $httpaddr = "http://www.your-url-here.com";
my %data;
my %trash;
my $content = get($httpaddr) or die "Couldn't get it!";

$content =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs; # removes html tags
$content =~ s/\s+/ /gs; # collapses runs of whitespace to one space
$content =~ s/&#?[a-zA-Z0-9]{3,6};//gs; # removes HTML character entities

print $content;
and see what $content looks like. Note that wherever the character '¦' appears in code posted in this thread, it should be a pipe '|', the character above the backslash '\' on the keyboard. You will need to replace it before running the code, because this forum changes the pipe to the split pipe when a message is posted.
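To see what those three substitutions do without fetching anything, here is a small sketch that runs them on a made-up HTML fragment. The `strip_html` helper name and the sample markup are assumptions for illustration only; the real page will look different. The patterns are the ones from the post above, written with real pipe characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply the three clean-up substitutions to a copy of the HTML.
sub strip_html {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;   # remove tags, including quoted attributes
    $html =~ s/\s+/ /gs;                         # collapse runs of whitespace to one space
    $html =~ s/&#?[a-zA-Z0-9]{3,6};//gs;         # remove entities such as &nbsp; or &#176;
    return $html;
}

# Made-up fragment: label and value sit in different cells on different lines.
my $sample = join "\n",
    '<tr><td class="label"><b>Temperature</b></td>',
    '  <td><b>53.4</b>&nbsp;&deg;F</td></tr>';

print strip_html($sample), "\n";
```

After the clean-up the label and the number end up in one string separated only by a space, which is why a pattern like `/(Temperature).+?(-?\d+\.\d+)/` can then find them. As the post says, this is a crude way to strip HTML; a real HTML parser module would be more robust if the page markup gets complicated.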
if ($content =~ m!<b>Temperature</b>.*?<b>(-?\d+\.\d+)</b>!s) { $data{Temp} = $1; } pos($content) = undef;
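The search position matters here because of how `/g` behaves in scalar context: each successful match resumes where the previous one stopped, and a failed match resets the position to the start of the string. The sketch below uses an invented sample string and two hypothetical helpers, `sequential` and `independent`, to show how that produces the "value of the very first item" symptom when the page order changes, and why a combined label-plus-value pattern without `/g` does not have the problem.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical page text: Wind now appears BEFORE Humidity.
my $content = 'Temperature 53.4 Wind 2.2 Humidity 40.5';

# The order the script expects the items to be in.
my @labels = ('Temperature', 'Humidity', 'Wind');

# Sequential style: label and value are matched by separate /g patterns.
sub sequential {
    my ($text, @wanted) = @_;
    my %data;
    for my $label (@wanted) {
        # Scalar-context /g resumes at pos($text); a failed match resets pos.
        my $found_label = $text =~ /\Q$label\E/g;
        if ($text =~ /(\d+\.\d+)/g) { $data{$label} = $1; }
    }
    return %data;
}

# Independent style: one pattern per item, no /g, so pos is never consulted.
sub independent {
    my ($text, @wanted) = @_;
    my %data;
    for my $label (@wanted) {
        if ($text =~ /\Q$label\E.+?(\d+\.\d+)/s) { $data{$label} = $1; }
    }
    return %data;
}

my %seq = sequential($content, @labels);
my %ind = independent($content, @labels);
print "sequential Wind:  $seq{Wind}\n";   # picks up the first number on the page
print "independent Wind: $ind{Wind}\n";   # picks up the number after the Wind label
```

In the sequential version the search for Wind starts after Humidity, fails, and resets the position, so the following number pattern matches the very first number in the text.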