I could advise researching regular expression seraches, and use them in a loop which opens and reads each web page in the directory (it's simple and it works as the basis for a simple search).
O'Reilly publish the Perl cookbook which is great for help with real-word solutions.
But if you just want something quick, then do what Perl Diver sensibly suggested and grab one of the many free scripts and adapt it to your own needs.
Matt
The idea I have so far is this:
1. Read all HTML files on the site.
2. strip the text from meta keywords, title, headers, bullet lists, paragraphs and divs (in that order for relevancy).
3. Create cache file for obvious performance reasons.
It's actually matching and weighting results that I'm not sure about yet. The plan was this:
1. For each chunk of text from above, check each word the user entered, so it would match words not consecutive or in order.
2. Record word index of match.
3. Increase relevance depending on containing tag.
4. Increase relevance for any words in right order.
5. Increase relevance for any consecutive words.
I'm not sure about the relevancy values though, do you think it would work if everything just scored 1?
This is a good link:
[searchtools.com...]
Anyways, here's some examples:
Page 1: The quick brown fox jumped over the lazy dog.
Page 2: The lazy dog, was jumped over by the fox, brown and quick.
Page 3: The brown, quick fox avoided the car, the lazy dog died.
Page 4: I hate brussel sprouts, I give them to the quick dog.
ok, the user searches for "quick brown fox":
Page 1 has 3 words +3 right order +3 consecutive.
Page 2 has 3 words.
Page 3 has 3 words.
Page 4 has nothing.
Results show 1, 2, 3.
User searches for "brown fox".
Page 1 scored 2 hits, +1 for right order, +1 for consecutive words.
Page 2 scored 2 hits.
Page 3 scored 2 hits, +1 for right order.
Page 4 0.
Results are 1, 3, 2.
Now try "quick fox lazy dog":
Page 1: 4 words +4 right order, +1 consecutive.
Page 2: 4 words +2 right order, +1 consecutive.
Page 3: 4 words +4 right order, +2 consecutive.
Page 4: 2 words +1 right order.
Results are 3, 1, 2, 4.
Does that look ok? Having actually worked it out on such similar examples it would seem to work ok and give results in the expected order.
Trying to avoid real work today, so here's my first stab at it. Your pages are stored in files called 1 through 4.
[perl]
#!/usr/bin/perl
use strict;
my $index = {};
use Data::Dumper;
foreach my $f (1..4) {
indexfile($f);
}
#print Dumper $index;
searchindex("brown fox");
sub indexfile {
my $fname = shift;
my $wc = 0;
open A, "<$fname" or return;
while (<A>) {
s/[^a-zA-Z ]//g;
tr/A-Z/a-z/;
my @words = split /\s+/;
foreach my $w (@words) {
push @{$index->{$w}}, ("$fname:$wc");
$wc++;
}
}
}
sub searchindex {
my $phrase = shift;
my @phrasewords = split / /, $phrase;
my $pageresults; # A hash containing the results as we process them
foreach my $w (@phrasewords) {
if ($index->{$w}) { # this word is known to us
foreach my $r (@{$index->{$w}}) {
my ($page, $offset) = split /:/, $r;
$pageresults->{$page}->[$offset]=1;
}
}
}
foreach my $page (keys %$pageresults) {
print $page . " ";
my $total = 0;
my $consecutive = 0; my $cflag = 0;
foreach my $i (@{$pageresults->{$page}}) {
if ($i == 1) {
$total++;
if ($cflag == 1) {
$consecutive++;
} else {
$consecutive=1;
$cflag=1;
}
} else {
$cflag = 0;
}
}
print " = $total total, $consecutive consecutive\n";
}
}
[/perl]
And the results
# perl index.pl
1 = 2 total, 2 consecutive
3 = 2 total, 1 consecutive
2 = 2 total, 2 consecutive
Sean
perl regexp tutorial
shoes black size 13
but there are times when people search for phrases or sayings and order can be relevant.
The search algorithms are actually fairly easy to code, it's coding a good site indexer that allows for fast and accurate results which is tricky and has to be tweaked. Logging searches is a good way to get a feel of what your users will be searching for and enabling you to tweak the indexer and the search script.
Perl_diver, thanks for the suggestion about logging searches, and as for the speed thing, the site is about 60 pages and not dynamix, I'll get it to create an index file so it's only searching 1 file, and that should do it performance wise.
Unless this is an academic exercise though
It's always an acedemic exercise, I always try and do it myself first. The only thing I gave up on so far is a HTML parser!
ok, I've had a play. So far I've got it to search the entire site for html files, strip the bare text, and note important things like headers and bullets. It then saves this in a cache. I might change this to actually spider the site first instead of reading from disk. I found it picked up a few files I didn't want it to index and had to code around it.
Sean, I like your way of indexing each word, it's better than the way I was going to do it! I would prefer to do this page by page, there doesn't seem to be any problem with this, and it will save having to join and split the indexes, or create more complicated way of storing them.
Incidentally, how did you get your code to appear as code?
I would prefer to do this page by page, there doesn't seem to be any problem with this, and it will save having to join and split the indexes, or create more complicated way of storing them.
True, but then the complexity grows in proportion to the number of pages. Besides, the "more complicated" way of storing them only involves switching the 'keyword' => [ "a:b", "c:d" ]
structure to
'keyword' => { pages => [ "a", "c"], offsets => [ "b", "d"] }
Incidentally, how did you get your code to appear as code?
[ perl ]
.....
....
[ /perl ]
(remove spaces)
I've coded that up and it works really well, thanks for the indexing code, I would never have thought of doing it that way.
Although now I've written it, I realised I've missed a step, It only scans the search term once, so although it scores for every occurance, it misses multiple right order and consecutive points.
For example, search for "sea shells" in this text:
She sells sea shells on the sea shore, the shells she sells are sea shells for sure.
Man that's even hard to type!
It will score:
sea: 3
shells: 3
right order: 1
consecutive: 1
When it should be 3, 3, 2, 2. Although I think I've found a way of doing that without the intermediate step.
Now, getting slightly more complicated, what about multiple words in the search term, the 'red lorry' method as I've come to call it.
On the text "The red car and the blue car had a race", search for "red car blue car".
Obviously this should score 8, but the double occurance of the word 'car' is a proverbial spanner.
Anyone still watching this thread?