Simple Search Algorithm

Forum Moderators: coopster & phranque

Message Too Old, No Replies

Simple Search Algorithm

For use on a website

Dabrowski

6:09 pm on Sep 25, 2007 (gmt 0)

I've almost finished my latest project, but need to add a search box. Now, the site's not very big so I've basically come up with words/phrases relevant to each URL.

Not sure the best way to detect matches with what the user typed in, I've got an idea but it's quite clunky, anyone done this before?

perl_diver

11:48 pm on Sep 25, 2007 (gmt 0)

there are plenty of free site-search scripts written you could look at.

Dabrowski

5:00 pm on Sep 27, 2007 (gmt 0)

That answer is not helpful.

I wish to do it myself, and wanted advice. Anyone can look on Google and pinch someone else's code.

jatar_k

11:40 am on Sep 28, 2007 (gmt 0)

actually perl_diver's advice is solid

>> I've got an idea but it's quite clunky, anyone done this before?

yes, the people who wrote the scripts mentioned before, find them and pull them apart to see how they work. If you find a couple you can cross reference methods

Matt Probert

2:34 pm on Sep 28, 2007 (gmt 0)

Er, so what do you two actually want?

I could advise researching regular expression seraches, and use them in a loop which opens and reads each web page in the directory (it's simple and it works as the basis for a simple search).

O'Reilly publish the Perl cookbook which is great for help with real-word solutions.

But if you just want something quick, then do what Perl Diver sensibly suggested and grab one of the many free scripts and adapt it to your own needs.

Matt

Dabrowski

5:00 pm on Sep 28, 2007 (gmt 0)

What I was after wasn't really code, but methodology. Code I can do. I did have a quick look, the most popular one seemed to be by Matt something, and his own demo page was rubbish, completely useless.

The idea I have so far is this:

1. Read all HTML files on the site.
2. strip the text from meta keywords, title, headers, bullet lists, paragraphs and divs (in that order for relevancy).
3. Create cache file for obvious performance reasons.

It's actually matching and weighting results that I'm not sure about yet. The plan was this:

1. For each chunk of text from above, check each word the user entered, so it would match words not consecutive or in order.
2. Record word index of match.
3. Increase relevance depending on containing tag.
4. Increase relevance for any words in right order.
5. Increase relevance for any consecutive words.

I'm not sure about the relevancy values though, do you think it would work if everything just scored 1?

perl_diver

12:46 am on Sep 29, 2007 (gmt 0)

what you describe is called an index. You are free to create those however you determine best suits your needs. Your plan sounds OK to me.

This is a good link:

[searchtools.com...]

SeanW

8:43 pm on Sep 29, 2007 (gmt 0)

I think the term is "reverse index"

Page 1: Hi there
Page 2: Hello there

Index:

hello: 2
hi: 1
there: 1,2

Unless this is an academic exercise though, it's probably faster to run ht://dig or something, which is skinnable and has its own relevancy algos you can tweak.

Sean

perl_diver

5:16 am on Sep 30, 2007 (gmt 0)

reverse index or inverted index. I agree, there are a number of good applications for creating a searchable index which the OP could use. But if he wants to code his own site search indexer, I can understand that.

Dabrowski

5:48 pm on Oct 2, 2007 (gmt 0)

I like to do things myself, or at least have a go. That way I can understand how it's done. Like I said, I could use someone else's code but it wouldn't give me the same understanding.

Anyways, here's some examples:

Page 1: The quick brown fox jumped over the lazy dog.
Page 2: The lazy dog, was jumped over by the fox, brown and quick.
Page 3: The brown, quick fox avoided the car, the lazy dog died.
Page 4: I hate brussel sprouts, I give them to the quick dog.

ok, the user searches for "quick brown fox":

Page 1 has 3 words +3 right order +3 consecutive.
Page 2 has 3 words.
Page 3 has 3 words.
Page 4 has nothing.

Results show 1, 2, 3.

User searches for "brown fox".

Page 1 scored 2 hits, +1 for right order, +1 for consecutive words.
Page 2 scored 2 hits.
Page 3 scored 2 hits, +1 for right order.
Page 4 0.

Results are 1, 3, 2.

Now try "quick fox lazy dog":

Page 1: 4 words +4 right order, +1 consecutive.
Page 2: 4 words +2 right order, +1 consecutive.
Page 3: 4 words +4 right order, +2 consecutive.
Page 4: 2 words +1 right order.

Results are 3, 1, 2, 4.

Does that look ok? Having actually worked it out on such similar examples it would seem to work ok and give results in the expected order.

SeanW

6:42 pm on Oct 2, 2007 (gmt 0)

Sounds right, as they say, the devil's in the details.

Trying to avoid real work today, so here's my first stab at it. Your pages are stored in files called 1 through 4.

[perl]

#!/usr/bin/perl
use strict;
my $index = {};

use Data::Dumper;

foreach my $f (1..4) {
indexfile($f);
}

#print Dumper $index;
searchindex("brown fox");

sub indexfile {
my $fname = shift;
my $wc = 0;
open A, "<$fname" or return;
while (<A>) {
s/[^a-zA-Z ]//g;
tr/A-Z/a-z/;
my @words = split /\s+/;
foreach my $w (@words) {
push @{$index->{$w}}, ("$fname:$wc");
$wc++;
}
}
}

sub searchindex {
my $phrase = shift;

my @phrasewords = split / /, $phrase;
my $pageresults; # A hash containing the results as we process them

foreach my $w (@phrasewords) {

if ($index->{$w}) { # this word is known to us
foreach my $r (@{$index->{$w}}) {
my ($page, $offset) = split /:/, $r;
$pageresults->{$page}->[$offset]=1;
}
}
}

foreach my $page (keys %$pageresults) {
print $page . " ";
my $total = 0;
my $consecutive = 0; my $cflag = 0;
foreach my $i (@{$pageresults->{$page}}) {
if ($i == 1) {
$total++;
if ($cflag == 1) {
$consecutive++;
} else {
$consecutive=1;
$cflag=1;
}
} else {
$cflag = 0;
}
}
print " = $total total, $consecutive consecutive\n";
}
}
[/perl]

And the results


# perl index.pl
1  = 2 total, 2 consecutive
3  = 2 total, 1 consecutive
2  = 2 total, 2 consecutive

Sean

perl_diver

9:10 pm on Oct 2, 2007 (gmt 0)

I've found order of search terms to be of little relevance over the years, people tend to search in terms of importance or relationship, like:

perl regexp tutorial

shoes black size 13

but there are times when people search for phrases or sayings and order can be relevant.

The search algorithms are actually fairly easy to code, it's coding a good site indexer that allows for fast and accurate results which is tricky and has to be tweaked. Logging searches is a good way to get a feel of what your users will be searching for and enabling you to tweak the indexer and the search script.

Dabrowski

2:58 pm on Oct 3, 2007 (gmt 0)

Thanks for your time SeanW, I'll have a proper look at that when I get bit more time.

Perl_diver, thanks for the suggestion about logging searches, and as for the speed thing, the site is about 60 pages and not dynamix, I'll get it to create an index file so it's only searching 1 file, and that should do it performance wise.

Dabrowski

1:50 am on Oct 11, 2007 (gmt 0)

Unless this is an academic exercise though

It's always an acedemic exercise, I always try and do it myself first. The only thing I gave up on so far is a HTML parser!

ok, I've had a play. So far I've got it to search the entire site for html files, strip the bare text, and note important things like headers and bullets. It then saves this in a cache. I might change this to actually spider the site first instead of reading from disk. I found it picked up a few files I didn't want it to index and had to code around it.

Sean, I like your way of indexing each word, it's better than the way I was going to do it! I would prefer to do this page by page, there doesn't seem to be any problem with this, and it will save having to join and split the indexes, or create more complicated way of storing them.

Incidentally, how did you get your code to appear as code?

SeanW

2:14 am on Oct 11, 2007 (gmt 0)

I would prefer to do this page by page, there doesn't seem to be any problem with this, and it will save having to join and split the indexes, or create more complicated way of storing them.

True, but then the complexity grows in proportion to the number of pages. Besides, the "more complicated" way of storing them only involves switching the 'keyword' => [ "a:b", "c:d" ]
structure to
'keyword' => { pages => [ "a", "c"], offsets => [ "b", "d"] }

Incidentally, how did you get your code to appear as code?

[ perl ]
.....
....
[ /perl ]

(remove spaces)

Dabrowski

2:58 pm on Oct 11, 2007 (gmt 0)

True, but then the complexity grows in proportion to the number of pages

I mean, index the page, then check the results before moving on, so:

[perl]$pageresults->{$page}->[$offset]=1;

becomes

$pageresults->[$offset]=1;[/perl]

and you save a few foreach loops.

Dabrowski

10:44 pm on Oct 14, 2007 (gmt 0)

Hi Sean.

I've coded that up and it works really well, thanks for the indexing code, I would never have thought of doing it that way.

Although now I've written it, I realised I've missed a step, It only scans the search term once, so although it scores for every occurance, it misses multiple right order and consecutive points.

For example, search for "sea shells" in this text:
She sells sea shells on the sea shore, the shells she sells are sea shells for sure.

Man that's even hard to type!

It will score:
sea: 3
shells: 3
right order: 1
consecutive: 1

When it should be 3, 3, 2, 2. Although I think I've found a way of doing that without the intermediate step.

Dabrowski

7:06 pm on Oct 18, 2007 (gmt 0)

ok, I've got it working splendidly, without the intermediate section using the %pageresults bit.

Now, getting slightly more complicated, what about multiple words in the search term, the 'red lorry' method as I've come to call it.

On the text "The red car and the blue car had a race", search for "red car blue car".

Obviously this should score 8, but the double occurance of the word 'car' is a proverbial spanner.

Anyone still watching this thread?

SeanW

7:22 pm on Oct 18, 2007 (gmt 0)

Yup, still here.

How does the search know the string is really "red car .* blue car" and not "red .* car .* blue .* car" or some variant?

Sean

Dabrowski

9:13 pm on Oct 18, 2007 (gmt 0)

It should still recognise that 'car' comes after 'blue' and give it points for order/consecutive as appropriate.

I think I can work that in but I don't know how tidy it's going to be.