Best way to make a whole site searchable?

without putting content in a database

pixeltierra

4:47 am on Nov 22, 2006 (gmt 0)

What are the various options for making a site searchable without putting all content in a database?

Is it viable to make an indexing script and just search through that?

Has anyone ever done this? Care to share your story/experience?

bill

5:01 am on Nov 22, 2006 (gmt 0)

I've used wikis that run off of text files, not strictly databases, and they've been searchable.

I'm not really clear what you're looking for here. Are you looking for systems that have search built in, or how to structure your site/pages to make them SE friendly, or something else?

jtara

5:28 am on Nov 22, 2006 (gmt 0)

Any "indexing script" is going to put your content into a database.

One of the nice things about CMSs is that a pretty decent site search typically goes along for the ride - since your content is already in a database.

That said, search requires a different kind of indexing than the typical database provides. FWIW, MySQL *does* have full-text indexing capability. Of course, your database will at least double in size when you do full-text indexing.

phranque

5:57 am on Nov 22, 2006 (gmt 0)

you could install a search engine on your site.
i've had some success with ht://Dig (http://htdig.org)

pixeltierra

6:19 am on Nov 22, 2006 (gmt 0)

The reason I don't want my content in a db is that I don't want to go through the db to edit my content. I use a lot of scripting that gets tweaked. I want files that open in code editors.

What I meant by an 'indexing script', is this: a script that is pointed to certain directories that will suck out content from files (text) and stick into a searchable format (either a db or a flat file) while associating the file name. The script will run as a periodic cron job.

For example check this out:

//example_1.php

<?php
echo "<h1>Pancakes</h1>";
?>

//example_2.php

<?php
echo "<h1>Hotdogs</h1>";
?>

These two files would be indexed like follows:

+----------+---------------+
| content  | file          |
+----------+---------------+
| Pancakes | example_1.php |
| Hotdogs  | example_2.php |
+----------+---------------+

The hard part would be getting to the actual content, i.e. stripping the php and html and preserving the content.

So my question was does this seem silly or smart, and has anyone attempted it and/or have any tips?

Matt Probert

10:39 am on Nov 22, 2006 (gmt 0)

Has anyone ever done this?

Yes. <g>

And it works rather well, though I say so myself.

I have a large web site, many pages, much data (all reference), which is searchable by various scripts, a plain "search" script to enable the user to look for "widgets", and research scripts that enable users to retrieve ALL data recorded about "widgets" either site wide or within a restricted scope.

I did encounter issues while developing these searches. The main one was server overload caused by site suckers and the like running numerous simultaneous searches which took too long.

It should also be noted that it's very unlikely any shared hosting service will allow such CPU-intensive work to be done. I use a dedicated server.

The scripts are written in Perl, which is designed for extracting text from within text files (in this case HTML pages).

The simplest basic premise is to sequentially open each HTML page in turn, seek the required text string, and if found return the contents of that web page.
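
As a rough illustration of that premise (not Matt's actual Perl scripts, which aren't shown in this thread), here is a minimal Python sketch. The function name `search_pages` and the assumption that pages are static `.html` files under one directory are mine.

```python
from pathlib import Path


def search_pages(root, needle):
    """Open each .html file under root in turn and return the paths
    of those whose raw contents contain needle (case-insensitive)."""
    needle = needle.lower()
    matches = []
    for path in sorted(Path(root).rglob("*.html")):
        # errors="replace" keeps one badly encoded page from aborting the scan
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            matches.append(path)
    return matches


# Hypothetical usage: list every page under ./site that mentions "widgets"
# for page in search_pages("site", "widgets"):
#     print(page)
```

Every query reads every file, which is the cost the overload problem above comes from. Note too that this matches against raw markup, so a search for "div" would hit nearly every page.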

From this simple premise one can add sophistication to increase the speed at which results are returned, restrict the returned data to just parts of a web page, &c.

You will find that the web site is a database, and the HTML pages are records, and as such that modifying the basic structure of the web pages can cause the searches to fail (sheepish grin <g>)

Good luck.

Matt

rocknbil

12:38 pm on Nov 22, 2006 (gmt 0)

The reason I don't want my content in a db is that I don't want to go through the db to edit my content. I use a lot of scripting that gets tweaked. I want files that open in code editors.

You still want to store your search data in a database, it's at least 100 times faster than searching text files. You shouldn't have to use the DB to edit your pages.

The speed improvements are twofold. What you do is write a script (or get a canned script and modify it) that indexes the pages once or multiple times a day via a cron job. When it does the indexing it strips out all the HTML and stores the raw content in the DB, associated with the URL of the originating page. Instead of searching through HTML files every time someone searches the site, it does it once, or twice, or however many times you think the search database needs updating.
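
A minimal sketch of such an indexing script, in Python with SQLite standing in for MySQL. The table name `pages`, the function names, and the assumption that pages are static `.html` files are all illustrative; for PHP pages you would more likely index the rendered output than the source files.

```python
import sqlite3
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def build_index(site_root, base_url, db_path):
    """Store the stripped text of every .html file, keyed by its URL."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT)"
    )
    root = Path(site_root)
    for path in sorted(root.rglob("*.html")):
        url = base_url.rstrip("/") + "/" + path.relative_to(root).as_posix()
        text = html_to_text(path.read_text(encoding="utf-8", errors="replace"))
        # INSERT OR REPLACE lets repeated cron runs refresh existing rows
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)",
            (url, text),
        )
    conn.commit()
    return conn
```

Run from cron, something like `build_index("/var/www/site", "http://example.com", "search.db")` (paths hypothetical) would rebuild the index on whatever schedule suits the site.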

The other advantage is that SELECT statements on a MySQL DB are faster and more flexible than regexps. It's also not as hard on the server: opening and closing plain text files for every search a visitor runs is pretty hard on a server disk.
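
To illustrate the query side, here is a small sketch using SQLite from Python (standing in for MySQL). The `pages` table, its columns and the `search_index` helper are assumptions made for the example, seeded with the two sample pages from earlier in the thread.

```python
import sqlite3


def search_index(conn, term, limit=10):
    """Return (url, snippet) rows whose stored content contains term."""
    # Escape LIKE wildcards so the visitor's input is matched literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT url, substr(content, 1, 80) FROM pages "
        "WHERE content LIKE ? ESCAPE '\\' ORDER BY url LIMIT ?",
        (f"%{escaped}%", limit),
    )
    return rows.fetchall()


# Tiny in-memory demo index
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [("/example_1.php", "Pancakes"), ("/example_2.php", "Hotdogs")],
)
print(search_index(conn, "pancake"))
```

A `LIKE` with a leading wildcard still scans every row, so on a large site a full-text index of the kind jtara mentions would probably be the better fit.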

If you're good at scripting, you can even develop a few subs to use this clean content for generating meta tags for keywords and descriptions that are truly unique and specific to the pages rather than having to hand code them.
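
One way such subs might look, sketched in Python. The stopword list, the 155-character cutoff and all function names are illustrative assumptions, not anything rocknbil specified.

```python
import re
from collections import Counter
from html import escape

# Deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "are", "for", "with", "that", "this", "from", "you"}


def make_description(text, max_len=155):
    """Trim clean page text to a description, cutting at a word boundary."""
    text = " ".join(text.split())
    if len(text) <= max_len:
        return text
    return text[:max_len].rsplit(" ", 1)[0] + "..."


def make_keywords(text, count=5):
    """Return the most frequent non-stopword words of three or more letters."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in freq.most_common(count)]


def meta_tags(text):
    """Build description and keywords meta tags from clean page text."""
    desc = escape(make_description(text), quote=True)
    keys = escape(", ".join(make_keywords(text)), quote=True)
    return (
        f'<meta name="description" content="{desc}">\n'
        f'<meta name="keywords" content="{keys}">'
    )
```

Raw word frequency is a crude measure of what a page is about, so treat the output as a starting point rather than finished copy.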

It's a thought, and much more flexible than plain page searches.

pixeltierra

6:27 pm on Nov 22, 2006 (gmt 0)

Rockinbill:

That is exactly what I started this thread for. Check out my explanation in my second post of this thread:

What I meant by an 'indexing script', is this: a script that is pointed to certain directories that will suck out content from files (text) and stick into a searchable format (either a db or a flat file) while associating the file name. The script will run as a periodic cron job.

But specifically I was looking for confirmation that this isn't a silly idea, and for recommendations on existing scripts (I could do it myself if I had to).

So judging by the responses here, it doesn't seem like a silly idea. I'm more of a php guy, but people tell me perl is good at this kind of thing.

jtara

6:39 pm on Nov 22, 2006 (gmt 0)

There's no reason to write your own. There are a number of search engines you can install on your site. htdig, mentioned earlier, is one of them that I have used. All of them are going to use a database (of some sort) to store the index data. These programs crawl your site, just as Google and Yahoo do, and build a database of the words they find.

Your insistence on not using a database caused some confusion. It's always useful when asking a question here to state what you are trying to accomplish, not how you think it might be accomplished. :)

pixeltierra

7:27 pm on Nov 22, 2006 (gmt 0)

I didn't mean to cause confusion, but I can see where it comes from now, since I said two seemingly conflicting things.

...without putting all content in a database?

...stick into a searchable format (either a db or a flat file) while associating the file name. The script will run as a periodic cron job.

What I really mean is I don't want the original content (files) to live in a database, but I do want to put the results of the indexing script in a searchable format, probably a database.

Sorry for the confusion.

abbeyvet

7:55 pm on Nov 22, 2006 (gmt 0)

Take a look at zoom search. Works well, though I have not tried it on HUGE sites.

rocknbil

10:25 pm on Nov 22, 2006 (gmt 0)

That is exactly what I started this thread for. Check out my explanation in my second post of this thread:

Bill sez "duh" and sorry, it was about 4 AM when I posted my reply. Yeah that's the way to do it, it's not all that difficult to code either. Once you have raw content in the DB, there's all sorts of cool stuff you can do with the data.

pixeltierra

1:04 am on Nov 23, 2006 (gmt 0)

Rockinbill:

No problem, don't worry about it. I wasn't clear.

It seems like a simple strip_tags would work pretty well, right?
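
For what it's worth, plain tag stripping has a couple of gotchas: the contents of script and style blocks survive, and entities like &amp;amp; stay encoded. A hedged Python sketch of a regex-based stripper that handles both (the function name just mirrors the PHP one; regexes are not a full HTML parser, so odd markup may still slip through):

```python
import re
from html import unescape


def strip_tags(html):
    """Roughly reduce an HTML string to its visible text."""
    # Drop script/style blocks entirely, contents included
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    # Drop comments
    html = re.sub(r"(?s)<!--.*?-->", " ", html)
    # Drop whatever tags remain, leaving a space so words don't fuse
    html = re.sub(r"<[^>]+>", " ", html)
    # Decode entities and collapse runs of whitespace
    return " ".join(unescape(html).split())
```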

I'm curious about the other cool things you mentioned that you can do with your content once it's in a database. Do tell...

peterdaly

2:12 am on Nov 23, 2006 (gmt 0)

I've used Nutch to implement site search before. It's not really intended for single site use, but it works just fine.

Fairly techie solution, but it's worked great for me in the past. It may have a steep learning curve depending on your background.