You can't just check the User-Agent string and add every IP that sends the right User-Agent to a table, because the table would inherit the same problem the User-Agent check has: anyone can send a forged User-Agent. You could get a list of IPs from some website, but how does that website figure out what GoogleBot's IP addresses are?
I'm speaking at Pubcon later this month on the topic of identifying search engine spiders. There are a number of techniques for doing it. Personally, my favorite is the semi-automated approach: I have a CGI script that logs visits by users that behave in certain ways (no HTTP_REFERER header, visiting pages rarely visited by humans, several requests in a short time, requests for HTML files but no images or CSS), and the script emails me its log each day. I then run the log through another CGI script, which tells me whether any of the IP addresses are already in my lists. Finally, I research every entry that is not in the lists.
Dan
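For anyone who wants to try something similar, here is a rough Python sketch of the two steps described above: flag visitors by behavior, then drop the IPs that are already in a known list. It is not Dan's actual script. The `Hit` record, the function names, the burst window, the asset extensions, and the "two or more signals" threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Hit:
    """One logged request (hypothetical record layout)."""
    ip: str
    timestamp: float   # seconds since epoch
    path: str
    referer: str       # empty string when no HTTP_REFERER was sent

# Assumed values; tune them for your own traffic.
ASSET_EXTENSIONS = (".css", ".js", ".png", ".gif", ".jpg", ".jpeg", ".ico")
BURST_COUNT = 5        # this many requests...
BURST_WINDOW = 10.0    # ...within this many seconds counts as a burst
MIN_SIGNALS = 2        # flag an IP only when several signals agree

def suspicious_ips(hits, rare_paths=frozenset()):
    """Return {ip: [reasons]} for visitors that behave like spiders."""
    by_ip = defaultdict(list)
    for hit in hits:
        by_ip[hit.ip].append(hit)

    flagged = {}
    for ip, visits in by_ip.items():
        visits.sort(key=lambda h: h.timestamp)
        times = [h.timestamp for h in visits]
        reasons = []

        # Signal 1: never sent a referer.
        if all(not h.referer for h in visits):
            reasons.append("no referer")
        # Signal 2: requested a page humans rarely visit.
        if any(h.path in rare_paths for h in visits):
            reasons.append("rare page")
        # Signal 3: several requests in a short time.
        if any(times[i + BURST_COUNT - 1] - times[i] <= BURST_WINDOW
               for i in range(len(times) - BURST_COUNT + 1)):
            reasons.append("burst")
        # Signal 4: fetched pages but no images/CSS.
        if not any(h.path.lower().endswith(ASSET_EXTENSIONS) for h in visits):
            reasons.append("no assets")

        if len(reasons) >= MIN_SIGNALS:
            flagged[ip] = reasons
    return flagged

def unknown_only(flagged, known_ips):
    """Drop IPs already in your lists; the rest need manual research."""
    return {ip: why for ip, why in flagged.items() if ip not in known_ips}
```

The output of `unknown_only` is the part that still needs a human: the sketch only narrows down which IPs are worth researching, it does not tell you whose spider it is.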
Volatilegx, that seems like a clever system you've got set up. Have you identified many sneaky spiders that way?
Actually, I think it is unnecessary for search engines to use stealth spiders at all. Google, for instance, could use its Google Web Accelerator data instead.