I was thinking that it would be useful to list the named distributed crawlers. Some indication of whether they are ever seen inside server farms would also be of value.
I guess the criterion here is that these are bots that use a large number of non-server machines to feed a central index.
Here's a list of some creatures I've seen that I think might meet this definition (listed as UA keywords that you could match on):
Synapse
MJ12Bot
Proximic
Genieo
A6Indexer
Pinterest
FunWebProducts
CRAZYWEBCRAWLER
Flipboard
80legs
linkdexbot?
urlappendbot?
This list comes from looking at the number of distinct address ranges of various sizes that have harboured each.
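For anyone who wants to run the same kind of count on their own logs, here is a rough Python sketch of the idea, not the exact method I used. It assumes you have already pulled (IP, user-agent) pairs out of your access log, that the addresses are IPv4, and that /8, /16 and /24 are reasonable stand-ins for "ranges of various sizes". The function names match_keyword and count_ranges are made up for the example.

```python
import ipaddress
import re
from collections import defaultdict

# UA keywords from the list above (question marks dropped).
KEYWORDS = [
    "Synapse", "MJ12Bot", "Proximic", "Genieo", "A6Indexer", "Pinterest",
    "FunWebProducts", "CRAZYWEBCRAWLER", "Flipboard", "80legs",
    "linkdexbot", "urlappendbot",
]
# Case-insensitive substring match on any of the keywords.
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def match_keyword(user_agent):
    """Return the list spelling of the first keyword found in the UA, or None."""
    m = PATTERN.search(user_agent)
    if m is None:
        return None
    found = m.group(0).lower()
    return next(k for k in KEYWORDS if k.lower() == found)


def count_ranges(hits, prefix_lengths=(8, 16, 24)):
    """Count distinct networks per keyword for each prefix length.

    hits is an iterable of (ip_string, user_agent) pairs (assumed IPv4).
    Returns {keyword: {prefix_length: number_of_distinct_networks}}.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for ip, user_agent in hits:
        keyword = match_keyword(user_agent)
        if keyword is None:
            continue
        for plen in prefix_lengths:
            # strict=False masks off the host bits, giving the enclosing network.
            network = ipaddress.ip_network(f"{ip}/{plen}", strict=False)
            seen[keyword][plen].add(network)
    return {
        keyword: {plen: len(nets) for plen, nets in by_len.items()}
        for keyword, by_len in seen.items()
    }
```

A bot that turns up in many distinct /16s and /24s with only a handful of hits in each is a candidate for the distributed category. One that lives in a few ranges is more likely running from a server farm.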
A network of virus-infected machines would be one way to achieve this, as would a browser toolbar or plugin/addin.
I'm not very familiar with the various business models, or whether the items I've listed above even all qualify.
I have a number of other candidates that might be among these, but in smaller numbers, so without a lot of research it's hard to tell.
Anyone here have suggestions/additions/subtractions?
There are certainly stealth distributed crawlers that pretend to be browsers, but they're much harder to recognise:
[webmasterworld.com...]