Named distributed crawler list


trintragula

10:13 am on Jan 8, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



I was thinking that it would be useful to list the named distributed crawlers. Some indication of whether they are ever seen inside server farms would also be of value.

I guess the criterion here is that these are bots that use a large number of non-server machines to feed a central index.

Here's a list of some creatures that I've seen that I think might meet this definition (listed as UA keywords that you could match on):

Synapse
MJ12Bot
Proximic
Genieo
A6Indexer
Pinterest
FunWebProducts
CRAZYWEBCRAWLER
Flipboard
80legs
linkdexbot?
urlappendbot?
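For anyone who wants to try matching on these, here is a minimal Python sketch of a case-insensitive keyword check. The function name `match_crawler` and the sample UA strings in the usage note are illustrative assumptions, not anything taken from real logs, and the two entries marked "?" are still unconfirmed candidates.

```python
import re

# UA keywords copied from the list above. The two entries marked "?" in the
# post are included here but remain unconfirmed candidates.
DISTRIBUTED_CRAWLER_KEYWORDS = [
    "Synapse", "MJ12Bot", "Proximic", "Genieo", "A6Indexer", "Pinterest",
    "FunWebProducts", "CRAZYWEBCRAWLER", "Flipboard", "80legs",
    "linkdexbot", "urlappendbot",
]

# Map the lowercase form back to the spelling used in the list.
_CANONICAL = {k.lower(): k for k in DISTRIBUTED_CRAWLER_KEYWORDS}

# One alternation of all keywords, escaped and matched case-insensitively.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in DISTRIBUTED_CRAWLER_KEYWORDS),
    re.IGNORECASE,
)


def match_crawler(user_agent):
    """Return the listed keyword found in the UA string, or None."""
    found = _PATTERN.search(user_agent or "")
    if found is None:
        return None
    return _CANONICAL[found.group(0).lower()]
```

Called as `match_crawler("Mozilla/5.0 (compatible; MJ12bot/v1.4.5)")`, this returns the list spelling `"MJ12Bot"`; a UA containing none of the keywords returns `None`.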

This list comes from looking at the number of distinct address ranges of various sizes that have harboured each.
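Roughly, the counting can be sketched like this in Python. It assumes the log has already been reduced to (IP, user-agent) pairs, handles IPv4 only, and uses /16 and /24 as stand-ins for "ranges of various sizes"; the function name `distinct_ranges` and those prefix lengths are illustrative choices, not necessarily the exact method behind the list.

```python
import ipaddress
from collections import defaultdict


def distinct_ranges(hits, keywords, prefix_lengths=(16, 24)):
    """Count distinct IPv4 ranges seen per UA keyword.

    hits: iterable of (ip_string, user_agent) pairs.
    Returns {keyword: {prefix_length: number_of_distinct_ranges}}.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for ip, user_agent in hits:
        addr = ipaddress.ip_address(ip)
        if addr.version != 4:
            continue  # this sketch only handles IPv4
        ua_lower = user_agent.lower()
        for keyword in keywords:
            if keyword.lower() not in ua_lower:
                continue
            for plen in prefix_lengths:
                # strict=False masks off the host bits to get the enclosing range
                net = ipaddress.ip_network(f"{ip}/{plen}", strict=False)
                seen[keyword][plen].add(net)
    return {
        keyword: {plen: len(nets) for plen, nets in by_len.items()}
        for keyword, by_len in seen.items()
    }
```

A keyword whose hits are spread over many distinct /16s and /24s looks distributed; one that stays inside a handful of ranges looks more like a server farm.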

A network of virus-infected machines would be one way to achieve this, as would a browser toolbar or plugin/addin.

I'm not very familiar with the various business models, or whether the items I've listed above even all qualify.
I have a number of other candidates that might be among these, but in smaller numbers, so without a lot of research it's hard to tell.

Anyone here have suggestions/additions/subtractions?

There are certainly stealth distributed crawlers that pretend to be browsers, but they're much harder to recognise:
[webmasterworld.com...]

aristotle

9:51 pm on Jan 8, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you just make a habit of blocking all unwanted crawlers with UA keywords, then it doesn't matter if they come from multiple IPs or not.

trintragula

10:23 pm on Jan 8, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



That's certainly a practical way to stop them, but I can think of a number of reasons why this list might still be interesting:

1. removing candidates from the list of IPs that should be checked for being a new server farm (server farms may still host these things, but if they're on users' machines, the user machines will likely swamp them by number, so they don't make good leads).

2. collecting a list provides a basis for analysing bot behaviour, which may be useful in the future for catching those distributed crawlers that hide themselves.

3. bot behaviour from an IP range often results in the range being blocked. Bot behaviour from a known distributed crawler can be treated differently, as those ranges will more often than not be full of users.

aristotle

12:50 am on Jan 9, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So a bot that comes from a lot of different IPs might behave differently in other ways as well, and might also provide other kinds of useful extra information.

lucy24

2:43 am on Jan 9, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sometimes I meet a robot that is intrinsically unexceptionable (MJ12 comes to mind) but because its crawling is farmed out to so many different ranges, it often ends up getting blocked-by-association.

If a distributed crawler is known to provide a good and useful service, you might tweak your lockouts to poke a hole for them. Conversely, if a crawler is inherently malign it may merit a lockout by name, no matter where it comes from.
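As a rough sketch of that ordering in Python: a lockout by name wins first, then a hole poked by name, and only then the ordinary range block. The function `decide` and both keyword lists are hypothetical; which crawlers count as welcome or unwelcome is the site owner's call, and since a UA string can be faked this is only a first-pass filter.

```python
def decide(user_agent, ip_range_blocked, welcome=(), unwelcome=()):
    """Return "block" or "allow" for one request.

    welcome / unwelcome: UA keywords chosen by the site owner (hypothetical
    here). ip_range_blocked: True if the request's IP falls in a range that
    is already locked out for bot behaviour.
    """
    ua = (user_agent or "").lower()
    # 1. Crawlers locked out by name are blocked wherever they come from.
    if any(k.lower() in ua for k in unwelcome):
        return "block"
    # 2. Named crawlers judged useful get a hole poked through range blocks.
    if any(k.lower() in ua for k in welcome):
        return "allow"
    # 3. Everything else falls back to the ordinary range lockout.
    return "block" if ip_range_blocked else "allow"
```

For example, with `welcome=["MJ12Bot"]`, a request whose UA contains "MJ12bot" is allowed even when it arrives from a blocked range, while an unnamed visitor from the same range is still blocked.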

Doesn't pinterest have its own IP ranges? I always thought it did :(

trintragula

11:03 am on Jan 9, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hmm, it looks like pinterest may be AWS. I should probably remove that one from the list. Thanks, Lucy.

Any additions or corrections are appreciated!