CCBot

another Nutch Variant

Ocean10000

1:50 pm on Mar 27, 2008 (gmt 0)

CCBot/1.0 (+http://www.commoncrawl.org/bot.html)

Just saw this bot come by today. It obeyed robots.txt, at least enough to fetch the file, read it, and leave without a problem. The link in the User-Agent doesn't work, but they do have a FAQ page which covers the basics. They currently claim to be seeking 501(c)(3) status as a non-profit, the "CommonCrawl Foundation". They use the same line about creating a new wave of search engines but do not state anything beyond that. After reading their FAQ, I found that this web crawler is based on the Nutch crawler (this is noted in the FAQ).

The current IP range listed on the website for this bot is:
38.103.63.16 through 38.103.63.18
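
For anyone who wants to match that listed range in a log filter, here is a minimal sketch in Python using the standard `ipaddress` module. The function name `in_listed_ccbot_range` is made up for illustration, and the range is only what the post quotes from the bot's website, so it may well change.

```python
import ipaddress

# Inclusive range as quoted in the post above; treat it as a snapshot, not a guarantee.
CCBOT_FIRST = ipaddress.ip_address("38.103.63.16")
CCBOT_LAST = ipaddress.ip_address("38.103.63.18")


def in_listed_ccbot_range(ip: str) -> bool:
    """Return True if `ip` falls inside the range listed for the bot (hypothetical helper)."""
    addr = ipaddress.ip_address(ip)
    # The listed range is IPv4, so anything else cannot match.
    if addr.version != 4:
        return False
    return CCBOT_FIRST <= addr <= CCBOT_LAST


if __name__ == "__main__":
    for candidate in ("38.103.63.17", "38.103.63.19"):
        print(candidate, in_listed_ccbot_range(candidate))
```

Comparing address objects rather than strings avoids the usual pitfalls of lexical comparison on dotted quads.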

keyplyr

11:46 pm on Mar 27, 2008 (gmt 0)

I just ban "nutch" across the board.

IMO there are just way too many of these, most of which disobey robots.txt. If these companies end up succeeding with their business models and they also contribute to mine, then I'll reevaluate them on a case-by-case basis once they start using their own unique UA.

incrediBILL

11:47 pm on Mar 27, 2008 (gmt 0)

This is a pretty bleeding-edge find, as I just saw it hit on 3/26 for the first time and return today, with no other past bot activity tracked for that IP range.

38.103.63.* "CCBot/1.0 (+http://www.commoncrawl.org/bot.html)"

incrediBILL

11:51 pm on Mar 27, 2008 (gmt 0)

I just ban "nutch" across the board.

That won't help you in this case, since it uses "ccbot" as its robot name.

Now you know why I whitelist ;)

OK, actually the crawlers using gibberish strings like "qwewyeowueouweoiu wieuwoie" are what caused me to whitelist, but the nutch and heritrix variations would've pushed me in the same direction eventually, so the end result would've been the same.
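
The difference between the two approaches can be sketched in a few lines of Python. Everything here except the two user-agent strings quoted in the thread is an assumption made for illustration: the list contents and the function names are hypothetical, and a real whitelist would normally check more than a substring of the UA.

```python
# Hypothetical lists, for illustration only.
BANNED_SUBSTRINGS = ("nutch",)                  # blacklist: block what you recognise
ALLOWED_SUBSTRINGS = ("googlebot", "mozilla")   # whitelist: allow only what you recognise


def blocked_by_blacklist(user_agent: str) -> bool:
    """Blacklist check: block only if a banned substring appears in the UA."""
    ua = user_agent.lower()
    return any(s in ua for s in BANNED_SUBSTRINGS)


def allowed_by_whitelist(user_agent: str) -> bool:
    """Whitelist check: allow only if an approved substring appears in the UA."""
    ua = user_agent.lower()
    return any(s in ua for s in ALLOWED_SUBSTRINGS)


if __name__ == "__main__":
    samples = (
        "CCBot/1.0 (+http://www.commoncrawl.org/bot.html)",
        "qwewyeowueouweoiu wieuwoie",
    )
    for ua in samples:
        print(ua)
        print("  blacklist blocks it:", blocked_by_blacklist(ua))
        print("  whitelist allows it:", allowed_by_whitelist(ua))
```

Neither sample contains "nutch", so the blacklist lets both through, while the whitelist rejects both because neither matches an approved entry.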

keyplyr

1:32 am on Mar 28, 2008 (gmt 0)

I just ban "nutch" across the board.

That won't help you in this case, since it uses "ccbot" as its robot name.

Well, I didn't know the UA string since the OP didn't post it. The thread title is:

CCBot
another Nutch Variant

...so I assumed it had "nutch" somewhere in the UA string. As it turns out, I have "ccbot" banned already, so it can't be that "bleeding edge." LOL

I agree that whitelisting is the way to go for some sites. I do it on a small scale for a dozen UAs. At some point I may expand whitelisting across the board to see how well it works for me.

blend27

11:56 am on Mar 28, 2008 (gmt 0)

Isn't that from the PSI range (38.0.0.0/8) anyway?

Ocean10000

1:39 pm on Mar 28, 2008 (gmt 0)

Yes, it is part of that range, and that is one of the things that tripped my bot filter, which is why I even noticed it in the reports.
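
A filter that flags anything from that /8 could look roughly like the sketch below. It only illustrates the CIDR membership test with Python's standard `ipaddress` module; it is not the poster's actual bot filter, and the name `in_flagged_block` is hypothetical.

```python
import ipaddress

# The /8 mentioned in the thread; whether to flag the whole block is a policy choice.
FLAGGED_BLOCK = ipaddress.ip_network("38.0.0.0/8")


def in_flagged_block(ip: str) -> bool:
    """Return True if `ip` sits inside the flagged network (hypothetical helper)."""
    return ipaddress.ip_address(ip) in FLAGGED_BLOCK


if __name__ == "__main__":
    for candidate in ("38.103.63.17", "39.0.0.1"):
        print(candidate, in_flagged_block(candidate))
```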
