Just saw this bot come by today. It obeyed robots.txt: it fetched the file, read it, and left without a problem. The link in the User-Agent doesn't work, but they do have a FAQ page that covers the basics. They currently claim to be seeking 501(c)(3) status as a non-profit, the "CommonCrawl Foundation". They use the same line about creating a new wave of search engine but do not state anything beyond that. After reading their FAQ I found that this web crawler is based on the Nutch crawler (this is noted in the FAQ).
The current IP range listed on the website for this bot is:
38.103.63.16 through 38.103.63.18
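For anyone who wants to match on that range rather than on the UA, here is a minimal Python sketch using the standard `ipaddress` module. The function name `in_listed_range` is made up for illustration, and the range is simply the one quoted above, which may well change.

```python
import ipaddress

# Range as quoted in the post above; treat it as a snapshot, not a guarantee.
RANGE_START = ipaddress.IPv4Address("38.103.63.16")
RANGE_END = ipaddress.IPv4Address("38.103.63.18")


def in_listed_range(remote_addr: str) -> bool:
    """Return True if remote_addr lies inside the listed range (inclusive)."""
    try:
        ip = ipaddress.IPv4Address(remote_addr)
    except ValueError:
        # Not a valid IPv4 address, so it cannot be in the range.
        return False
    return RANGE_START <= ip <= RANGE_END
```

Something like `in_listed_range(request_ip)` could then feed whatever logging or blocking you already have in place.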
IMO there are just way too many of these, most of which disobey robots.txt. If these companies end up succeeding with their business models and they also contribute to mine, then I'll reevaluate them on a case-by-case basis once they start using their own unique UA.
I just ban "nutch" across the board.
That won't help you in this case since it uses "ccbot" as its robot name.
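To illustrate the point, here is a rough Python sketch of a substring blacklist. The UA strings below are hypothetical placeholders (the real CCBot UA was not posted in this thread), and `is_blocked` is just an illustrative name.

```python
BLOCKED_TOKENS = ("nutch",)


def is_blocked(user_agent: str, tokens=BLOCKED_TOKENS) -> bool:
    """Return True if any blocked token appears in the UA (case-insensitive)."""
    ua = user_agent.lower()
    return any(tok in ua for tok in tokens)


# Hypothetical UA strings, for illustration only.
nutch_ua = "ExampleCrawler/Nutch-1.0"
ccbot_ua = "CCBot/1.0 (+http://example.com/faq)"

caught_nutch = is_blocked(nutch_ua)                          # matches "nutch"
caught_ccbot = is_blocked(ccbot_ua)                          # no "nutch" in the UA
caught_after_fix = is_blocked(ccbot_ua, ("nutch", "ccbot"))  # token added
```

A Nutch-based crawler that renames itself slips past the "nutch" rule until its own token is added to the list.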
Now you know why I whitelist ;)
OK, actually it was the crawlers using gibberish strings like "qwewyeowueouweoiu wieuwoie" that caused me to whitelist, but the nutch and heritrix variations would've pushed me in the same direction eventually, so the end result would've been the same.
I just ban "nutch" across the board. That won't help you in this case since it uses "ccbot" as its robot name.
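A whitelist flips the logic: anything not explicitly allowed gets refused, so renamed or gibberish UAs fail by default. Below is a minimal Python sketch. The allowed tokens are only example values, not anyone's actual list, and `is_allowed` is a made-up name; a real setup would also have to admit ordinary browsers and deal with spoofed UAs.

```python
# Example tokens only; a real whitelist would be tuned per site.
ALLOWED_TOKENS = ("googlebot", "bingbot")


def is_allowed(user_agent: str, allowed=ALLOWED_TOKENS) -> bool:
    """Return True only if the UA contains one of the allowed tokens."""
    ua = user_agent.lower()
    return any(tok in ua for tok in allowed)


# The gibberish UA quoted above is refused without needing a rule of its own.
gibberish_ok = is_allowed("qwewyeowueouweoiu wieuwoie")
# Hypothetical UA containing an allowed token.
known_ok = is_allowed("Mozilla/5.0 (compatible; Googlebot/2.1)")
```

The trade-off is maintenance: every new legitimate crawler has to be added by hand before it gets through.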
Well, I didn't know the UA string since the OP didn't post it. The thread title is:
CCBot
another Nutch Variant
I agree that whitelisting is the way to go for some sites. I do it on a small scale for a dozen UAs. At some point I may expand whitelisting across the board to see how well it works for me.