RufusBot

"Selects and repackages" web site pages?

         

bouncybunny

1:27 am on Mar 29, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The user agent is: RufusBot (Rufus Web Miner; [webaroo.com...]
The web page forwards to: [webaroo.com...]

How do people feel about this? As I understand it, this company 'selects' parts of web sites to 'package up' and then repurposes the content for people to view offline.

Sounds an awful lot like a breach of copyright to me. But is there any advantage to webmasters? Or should it be blocked?

wilderness

10:23 pm on Mar 29, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Web Miner"

Nothing further needs to be read!

bouncybunny

5:04 am on Mar 30, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



403'd with extreme prejudice. ;)

keyplyr

2:40 am on Mar 31, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My experience shows that this UA requests and obeys robots.txt.

GaryK

2:58 am on Mar 31, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"My experience shows that this UA requests and obeys robots.txt."

Yep, it's very polite while it's stealing your content. ;)

keyplyr

7:24 am on Mar 31, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The point being, it's not necessary to 403 :)

GaryK

2:46 pm on Mar 31, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I get a little odd when I'm under pressure like I am now to get three sites done at once. No offense was intended. I'm sure you know that but I just wanted to be sure. ;)

GaryK

11:38 pm on Apr 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've got RufusBot and a new one:

PiyushBot (Piyush Web Miner; [piyush.com...]

both crawling my site from the same IP Address at the same time. Both read robots.txt. Neither one obeyed it.

Another interesting note: if you click on the URL in the UA, you'll wind up at a page that looks like a bad spoof of a Network Solutions "website under construction" page. Maybe it's the real thing, but I really don't think so.

<snip IP address and whois lookup data>

[edited by: volatilegx at 6:53 pm (utc) on April 2, 2007]
[edit reason] removed identifying info [/edit]

keyplyr

4:39 am on Apr 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My proactive rewrite rule blocking Web.?Miner works to stop PiyushBot.

Thanks for the heads-up about RufusBot not obeying robots.txt. I've had it disallowed via robots.txt for a couple of years and it has never requested any other files. Guess it's headed for a life of crime.
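For anyone wanting to try the same approach, a rule along these lines should work. This is only a sketch, not the actual rule referred to above (which wasn't posted); it assumes mod_rewrite is enabled and that the lines go in .htaccess or the vhost config.

```apache
# Sketch only: assumes mod_rewrite is available
RewriteEngine On
# "Web.?Miner" matches "Web Miner", "WebMiner", "Web-Miner", etc. anywhere in the UA
RewriteCond %{HTTP_USER_AGENT} Web.?Miner [NC]
# Answer every matching request with 403 Forbidden
RewriteRule .* - [F]
```

The `.?` allows zero or one character between "Web" and "Miner", which is why the same pattern catches both "Rufus Web Miner" and "Piyush Web Miner" without naming either bot.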

GaryK

5:25 am on Apr 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think Dan was a bit extreme in terms of the identifying info he deleted. It was the standard stuff we all post. My main point was that Webaroo was specifically mentioned in the WhoIs data for both bots.

volatilegx

8:39 pm on Apr 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Gary, you're probably right. Sorry about that. It's easy to see it's linked to Webaroo by their own info. All I should have done was obfuscate the IP address.

GaryK

10:55 pm on Apr 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Dan. :)

PiyushBot (Piyush Web Miner;*)
RufusBot (Rufus Web Miner;*)
SumeetBot (Sumeet Bot; *)
WebarooBot (Webaroo Bot;*)

These all seem to have a relationship to Webaroo.

So just a heads-up for anyone using "Web.?Miner" to trap Webaroo. The last two in that list don't follow the same pattern.

wilderness

11:00 pm on Apr 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The last two in that list don't follow the same pattern.

They follow this pattern, Gary ;)

RewriteCond %{REMOTE_ADDR} ^64\.124\.122\.2(2[4-9]|[3-5][0-9])$ [OR]
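That line is one condition from a longer list (hence the trailing [OR]). As a standalone rule it might look like the sketch below, assuming mod_rewrite is enabled and that the whole block should simply be refused.

```apache
# Sketch only: assumes mod_rewrite is available
RewriteEngine On
# Last octet 224-229 or 230-255, i.e. the 64.124.122.224/27 block
RewriteCond %{REMOTE_ADDR} ^64\.124\.122\.2(2[4-9]|[3-5][0-9])$
# Answer every matching request with 403 Forbidden
RewriteRule .* - [F]
```

The pattern would also accept 256-259 in the last octet, but no real address has those values, so in practice it covers 224 through 255.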

incrediBILL

5:29 pm on Apr 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Today it's called pulseBot, so those of you using robots.txt are just wasting your time.

I've said it before and I'll say it again: robots.txt is for GOOD spiders, firewalls are for all the rest.

Here's my complete list of their bots' User-Agent strings:

64.124.122.228 "WebarooBot (Webaroo Bot; [64.124.122.252...]

64.124.122.228 "PiyushBot (Piyush Web Miner; [piyush.com...]

64.124.122.228 "RufusBot (Rufus Web Miner; [webaroo.com...]

64.124.122.228 "RufusBot (Rufus Web Miner; [64.124.122.252...]

64.124.122.228 "SumeetBot (Sumeet Bot; [64.124.122.252...]

64.124.122.228 "PsBot (PsBot; [64.124.122.252...]

64.124.122.228 "pulseBot (pulse Web Miner)"
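For those without firewall access, the names and the address range from this thread could be combined into one .htaccess rule. This is a rough sketch only, assuming mod_rewrite is enabled; the name list is simply taken from the strings above and will miss whatever the bot calls itself next, which is why the address condition is included.

```apache
# Sketch only: assumes mod_rewrite is available
RewriteEngine On
# Case-sensitive match on the names seen so far; "Web.?Miner" covers
# RufusBot, PiyushBot and pulseBot, the rest are listed explicitly
RewriteCond %{HTTP_USER_AGENT} (Web.?Miner|WebarooBot|SumeetBot|PsBot) [OR]
# ...or any request from 64.124.122.224 through 64.124.122.255
RewriteCond %{REMOTE_ADDR} ^64\.124\.122\.2(2[4-9]|[3-5][0-9])$
# Answer every matching request with 403 Forbidden
RewriteRule .* - [F]
```

A short token like PsBot may also match unrelated crawlers with similar names, so check your logs before relying on it.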

GaryK

6:25 pm on Apr 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Agreed Don. I have all of Webaroo banned at the firewall level. :)

Thanks for the last two, Bill. I hadn't seen them before.
