Russian site has ripped off content

Forum Moderators: not2easy

stuartmcdonald

2:12 pm on Dec 6, 2008 (gmt 0)

A Russian website appears to be in the process of cloning a site of mine (absolutely without my permission!).

They're not running ads on it, and it looks like they've downloaded the site and are in the process of translating it.

One problem (of many) is that, as it stands, there's a lot of duplicate content and duplicate meta tags between my site and the parts they've ripped but not yet translated.

There are no contact details on the site, and the email address on the whois record gets no reply.

Suggestions on approaches?

pageoneresults

3:39 pm on Dec 6, 2008 (gmt 0)

Suggestions on approaches?

DMCA is probably the best option you have. Or, you may want to make a trip to Russia and drop in on them. Be sure to bring some friends. ;)

GaryK

3:51 pm on Dec 6, 2008 (gmt 0)

Not that it would be right, but if the site is translated to a different language will there still be a duplicate content problem?

Where is the site hosted?

jdMorgan

4:36 pm on Dec 6, 2008 (gmt 0)

1) Block the IP range they used to scrape your site.
2) Block the IP ranges of known proxies that could be used to scrape your site.
3) Block all accesses to your site from known scraper user-agents.
4) Block all accesses to your site which are 'defective' in some way, e.g. malformed user-agent, bad or missing HTTP request headers, invalid referrers, etc.
5) Install anti-scraping scripts to prevent scrapes that get past these first four hurdles.
6) Wait until they are 90% done translating your site, and then submit a DMCA claim to the top search engines.
7) Wait until they are removed from the top search engines and 100% done translating your site, and then issue a 'major upgrade' to your site's contents (They then face starting all over again).

An ounce of prevention is worth a pound of cure.

Jim
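
Steps 1 and 3 can be sketched in .htaccess roughly as follows. This is a minimal sketch, assuming mod_rewrite is available and allowed in .htaccess. 192.0.2.0/24 is a reserved documentation range standing in for whatever range shows up in your own logs, and ExampleScraper is a made-up token. HTTrack is a real site-copying tool whose user-agent normally contains that string.

```apache
RewriteEngine On

# Step 1: refuse the address range the scrape came from.
# 192.0.2.x is a placeholder; substitute the range seen in your logs.
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\. [OR]
# Step 3: refuse user-agents containing a known scraper token.
# "ExampleScraper" is hypothetical; build the list from your own logs.
RewriteCond %{HTTP_USER_AGENT} (HTTrack|ExampleScraper) [NC]
# Either condition matching returns 403 Forbidden for any URL.
RewriteRule ^ - [F]
```

Both checks are easy to evade by switching address or faking the user-agent, which is why the list above continues with further hurdles.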

purplecape

8:52 pm on Dec 6, 2008 (gmt 0)

If the content is translated, neither you nor they will face a duplicate content penalty--there won't be anything "duplicate" to trigger it (meta tags etc. won't trigger it).

stuartmcdonald

9:55 pm on Dec 6, 2008 (gmt 0)

Thanks for the replies.

They've only translated perhaps 50% of the content -- the remainder remains all in English -- including all the meta tags.

Site is hosted in Russia.

jdMorgan -- that's a great list. We're actually midway through a redesign ourselves -- would you say it's still worth filing a DMCA once we've launched the redesign?

Thanks again!

Rosalind

10:05 pm on Dec 6, 2008 (gmt 0)

Also, look into the Noarchive initiative. If you ban archive.org and prevent search engines from holding a cached copy of your content, it means any wrongdoers will have to come directly to your site to do their scraping. This means that if you successfully ban them they don't have anywhere else to go to download your content.
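
A rough sketch of both halves of this suggestion in .htaccess, assuming mod_headers and mod_rewrite are enabled. It also assumes the archive's crawler identifies itself with the token ia_archiver, as it has historically. A robots.txt rule or a noarchive robots meta tag is the more usual way to ask for the same thing, and the header only helps with engines that honor X-Robots-Tag.

```apache
# Ask compliant search engines not to offer a cached copy of any response.
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noarchive"
</IfModule>

# Refuse requests from the archive crawler (assumed token: ia_archiver).
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC]
    RewriteRule ^ - [F]
</IfModule>
```

Neither rule removes copies that were archived or cached before it was added.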

shadeofgray

10:00 am on Dec 8, 2008 (gmt 0)

They've only translated perhaps 50% of the content -- the remainder remains all in English -- including all the meta tags.

I recommend you submit a DMCA claim to the search engines now. Right now you have evidence that it is YOUR content - what will you have several months later, when they've finished translating? I doubt that Google or anyone else will try to compare Russian and English texts to see whether they match.

Angonasec

1:50 am on Dec 23, 2008 (gmt 0)

Jim:

Assuming Tabke has targeted you for a well-deserved, substantial, Christmas Bonus, how about a Christmas prezzie to us all?

Would you care to create a basic starter-pack sticky post to cover these bases?

Q/
2) Block the IP ranges of known proxies that could be used to scrape your site.
3) Block all accesses to your site from known scraper user-agents.
4) Block all accesses to your site which are 'defective' in some way, e.g. malformed user-agent, bad or missing HTTP request headers, invalid referrers, etc.
/Q

At the user's own risk, of course.

>And there is a risk.<

I find myself blocking more and more RU, PH, IN, and CN IP ranges, but I tripped up when I added Wget to my htaccess blocks and thereby inadvertently blocked Yahoo's feed bot, which uses Wget to gather our daily RSS feed.
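
One way to avoid that particular trap is to exempt the feed URL from the Wget block. A minimal sketch, assuming mod_rewrite; /rss.xml is a placeholder path, not the real feed location.

```apache
RewriteEngine On
# Block Wget everywhere except the feed itself.
RewriteCond %{HTTP_USER_AGENT} Wget [NC]
# /rss.xml is hypothetical; substitute the actual feed path.
RewriteCond %{REQUEST_URI} !^/rss\.xml$
RewriteRule ^ - [F]
```

The two conditions are ANDed, so any client claiming to be Wget can still fetch the feed, but nothing else.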

I also found this proxy-blocking htaccess code elsewhere on the web, but I haven't tried it in case it also blocks the major bots.

RewriteEngine on
RewriteCond %{HTTP:VIA} !^$ [OR]
RewriteCond %{HTTP:FORWARDED} !^$ [OR]
RewriteCond %{HTTP:USERAGENT_VIA} !^$ [OR]
RewriteCond %{HTTP:X_FORWARDED_FOR} !^$ [OR]
RewriteCond %{HTTP:PROXY_CONNECTION} !^$ [OR]
RewriteCond %{HTTP:XPROXY_CONNECTION} !^$ [OR]
RewriteCond %{HTTP:HTTP_PC_REMOTE_ADDR} !^$ [OR]
RewriteCond %{HTTP:HTTP_CLIENT_IP} !^$
RewriteRule ^(.*)$ - [F]

Does it get the jdM seal of approval?
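
One thing worth checking before trying it: as far as I know, %{HTTP:name} looks up the request header with exactly that name (case aside), so the underscored names above, such as X_FORWARDED_FOR and HTTP_CLIENT_IP, would probably never match the hyphenated headers that proxies actually send. A hedged sketch of the same idea with on-the-wire header names follows. The choice of headers is an assumption, not a vetted list.

```apache
RewriteEngine On
# Each condition is true when the named request header is present and non-empty.
RewriteCond %{HTTP:Via} !^$ [OR]
RewriteCond %{HTTP:Forwarded} !^$ [OR]
RewriteCond %{HTTP:X-Forwarded-For} !^$ [OR]
RewriteCond %{HTTP:Proxy-Connection} !^$ [OR]
RewriteCond %{HTTP:Client-IP} !^$
# Any one of them matching returns 403 Forbidden.
RewriteRule ^ - [F]
```

Corporate and ISP proxies send some of these headers too, so a rule like this may lock out legitimate visitors along with anonymous proxies.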

I haven't even begun to block defective calls yet, far too dangerous for a novice, but I do see them.

For example, take these suspicious hits from a US university. It visits daily with the same error.

We've never had a file called index.html (or index.anything) on our site; instead, we have this in htaccess:

DirectoryIndex examplefilename.htm

But this visitor never picks up on that, it just keeps doing this daily...

Access log:
134.#*$!.xx.xx - - [20/Dec/2008:12:10:29 -0500] "GET /index.html HTTP/1.1" 404 641 "-" "-"

Error log:
[Sat Dec 20 12:10:29 2008] [error] [client 134.#*$!.xx.xx] File does not exist: /path/www/index.html

Access:
134.#*$!.xx.xx - - [21/Dec/2008:08:58:43 -0500] "GET / HTTP/1.1" 400 393 "-" "-"

Error:
[Sun Dec 21 08:58:43 2008] [error] [client 134.#*$!.xx.xx] client sent HTTP/1.1 request without hostname (see RFC2616 section 14.23): /
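
Both hits above carry a blank user-agent (the final "-" in the access log), which is one of the simpler 'defective request' signals to act on. A minimal sketch, assuming mod_rewrite. The request without a Host header is already rejected by the server with a 400, as the second log entry shows, so only the blank user-agent is handled here.

```apache
RewriteEngine On
# The access log prints "-" for an absent header; mod_rewrite sees an empty string.
# The pattern matches an empty value or a literal "-".
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule ^ - [F]
```

Some legitimate monitoring tools and feed fetchers send no user-agent either, so it is worth checking the logs for such visitors before enabling this.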

Finally, concerning:

Q/
5) Install anti-scraping scripts to prevent scrapes that get past these first four hurdles.
/Q

I'd like to install the Perl script you recommend, but since it involves the use of a one-pixel image trap, I'm wary that it will raise a flag with Google, so I haven't done so. Our site still hasn't recovered from "Florida", so I don't want to send bad signals. Am I being too cautious?