They're not running ads on it, and it looks like they've downloaded the site and are in the process of translating it.
One problem (of many) is that, as it stands, there's a lot of duplicate content and duplicate meta tags between my site and the pages they've ripped but not yet translated.
There are no contact details on the site, and the email address on the whois record gets no reply.
Suggestions on approaches?
An ounce of prevention is worth a pound of cure.
Jim
They've only translated perhaps 50% of the content -- the rest is still in English, including all the meta tags.
Site is hosted in Russia.
JdMorgan -- that's a great list. We're actually midway through a redesign ourselves -- would you say it's still worth filing a DMCA once we've launched the redesign?
Thanks again!
Assuming Tabke has targeted you for a well-deserved, substantial, Christmas Bonus, how about a Christmas prezzie to us all?
Would you care to create a basic starter-pack sticky post to cover these bases?
Q/
2) Block the IP ranges of known proxies that could be used to scrape your site.
3) Block all accesses to your site from known scraper user-agents.
4) Block all accesses to your site which are 'defective' in some way, e.g. malformed user-agent, bad or missing HTTP request headers, invalid referrers, etc.
/Q
At the user's own risk, of course.
>And there is a risk.<
I find myself blocking more and more RU, PH, IN, and CN IP ranges, but I tripped up when I added Wget to my .htaccess blocks and inadvertently blocked Yahoo's feed bot, which uses Wget to fetch our daily RSS feed.
I also found this proxy-blocking .htaccess code elsewhere on the web, but I haven't tried it in case it also blocks the major bots.
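For what it's worth, one way around that kind of collateral damage is to exempt the feed URL from the Wget block. A minimal sketch, assuming mod_rewrite is available and the feed lives at /rss.xml (a placeholder path, not one taken from this thread):

```apache
RewriteEngine On
# Refuse requests whose User-Agent starts with "Wget" (case-insensitive)...
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC]
# ...unless the request is for the feed. /rss.xml is a hypothetical path.
RewriteCond %{REQUEST_URI} !^/rss\.xml$
RewriteRule ^ - [F]
```

The two conditions are ANDed, so Wget can still fetch the feed but nothing else. Anyone can fake a user-agent, so this only stops the lazy scrapers.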
RewriteEngine on
RewriteCond %{HTTP:VIA} !^$ [OR]
RewriteCond %{HTTP:FORWARDED} !^$ [OR]
RewriteCond %{HTTP:USERAGENT_VIA} !^$ [OR]
RewriteCond %{HTTP:X_FORWARDED_FOR} !^$ [OR]
RewriteCond %{HTTP:PROXY_CONNECTION} !^$ [OR]
RewriteCond %{HTTP:XPROXY_CONNECTION} !^$ [OR]
RewriteCond %{HTTP:HTTP_PC_REMOTE_ADDR} !^$ [OR]
RewriteCond %{HTTP:HTTP_CLIENT_IP} !^$
RewriteRule ^(.*)$ - [F]
Does it get the jdM seal of approval?
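One thing to check before trying it: %{HTTP:name} looks up the request header by the name the client actually sends, and those names use hyphens. Spellings with underscores, or with a leading HTTP_, are the CGI environment-variable forms, so several of the conditions above would probably never match. A hedged sketch of the same idea with on-the-wire header names (a subset of the original list, untested here, and it will also refuse real visitors who sit behind ISP or corporate proxies):

```apache
RewriteEngine On
# Any one of these headers being non-empty suggests the request was proxied.
RewriteCond %{HTTP:Via} !^$ [OR]
RewriteCond %{HTTP:Forwarded} !^$ [OR]
RewriteCond %{HTTP:X-Forwarded-For} !^$ [OR]
RewriteCond %{HTTP:Proxy-Connection} !^$ [OR]
RewriteCond %{HTTP:Client-IP} !^$
# Answer 403 Forbidden for every URL when a condition matched.
RewriteRule ^ - [F]
```

As far as I know the major search engine crawlers connect directly rather than through a proxy, but check your own logs for these headers before relying on that.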
I haven't even begun to block defective requests yet (far too dangerous for a novice), but I do see them.
For example, these suspicious hits come from a US university. It visits daily with the same error.
We've never had a file called index.html (or index.anything) on our site; instead, we have this in .htaccess:
DirectoryIndex examplefilename.htm
But this visitor never picks up on that; it just keeps doing this daily...
Access log:
134.#*$!.xx.xx - - [20/Dec/2008:12:10:29 -0500] "GET /index.html HTTP/1.1" 404 641 "-" "-"
Error log:
[Sat Dec 20 12:10:29 2008] [error] [client 134.#*$!.xx.xx] File does not exist: /path/www/index.html
Access:
134.#*$!.xx.xx - - [21/Dec/2008:08:58:43 -0500] "GET / HTTP/1.1" 400 393 "-" "-"
Error:
[Sun Dec 21 08:58:43 2008] [error] [client 134.#*$!.xx.xx] client sent HTTP/1.1 request without hostname (see RFC2616 section 14.23): /
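Both sample requests arrive with no referrer and no user-agent (the two "-" fields at the end of each access-log line), and the second was already refused by Apache itself with a 400 because it had no Host header. For anyone who does want to start on point 4, a blank user-agent is about the simplest test. A minimal sketch, assuming mod_rewrite in .htaccess:

```apache
RewriteEngine On
# Match a User-Agent header that is missing, empty, or a literal "-".
RewriteCond %{HTTP_USER_AGENT} ^-?$
# Answer 403 Forbidden for any such request.
RewriteRule ^ - [F]
```

Some legitimate tools send no user-agent either, so it's worth watching the logs for a while after adding it.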
Finally, concerning:
Q/
5) Install anti-scraping scripts to prevent scrapes that get past these first four hurdles.
/Q
I'd like to install the Perl script you recommend, but since it involves the use of a one-pixel image trap, I'm wary that it will raise a flag with Google, so I haven't done so. Our site still hasn't recovered from the "Florida" update, so I don't want to send bad signals. Am I being too cautious?