Hi, I have problems with scraper sites, and I am particularly worried about my latest site being scraped, as it has a database of close to a million entries. I was thinking of adding some honeypot links that point to a PHP file which writes the visitor's IP to my .htaccess file. The problem is that I am not sure I can keep Googlebot and other good spiders out of the honeypot 100% of the time. Robots.txt and nofollow are not 100% reliable, are they?
"A robotted page can still be indexed if linked to from other sites
While Google won't crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely)"
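Obviously, password-protecting the honeypot would also stop the scrapers from reaching it, so they would never get written to .htaccess, and noindex only takes effect after the page has already been crawled, by which point a good spider would already have hit the honeypot.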
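To make the idea concrete, here is a rough sketch of the honeypot logic with one extra safeguard: before banning an IP, check whether it is a verified good crawler. It is written in Python just to show the logic; the same steps should port to the PHP file. As far as I know, Google documents a reverse DNS lookup followed by a forward lookup as the way to verify Googlebot, which is what `is_verified_good_bot` attempts. The function names, the hostname suffix list, and the `Deny from` rule format (Apache 2.2 style, which assumes the file already has `Order allow,deny` and `Allow from all`) are my own assumptions, not a tested setup.

```python
import ipaddress
import socket

# Assumption: hostname suffixes used by Google's crawlers. Other search
# engines would need their own entries added here.
GOOD_BOT_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_good_bot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-confirm it.

    The lookups can be injected for testing. The defaults use the
    standard socket module, and the default forward lookup is IPv4 only.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOD_BOT_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP, otherwise the
        # reverse DNS record could simply be forged.
        return ip in forward_lookup(host)
    except OSError:
        # No reverse record or failed lookup: treat as not verified.
        return False


def deny_line(ip):
    """Build an Apache 2.2-style deny rule (assumed .htaccess format)."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    return f"Deny from {addr}\n"


def handle_honeypot_hit(ip, htaccess_path, is_good_bot=is_verified_good_bot):
    """Append a deny rule for the IP unless it is a verified good crawler.

    Returns True if a new rule was written, False otherwise.
    """
    if is_good_bot(ip):
        return False
    line = deny_line(ip)
    with open(htaccess_path, "a+", encoding="utf-8") as fh:
        fh.seek(0)
        content = fh.read()
        if line in content.splitlines(keepends=True):
            return False  # this IP is already banned
        if content and not content.endswith("\n"):
            fh.write("\n")  # keep the new rule on its own line
        fh.write(line)
    return True
```

The honeypot script would call `handle_honeypot_hit(visitor_ip, "/path/to/.htaccess")` on every request. This sketch does no file locking, so two simultaneous hits could write at the same time, and the DNS lookups add latency to each honeypot request.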
What do other people do?