Spider Tracking Links - Examining 2 Methods - Crawler, Spider, and User Agent ID forum at WebmasterWorld

Assumptions:

- Allowed spiders are whitelisted, and whitelisted spiders are never shown the crawler IDs

- Everything else that crawls the site and isn't whitelisted is shown a crawler ID (CRID) for tracking purposes, visitors and bots alike. You can't tell who's who until it shows up indexed.

LINK POISON METHOD #1 - THE PARAMETER

The first method of making poison links is to just add an identifying parameter to the link, like "?crid=1410579483", where the CRID is the tracking ID of the crawler. Typically the long IP (the IPv4 address written as a single 32-bit integer) is sufficient, or a database ID if you're using a database. The extra parameter has zero impact on any page, since nothing on the site expects that parameter to exist or uses it for anything else.
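A minimal sketch of the parameter method, assuming the CRID is simply the visitor's IPv4 address as a long integer. The function names `ip_to_crid` and `add_crid_param` are made up for illustration, and the sample address is just the dotted form of the CRID used above.

```python
import ipaddress
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def ip_to_crid(ip: str) -> int:
    """Return the 'long IP' form of an IPv4 address (assumed CRID scheme)."""
    return int(ipaddress.IPv4Address(ip))


def add_crid_param(url: str, ip: str) -> str:
    """Append ?crid=<long IP> to a URL, replacing any crid already present."""
    parts = urlsplit(url)
    # Keep existing parameters, but drop a stale crid so it is not duplicated.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "crid"]
    query.append(("crid", str(ip_to_crid(ip))))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(add_crid_param("http://example.com/somepage.html", "84.19.188.27"))
```

Dropping any crid that is already on the link means a visitor who arrives with someone else's CRID gets fresh links tagged with their own IP.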

When you see the CRID on the path, you know it's not a real SE coming and going. The reason I never did this kind of link poison, other than a dedicated spider trap link or two, is that once the search engines get hold of those links from 3rd party sites they never let go and keep trying to crawl them.

Why? It's that new parameter: they think each one is a new page too. Not good.

Now Google Webmaster Tools (WMT) allows you to ignore those parameters, and combined with the canonical tag I believe this method can be safely done, at least in Google, without creating a bazillion new pages that the crawler never discards and attempts to crawl for eternity.

Somewhat manageable but too easy to get yourself in trouble with the good SEs if you aren't careful, probably should be avoided.

LINK POISON METHOD #2 - THE PATH

This method is similar to the first method but can be used in combination with robots.txt, which is pretty universally implemented, to stop crawling of the poison pages.

How this would work is to make those poison links part of the path, not a parameter, then block the base path in robots.txt and the good spiders will never have a problem with them after that.

For instance, when a whitelisted spider crawls:

example.com/somepage.html

When something else accesses the site:

example.com/crid/1410579483/somepage.html

The /crid/ (crawler ID) path can be filtered out in a rewrite rule and blocked via robots.txt, which would leave the site virtually untouched.
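Here's a rough sketch of the filtering step, written as application code rather than an actual server rewrite rule. The name `split_crid_path` and the numeric-only CRID pattern are assumptions for illustration; the robots.txt text is the usual Disallow form for a base path.

```python
import re

# Matches /crid/<digits>/rest-of-path (CRID assumed to be numeric).
CRID_PREFIX = re.compile(r"^/crid/(\d+)(/.*)?$")

# What robots.txt would need so the good spiders stay out of the poison paths.
ROBOTS_TXT = "User-agent: *\nDisallow: /crid/\n"


def split_crid_path(path: str):
    """Return (crid, real_path); crid is None when the path is not poisoned."""
    match = CRID_PREFIX.match(path)
    if match is None:
        return None, path
    # A bare /crid/<id> with nothing after it maps to the site root.
    return int(match.group(1)), match.group(2) or "/"


if __name__ == "__main__":
    print(split_crid_path("/crid/1410579483/somepage.html"))
    print(split_crid_path("/somepage.html"))
```

The site serves `real_path` as usual and logs the CRID next to the requesting IP, so the pages themselves never need to know the prefix existed.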

This is so simple you could easily post-process any page being displayed and turn every link on the site into a spider trap link whenever the visitor isn't a whitelisted spider. Another way to do it would be to simply insert a <base href="/crid/1410579483/"> into all the pages, but that assumes scrapers implement base href processing, which isn't likely. Any scraper that ignores it would resolve the links against the original path, without the CRID inserted, so let's forget base href!
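The post-processing idea could look something like this: a regex-based sketch, not a full HTML parser. It only touches root-relative links in anchor tags, and `poison_links` along with its `whitelisted` flag are hypothetical names.

```python
import re

# Root-relative hrefs inside <a> tags. Protocol-relative links (//host/...)
# and links that already carry a /crid/ prefix are left alone.
HREF = re.compile(
    r'''(<a\b[^>]*?\bhref\s*=\s*)(["'])(/(?!/|crid/)[^"']*)\2''',
    re.IGNORECASE,
)


def poison_links(html: str, crid: int, whitelisted: bool = False) -> str:
    """Prefix every root-relative link with /crid/<crid> for non-whitelisted visitors."""
    if whitelisted:
        return html  # whitelisted spiders always get the clean page

    def repl(match):
        prefix, quote, path = match.group(1), match.group(2), match.group(3)
        return f"{prefix}{quote}/crid/{crid}{path}{quote}"

    return HREF.sub(repl, html)


if __name__ == "__main__":
    page = '<a href="/somepage.html">page</a> <a href="http://other.example/">out</a>'
    print(poison_links(page, 1410579483))
```

Page-relative links like "somepage.html" need no rewriting, since the browser or bot resolves them against the poisoned path it is already on.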

Bet Google would call it 'cloaking' and slap me down with a penalty.

ADDITIONAL NOTES AND CONSIDERATIONS

When the poison links are in effect, and assuming the whitelisted bots are being properly filtered out, the first time an IP accesses the site in a day using a previously delivered CRID in the path you know it's either

a) a returning direct traffic human, possibly from a bookmarked link or...

b) a returning bot using the poison path to revisit the site again and continue the crawl

Note the advantage here: we can easily tell that the traffic didn't come from a legit SE, since the CRID carries the IP the link was originally delivered to. Additionally, assuming method #2 with the CRID in the path, the SE should never display or request the page with the CRID because we're blocking it via robots.txt in the first place. That means all traffic with a CRID should theoretically be suspect traffic only, with a few real visitors mixed in, as we know they do exist between all the bots ;)

To make things more interesting, compare the current IP with the CRID being passed.

Having a mismatched CRID and IP could be a valid reason to throw a captcha on the first page access.

This will also help identify IP pools used by humans (if used in conjunction with a captcha to validate), and help unravel bots' IP pools when they are spread out across clouds, multiple hosts, proxies, etc., hopping from IP to IP to avoid being caught.

Obviously known IP pools would need to be filtered out of the captcha flagging so as not to annoy those AOLers, mobile users, and some cable customers, but it's a pretty interesting way to start trapping some of the harder to catch scrapers IMO.
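The CRID-versus-IP comparison, with the known-pool exemption, might be sketched like this. It assumes the CRID is the long IP rather than a database ID, and `should_captcha` and `shared_pools` are illustrative names; the addresses in the demo are sample and documentation ranges, not real pools.

```python
import ipaddress


def should_captcha(crid: int, remote_ip: str, shared_pools=()) -> bool:
    """Flag a request whose CRID (long IP) differs from the IP now using it.

    shared_pools is a list of CIDR strings for known rotating pools
    (big ISPs, mobile carriers) that should never trigger the captcha.
    """
    addr = ipaddress.ip_address(remote_ip)
    # Visitors on known shared pools change IP legitimately, so skip them.
    if any(addr in ipaddress.ip_network(net) for net in shared_pools):
        return False
    return int(addr) != crid


if __name__ == "__main__":
    crid = 1410579483  # long form of 84.19.188.27
    print(should_captcha(crid, "84.19.188.27"))
    print(should_captcha(crid, "203.0.113.9"))
    print(should_captcha(crid, "203.0.113.9", ["203.0.113.0/24"]))
```

Logging each mismatched (CRID IP, current IP) pair along with the captcha result is what would let you start grouping addresses into human pools and bot pools.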
