Spider Tracking Links - Examining 2 Methods

         

incrediBILL

12:00 am on Nov 7, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Assumptions:

- Allowed spiders are whitelisted, and whitelisted spiders are never shown crawler IDs.

- Everything else that crawls the site, visitors and bots alike, is shown a crawler ID (CRID) for tracking purposes, since you can't tell who's who until a CRID shows up indexed somewhere.

LINK POISON METHOD #1 - THE PARAMETER

The first method of making poison links is to add an identifying parameter to the link, like "?crid=1410579483", where the CRID is the tracking ID of the crawler. Typically the IP address converted to a long integer is sufficient, or a database ID if you're using a database. The extra parameter has zero impact on any page, since nothing expects that parameter to exist or uses it for anything else.
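
A minimal sketch of the parameter method, assuming IPv4 visitors and that the CRID is simply the visitor's IP as a long integer. The helper names (ip_to_crid, crid_to_ip, add_crid_param) and the documentation-range address are illustrative assumptions, not part of any existing tool.

```python
import ipaddress
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def ip_to_crid(ip: str) -> int:
    """Return the IPv4 address as a long integer (hypothetical CRID scheme)."""
    return int(ipaddress.IPv4Address(ip))


def crid_to_ip(crid: int) -> str:
    """Recover the dotted-quad IP that a CRID was issued to."""
    return str(ipaddress.IPv4Address(crid))


def add_crid_param(url: str, crid: int) -> str:
    """Append (or overwrite) the crid query parameter on a link."""
    parts = urlsplit(url)
    # Drop any CRID already on the link so each visitor gets their own.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "crid"]
    query.append(("crid", str(crid)))
    return urlunsplit(parts._replace(query=urlencode(query)))


crid = ip_to_crid("203.0.113.7")  # documentation-range IP, for illustration only
print(add_crid_param("/somepage.html", crid))
print(crid_to_ip(crid))
```

Because the CRID is reversible, a log line containing it can be mapped straight back to the IP that was originally handed the link.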

When you see the CRID in the URL, you know it's not a real SE coming and going. The reason I never did this kind of link poison, other than a dedicated spider trap link or two, is that once the search engines get hold of those links from 3rd party sites they never let go and keep trying to crawl them.

Why? It's that new parameter: they think each variation is a new page too. Not good.

Google Webmaster Tools (WMT) now lets you tell Google to ignore such parameters. Combined with the canonical tag, I believe this method can be done safely, at least in Google, without creating a bazillion new pages that the crawler never discards and attempts to crawl for eternity.
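
As a rough illustration of the canonical-tag half of that idea, the sketch below strips the crid parameter from the requested URL and emits a canonical link pointing at the clean page. canonical_url and canonical_tag are hypothetical helper names; the parameter-handling setting in Webmaster Tools is configured separately and is not shown here.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def canonical_url(url: str, param: str = "crid") -> str:
    """Return the URL with the tracking parameter removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != param]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def canonical_tag(url: str) -> str:
    """Build the <link rel="canonical"> element for the page head."""
    return '<link rel="canonical" href="%s">' % escape(canonical_url(url), quote=True)


print(canonical_tag("http://example.com/somepage.html?crid=1410579483"))
```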

Somewhat manageable, but it's too easy to get yourself in trouble with the good SEs if you aren't careful, so it probably should be avoided.

LINK POISON METHOD #2 - THE PATH

This method is similar to the first method but can be used in combination with robots.txt, which is pretty universally implemented, to stop crawling of the poison pages.

How this would work is to make those poison links part of the path, not a parameter, then block the base path in robots.txt and the good spiders will never have a problem with them after that.

For instance, a whitelisted spider is given links like:

example.com/somepage.html

Anything else that accesses the site is given:

example.com/crid/1410579483/somepage.html

The /crid/ (crawler ID) path can be filtered out in a rewrite rule and blocked via robots.txt, which would leave the site virtually untouched.
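
A small sketch of the path method, under the assumption that the CRID is numeric and always sits directly under /crid/. poison_path and strip_crid are hypothetical names; strip_crid stands in for what a server rewrite rule would do, and the robots.txt body is checked with Python's standard robots.txt parser.

```python
import re
from urllib.robotparser import RobotFileParser

# /crid/<digits>/<rest of the real path>
CRID_PREFIX = re.compile(r"^/crid/(\d+)(/.*)$")

# robots.txt body that keeps well-behaved spiders out of every poison path.
ROBOTS_TXT = "User-agent: *\nDisallow: /crid/\n"


def poison_path(path: str, crid: int) -> str:
    """Prefix a root-relative path with /crid/<id>."""
    return "/crid/%d%s" % (crid, path)


def strip_crid(path: str):
    """Mimic the rewrite rule: return (crid or None, real path to serve)."""
    match = CRID_PREFIX.match(path)
    if match:
        return int(match.group(1)), match.group(2)
    return None, path


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

print(poison_path("/somepage.html", 1410579483))
print(strip_crid("/crid/1410579483/somepage.html"))
print(robots.can_fetch("*", "/crid/1410579483/somepage.html"))  # blocked for compliant bots
print(robots.can_fetch("*", "/somepage.html"))                  # clean path stays crawlable
```

The server still serves the real page for the poison path; only the log entry and any downstream checks see the CRID.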

This is so simple you could easily post-process any page being displayed and make every link on the site a spider trap link whenever the visitor isn't a whitelisted spider. Another way to do it would be to simply insert a <base href="/crid/1410579483/"> into all the pages, but that assumes scrapers implement base href processing, which isn't likely. A scraper that ignores it would end up with the original paths without the CRID inserted, so let's forget base href!
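
The post-processing step could look roughly like this. It is a sketch only: it assumes links are root-relative and use double-quoted href attributes, and poison_links is a hypothetical name. A production version would more likely use a real HTML parser than a regular expression.

```python
import re

# Matches href="/..." for root-relative links only; skips "//host" URLs,
# absolute URLs, and links that already carry a /crid/ prefix.
ROOT_RELATIVE_HREF = re.compile(r'href="(/(?!/|crid/)[^"]*)"')


def poison_links(html: str, crid: int, whitelisted: bool) -> str:
    """Rewrite every root-relative link to carry the visitor's CRID."""
    if whitelisted:
        return html  # whitelisted spiders always see clean links
    return ROOT_RELATIVE_HREF.sub(
        lambda m: 'href="/crid/%d%s"' % (crid, m.group(1)), html
    )


page = '<a href="/somepage.html">Some page</a> <a href="http://example.org/">Elsewhere</a>'
print(poison_links(page, 1410579483, whitelisted=False))
print(poison_links(page, 1410579483, whitelisted=True))
```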

Bet Google would call it 'cloaking' and slap me down with a penalty.

ADDITIONAL NOTES AND CONSIDERATIONS

When the poison links are in effect, and assuming the whitelisted bots are being properly filtered out, the first time an IP accesses the site in a day using a previously delivered CRID in the path you know it's either

a) a returning direct traffic human, possibly from a bookmarked link or...

b) a returning bot using the poison path to revisit the site again and continue the crawl

Note the advantage here: we can easily tell that the traffic didn't come from a legit SE, since the CRID gives the IP to which the link was originally delivered. Additionally, assuming method #2 with the CRID in paths, the SE should never display or request a page with the CRID because we're blocking it via robots.txt in the first place. That means all traffic with a CRID should theoretically be suspect traffic only, plus a few real visitors, as we know they do exist between all the bots ;)

To make things more interesting, compare the current IP with the CRID being passed.

Having a mismatched CRID and IP could be a valid reason to throw a captcha on the first page access.

This will also help identify IP pools used by humans (if used in conjunction with a captcha to validate), and help unravel bots' IP pools when they are spread out across clouds, multiple hosts, proxies, etc., hopping from IP to IP to avoid being caught.

Obviously known IP pools would need to be filtered out of the captcha flagging so as not to annoy those AOLers, mobile users, and some cable customers, but it's a pretty interesting way to start trapping some of the harder to catch scrapers IMO.
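
The mismatch check can be sketched as follows, assuming the CRID is the IPv4 address as a long integer. KNOWN_POOLS is a made-up placeholder using documentation ranges (real pools would come from your own data), and needs_captcha is a hypothetical name.

```python
import ipaddress

# Hypothetical shared pools (proxies, mobile gateways) exempt from captcha flagging.
KNOWN_POOLS = [ipaddress.ip_network("198.51.100.0/24")]


def in_known_pool(ip: str) -> bool:
    """True when the address belongs to one of the exempt pools."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in KNOWN_POOLS)


def needs_captcha(request_ip: str, crid: int) -> bool:
    """True when the CRID was issued to a different IP outside the known pools."""
    try:
        issued_to = str(ipaddress.IPv4Address(crid))
    except ValueError:
        return True  # CRID is not a valid IPv4 long: treat as suspect
    if issued_to == request_ip:
        return False  # same IP came back with its own link
    return not in_known_pool(request_ip)


crid = int(ipaddress.IPv4Address("203.0.113.7"))
print(needs_captcha("203.0.113.7", crid))    # same IP, no captcha
print(needs_captcha("192.0.2.55", crid))     # different IP, captcha
print(needs_captcha("198.51.100.20", crid))  # different IP but in a known pool
```

Logging each (issued IP, requesting IP) pair alongside the captcha outcome is what would let you start grouping addresses into human pools and bot pools over time.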

tangor

1:39 am on Nov 10, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've been using your white list philosophy for several years, Incredibill, and this expansive post merely confirms the validity and provides a support method for dealing with the bad actors. Thanks!

topr8

6:36 pm on Nov 10, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



... additional consideration.

users who choose to link to you, will link using a page that will never be indexed by legit bots ... so incoming link benefit is lost.

incrediBILL

7:02 pm on Nov 10, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



users who choose to link to you, will link using a page that will never be indexed by legit bots ... so incoming link benefit is lost.


See, that's why I posted this, I knew someone would catch something I overlooked!

I could obviously risk missing a few bots and whitelist big ISPs where the residential IPs are easy to determine, like Comcast, Road Runner, etc.

FWIW, I have thousands of IBLs at the moment and could probably run links like this for a year without any really bad implications. Another way to tackle that linking problem would be to provide a "Link to this page" option on each page which provides only clean links.

... or, if it's Googlebot, the possible solution could be just let it in and use the canonical tag to fix the linking problem.

blend27

8:09 pm on Nov 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Bill, I set this one up a while back on one of the sites: bots whitelisted by IP ranges + rDNS, with server farm IP ranges locked out by default.

this is in CF + JQuery

CF Part = set prefix for all links generated on the page.

1. <cfset spiderPath = 'search_engine_spiders/' & Replace( cgi.remote_host,'.','','all' ) & '/'>

All links generated like this:

<a href="#spiderPath#link1.html">Link1</a>


The above code generates a normal link that points to:

search_engine_spiders/123123123123/link1.html. All links are generated like this on the first 2-3 pages.


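jQuery part (although it could be written in plain JS)

2. <script>
// Once the DOM is ready, strip the spider-trap prefix from every link.
// #spiderPath# is filled in server-side by CF before the page is sent.
$(function() {
  // Only touch anchors that actually have an href, so .replace() never runs on undefined.
  $('a[href]').each(function() {
    $(this).attr('href', $(this).attr('href').replace('#spiderPath#', ''));
  });
});
</script>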


jQuery then strips 'search_engine_spiders/123123123123/' from all links client side.

The link bookmarks as a normal link :)

search_engine_spiders/ is disallowed in robots.txt, and anything that hits that virtual directory (handled in .htaccess) gets a captcha = 2 strikes and you are out for a day.
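
One possible reading of the "2 strikes" rule, sketched with in-memory dictionaries. The counters, the hit_trap name, and the exact policy (first trap hit gets a captcha, second gets a 24-hour ban) are assumptions for illustration; the post does not spell out the actual implementation.

```python
import time

STRIKE_LIMIT = 2             # "2 strikes"
BAN_SECONDS = 24 * 60 * 60   # "out for a day"

strikes = {}        # ip -> number of trap hits so far
banned_until = {}   # ip -> unix time when the ban expires


def hit_trap(ip, now=None):
    """Record a request under the trap directory; return 'captcha' or 'banned'."""
    now = time.time() if now is None else now
    if banned_until.get(ip, 0) > now:
        return "banned"
    strikes[ip] = strikes.get(ip, 0) + 1
    if strikes[ip] >= STRIKE_LIMIT:
        banned_until[ip] = now + BAN_SECONDS
        strikes[ip] = 0  # start counting afresh once the ban lapses
        return "banned"
    return "captcha"


print(hit_trap("192.0.2.55", now=0))                 # first hit
print(hit_trap("192.0.2.55", now=10))                # second hit
print(hit_trap("192.0.2.55", now=BAN_SECONDS + 11))  # after the ban has expired
```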


For All browsers that run JS it will ACT as normal: URL = link1.html


Works like a charm on Scrapers.

P.S. I really don't care about visitors without JS enabled at this point, after running A/B tests for the past decade on my sites and being in a niche that caters to the 99.99% of users that have it enabled, at least for the root URL. I don't really care about prefetch either, since proper headers are sent by the server to the user's browser from the get-go.

dstiles

10:30 pm on Nov 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I agree about prefetch BUT...

I browse with JS off and so do many of my clients. It's a standard setting in Firefox+NoScript and NoScript is installed in a significant percentage of browsers now. The option to allow JS via NoScript is easy so it's no problem browsing with JS off for 99.9% of sites. Even good shopping sites should allow purchase with JS off - although many of them can't be bothered.

keyplyr

11:33 pm on Nov 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Even good shopping sites should allow purchase with JS off

Not from a retailer's perspective. They (we) see JS as a fast, cache-able and effective tool in product presentation and functionality.

Only knowledgeable techies *may* see JS as something with potential for exploitation, and are thus selective when browsing.

I use JS everywhere on my sites. I do display a reminder when browser support is not detected.

blend27

12:56 am on Nov 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's a standard setting in Firefox+NoScript and NoScript is installed in a significant percentage of browsers now.


Same thing here: exclusively FF / NoScript / Better Privacy, with a 1 minute interval to nuke local .SOL files, as the default browser. Everyone I know gets a 3 minute "walk-thru" on how to use it, and everybody says WOW.

BUT. FF has a setting to allow cookies from 1st party only, and NoScript allows 1st party JS with no problem (not by default). The user must see a reminder that the content is visible with JS ON.

Too much techy stuff?... On my work machine, GOOG won't show basic search results without "Allow google.com"; the search result is there with display:none. On my home machines Hulu sometimes won't run without JS and Flash, nor will Netflix (no Silverlight = popcorn goes bad); same on XFINITY (comcast/hulu). I know how to get around it, but a normal user won't bother with that.

That is from the Client side A/B, from what I see for a few years now. I truly DO NOT CARE on my ECOM Sites for the techie that does not trust my site to display best possible content/experience.

For some reason every time we get to speak about preventing scrapers from doing their doodoo, "JS Enabled" comes up....

Sites that load content via JS, sites that serve content to REAL users via JS/AJAX the proper way, with a bit of cloaking, DO rank, with no problems, if coded properly with the PROPER user in mind.

Just my 2 cents.

dstiles

8:26 pm on Nov 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Keyplyr - I disagree but let's agree to. :)

I design my ecommerce sites so they validate correctly with or without JS and suggest it be enabled only where it does the customer any good. Ditto cookies, which are session only.

Blend27 - I don't use google and disable everything about it (robtex now uses google as its engine and it's becoming annoying - I will ditch the site rather than succumb). I turn on JS when I have to but only if I trust the site.

I get annoyed when a site uses JS for links, in menu or in text, and usually leave when that happens.

keyplyr

8:50 pm on Nov 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month





I get annoyed when a site uses JS for links, in menu or in text, and usually leave when that happens.

IMO you are in the "techie" category.

Nowadays most sites use some version of Ajax/JS navigation. Like it or not, it's part of the "modern" look. Generally speaking, return traffic increased 20% across the several sites I installed with this type of navigation. And every web page on every one of my sites validates with/without JS.

I'm just not as exclusive as you. I don't block as many ranges or UAs as you seem to do. That's just me. I belong to both categories of user. I've had NoScript installed in FF since day one, but I toggle it back and forth to allow sites as needed. Most users do not use NoScript. Most users barely understand what a browser does.

Pfui

11:23 pm on Nov 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Most users barely understand what a browser does.


Agreed. Heck, this "techie" barely understands what this thread's initial posts are about (pre to-JS or not-to-JS). Implementing crawler IDs (CRID) and poison links?

Could someone please speak slowly and sketch out a thumbnail version of how you do that? (Let alone how you do that on Apache, with mod_rewrite.) TIA!

not2easy

2:32 pm on Nov 14, 2011 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I've just started looking into whitelisting and prefer the concept of the path implementation, maybe only because of problems I've had in the past with parameters getting indexed before I could even deal with that.

dstiles

10:09 pm on Nov 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Keyplyr - I see quite a few UK shopping sites as part of my work and most have plain link navigation. If they have JS navigation I cannot verify them so they do not get listed.