We know that Google looks unkindly on scrapers. This duplicated content competes against the individuals and groups who created it, who often feature adverts from Google's AdSense network. Sinking these sites to the bottom of the SERPs helps everyone: it helps Google offer credible results, helps people find the originators of the content, and keeps the parasites out of the game.
Should the newest parasites, crowdsourced content scrapers, be similarly halted in their tracks before it’s too late?
Crowdsourced content scrapers (Pinterest.com, Weheartit.com, Loveit.com, Ehow.com/spark) are experiencing a surge in popularity this year. Pinterest, in particular, is increasingly throwing its weight around in the SERPs.
The overwhelming majority of the content on these websites infringes someone’s copyright; rare are the people posting original content on, say, Pinterest, where original content may not amount to even 1% sitewide. For many, Pinterest results in the SERPs are a nuisance, a mere extra step to get to the source website; that is, if the source website is credited appropriately, and not mis-attributed to Tumblr, Yahoo Images, or Pinterest itself. Most people “googling” something want some text, not just pictures and a misleading link.
These crowdsourced content scrapers all mark outbound links NOFOLLOW, except for Loveit.com, which may have to shut that door once spammers begin to exploit it. Nofollowed links pass no ranking benefit to the authors whose content is scraped, so they are of very little help in the SERPs.
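For reference, this is what a nofollowed outbound link looks like in a scraper’s markup (the destination URL here is illustrative):

```html
<!-- rel="nofollow" tells search engines not to pass ranking credit
     to the linked page, so the source site gains nothing from the link -->
<a href="http://yourwebsite.com/original-article" rel="nofollow">via yourwebsite.com</a>
```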
Typically, content is scraped via a button that users install in their bookmark bar, making scraping third-party content a breezy, effortless, one-click affair.
Most of the scrapers will create a page with the URL template contentscraper.com/source/yourwebsite.com that often ranks quite highly for your keywords, and even for your domain name as a search query. Some visitors may prefer to view your content on Pinterest, and not visit your link in the SERPs.
Early on, some webmasters were hyping miraculous referral traffic volumes from these scrapers. Lately, reports indicate that rather than leaving the confines of the scraper to follow links to the source, visitors tend to remain on the scraper’s own pages. (http://adweek.com/news/technology/buzzfeed-report-publishing-partners-demonstrates-power-social-web-143194)
A minority of crowdsourced content scrapers offer unique, proprietary opt-out mechanisms.
<meta name="LoveIt" content="nolove">
<meta name="ehow" content="noclip" />
<meta name="pinterest" content="nopin" />
The proliferation of these tags forces content providers to constantly monitor for new opt-out codes as they arise, and to constantly update their websites accordingly. Notably, these aren’t sitewide .htaccess directives; they need to be added to every single web page. Not everyone has dynamic content!
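On sites that do have a shared template, the known opt-out tags above can at least live in one place. A sketch of a combined head include, using the tag names and values the services themselves publish:

```html
<!-- One shared header include carrying every known opt-out tag;
     static sites must paste this into each page by hand -->
<head>
  <meta name="pinterest" content="nopin" />
  <meta name="ehow" content="noclip" />
  <meta name="LoveIt" content="nolove" />
  <title>Your page title</title>
</head>
```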
There are, of course, .htaccess tricks to block these crowdsourced scrapers, or to substitute the scraped image with a copyright warning, but Ehow’s Spark grabs a screenshot of the browser display (stealing both images and text) and is the ultimate stealth scraper. The act of someone scraping your content with Ehow’s bookmark tool is undetectable in web logs, and therefore unstoppable with .htaccess.
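One such .htaccess trick is classic referer-based hotlink protection. This sketch assumes the scraper hotlinks your images and that requests from its pages carry a Referer header naming its domain; the warning-image path is illustrative, and, as noted, it does nothing against screenshot-based scrapers like Ehow’s Spark:

```apache
# Sketch: serve a copyright-warning image when the Referer suggests a scraper.
# Domains and the warning path are illustrative; adjust to taste.
RewriteEngine On
RewriteCond %{HTTP_REFERER} pinterest\.com [NC,OR]
RewriteCond %{HTTP_REFERER} weheartit\.com [NC]
# Don't rewrite the warning image itself, or the rule would loop
RewriteCond %{REQUEST_URI} !copyright-warning\.png$
RewriteRule \.(jpe?g|png|gif)$ /images/copyright-warning.png [L]
```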
DMCA takedown notices, which were once practical against conventional scrapers, are obsolete against the army of crowdsourced content scrapers, whose users scrape content feverishly, around the clock.
Should Google level the playing field and severely penalize these crowdsourced, copyright infringement and duplicated content machines?
Or should Google allow them to rise into greater prominence, as they might under current algos?