Forum Moderators: Robert Charlton & goodroi
There are a couple of scenarios in the piece.
Before diving in, I'd like to briefly touch on a concern webmasters often voice: in most cases a webmaster has no influence on third parties that scrape and redistribute content without the webmaster's consent. We realize that this is not the fault of the affected webmaster, which in turn means that identical content showing up on several sites in itself is not inherently regarded as a violation of our webmaster guidelines. This simply leads to further processes with the intent of determining the original source of the content - something Google is quite good at, as in most cases the original content can be correctly identified, resulting in no negative effects for the site that originated the content.
How Google Addresses Duplicate Content Due To Scrapers [googlewebmastercentral.blogspot.com]
Generally, we can differentiate between two major scenarios for issues related to duplicate content:
Within-your-domain-duplicate-content, i.e. identical content which (often unintentionally) appears in more than one place on your site
Cross-domain-duplicate-content, i.e. identical content of your site which appears (again, often unintentionally) on different external sites
Google does get a big bunch of it right. But the situation is something like killing 99% of the mosquitos in a room and then hoping to take a rest. That 1% is still going to buzz in your ear! As I read the comments, there are three main reasons why the original might be outranked by a scraper:
That second one is not often talked about - but I assume it can make your page look like the derivative page instead of the original.
Maybe Google, for the purposes of duplicate content, should at least spider (if not index) ALL content so that it can at least be apprised of the original source?
Once the Indian copy had been removed, it took several days for the original to re-appear; helped along in part by the original having another 25% extra content now added to it.
Why do scrapers still appear above the original site?
The scrapers may be higher because your site has spam penalties for reasons other than having duplicate content. Sometimes just cleaning up anything dodgy on a site without changing the content will get it to rise above the scrapers.
Something I wrote last week also appeared on an Indian website the very next day, and the original article immediately dropped out of Google.
Exactly. I don't agree that Google is really very good at identifying the originator of content. I've had original stuff I wrote, which had been online for years, drop out when someone copied it. When anyone with an authority site scrapes content from one of my newer and/or lower ranking sites, they seem to get away with it just fine, at least algorithmically.
Can someone point me at the email or other means that Google used to get my consent before it scraped and distributed my content?
Pot, Kettle, Black and just more fog!
The bit that made me laugh was "... you might have the case of someone scraping your content to put it on a different site, often to try to monetize it." - so I would not find a single Google ad on any of those sites as the means of monetization, would I?
The scrapers may be higher because your site has spam penalties for reasons other than having duplicate content. Sometimes just cleaning up anything dodgy on a site without changing the content will get it to rise above the scrapers.
I hear this a lot. You're saying there's a logic to an algo that places a site that writes original content below a site that just uses everyone else's content? Isn't a scraper spam by default?
This simply leads to further processes with the intent of determining the original source of the content—something Google is quite good at, as in most cases the original content can be correctly identified, resulting in no negative effects for the site that originated the content.
Based on that quote, Google can determine this in seconds? And there is no "ripple" effect for the targeted domain? That is not what I've seen in my findings. It "takes a while" for Google to figure out the authoritative source. During that time, the scraper is above the original resource. If the scraper has more PR than the target domain, I think the scraper wins in the short run. That's the part that really sucks! It could be months before all that gets figured out.
And, if someone is running a coordinated attack using scrapers against the target domain, I think the time factors are extended quite a bit. Just my "Tin Hat" theory. ;)
Sharp minds are going to greet you with a lot of skepticism if the stolen content that you claim isn't hurting anybody is uniquely linked to your profits. It's like a politician defending a company accused of wrongdoing while at the same time having an interest in the company with a half million shares of stock. Whether real or imagined, the suggestion of improprieties will exist.
Bottom line is “Google loves stolen content.” In a digital age even exposure of stolen content with Adsense on it for short durations of time, even minutes, enriches Google.
I hear this a lot. You're saying there's a logic to an algo that places a site that writes original content below a site that just uses everyone else's content? Isn't a scraper spam by default?
If your site has spam penalties, perhaps due to something like dodgy link practices, it may easily rank lower than a scraper site, even if you have lots of original content.
The scraper sites with your content ranking above your site may be there because of unrelated spam penalties on your site and may not be the cause of your lower ranking.
The scrapers may be both a cause and an effect, or they may be an effect but not a cause, or some variation in between.
Most of the time, if your site is older and has decent page rank, in my experience the scrapers are an effect but not a cause. Usually the duplicate penalties come in when non-scraper, decent ranking sites copy your entire content and your original content gets flagged as the duplicate (which I think happens more than the Google engineers may realize).
My real bugbear is scrapers getting content of mine from parts of my site that are blocked by robots.txt
Easy, set up a bot trap and put it in your robots.txt.
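For anyone who hasn't built one, here is a rough sketch of the idea in Python. It assumes a made-up trap path (/bot-trap/), an in-memory ban list and banning by IP address, none of which come from this thread. The logic is simply that robots.txt tells well-behaved crawlers to stay out of the trap, so anything that requests it anyway is ignoring robots.txt and gets blocked from then on. A real setup would more likely do this in the server config or a firewall rule.

```python
from dataclasses import dataclass, field

# Hypothetical trap path: link to it invisibly from your pages, never show it to users.
TRAP_PATH = "/bot-trap/"


def robots_txt(trap_path: str = TRAP_PATH) -> str:
    """Build a robots.txt body telling all compliant crawlers to avoid the trap."""
    return f"User-agent: *\nDisallow: {trap_path}\n"


@dataclass
class BotTrap:
    """Remembers clients that fetched the disallowed path and refuses them afterwards."""
    trap_path: str = TRAP_PATH
    banned: set = field(default_factory=set)

    def allow(self, ip: str, path: str) -> bool:
        # A compliant crawler never requests the trap, so a hit here means
        # the client ignored robots.txt (or followed the hidden link blindly).
        if path.startswith(self.trap_path):
            self.banned.add(ip)
        return ip not in self.banned
```

You would call allow() for every incoming request and return a 403 whenever it comes back False. Banning by IP is crude (shared addresses, rotating proxies), so treat it as a starting point only.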
What bothers me with Google is how many hoops you have to jump through to get them to close down a page in Blogger or Blogspot. And those are sites with usually no means to contact the blogger.
PS: Note to other social and blog sites out there, no, I won't register on your stupid web site just to be able to contact one of your infringing bloggers or users. Hosting copyright infringers is your problem and you should be easily contacted, and you should act quickly. And I'm also looking at you, Yahoo! Answers. Take some freaking responsibility.
No mention is made of PARTIAL duplicate content - such as with an image found elsewhere (used with permission) with your personal commentary added. Or perhaps you write an article about a tree and the tree image is from a free stock resource site so it's on 200 other sites too. The image(s) is duplicate, the text isn't, does that make the page duplicate?
I'd be happy if they could just send me a Google Alert when they came across one of my photos somewhere other than my site.
But I don't think that'll happen any time soon.
I just a few minutes ago stumbled across one of my images being used on another site without my previous knowledge.
That said, someone mentioned that a copy might outperform the original if the original changed after the copy was made, if I understood correctly.
That bothers me because, while I do change my pages as I learn more about the widget on the page, more often I add other content to a page. It sounds like that could hurt my rankings, which I wouldn't especially care to have happen.
Good scrapers can wreak havoc on the original site [ and I'm not going to reveal how here - as folks with these intentions may read these threads. ] - but anyone faced with this knows it exists and what I'm talking about.
The issues go still further on the same principles of legitimate syndicated content. There are good instances where the syndicated content can be presented in a manner that is better for users or varied.
So I make the point that Google's algo may not have sufficient intelligence to recognise this, if all it does is recognise the original document [ indeed if it can - 100% of the time ]. Maybe the non-original content sometimes deserves to rank higher.
[edited by: Whitey at 4:52 am (utc) on June 10, 2008]
A grammar filter might be a start, because good content does tend to have more correct paragraphs, but then where does that leave your navigation and list pages?
Google is probably in the scraping industry just so they can make themselves MORE money! As if they don't have enough already ...
In addition to this, some scrapers are clever enough to slightly modify the title of the article to make it appear different from the original.
I'm not sure if I'd go as far as to call this "tactic" clever (it seems obvious to me), but it is definitely annoying.
It only adds to the difficulty of combating scraping.
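On the retitling point, a site owner hunting for copies can still catch a lightly edited title with a simple word-overlap check. The sketch below is only an illustration of that idea in Python: the function names and the 0.6 threshold are my own assumptions, and it says nothing about how Google or any other search engine actually compares pages.

```python
import re


def title_tokens(title: str) -> set:
    """Lowercase the title and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def title_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two word sets: shared words / all distinct words."""
    ta, tb = title_tokens(a), title_tokens(b)
    if not ta and not tb:
        return 0.0  # nothing to compare
    return len(ta & tb) / len(ta | tb)


def looks_like_retitled_copy(original: str, candidate: str,
                             threshold: float = 0.6) -> bool:
    # The threshold is an arbitrary assumption; tune it against your own titles.
    return title_similarity(original, candidate) >= threshold
```

Swapping one word and tacking another on the end still leaves most of the words in common, so a title changed that way scores well above an unrelated one. A scraper who rewrites the whole title would slip past this, which is where comparing the body text would have to take over.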