How Google Addresses Duplicate Content Due To Scrapers - Google Search and SEO forum at WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

engine

3:24 pm on Jun 9, 2008 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

In an interesting article, Sven Naumann from Google's Search Quality Team helps clarify the issues surrounding duplicate content.

There are a couple of scenarios in the piece.

Before diving in, I'd like to briefly touch on a concern webmasters often voice: in most cases a webmaster has no influence on third parties that scrape and redistribute content without the webmaster's consent. We realize that this is not the fault of the affected webmaster, which in turn means that identical content showing up on several sites in itself is not inherently regarded as a violation of our webmaster guidelines. This simply leads to further processes with the intent of determining the original source of the content—something Google is quite good at, as in most cases the original content can be correctly identified, resulting in no negative effects for the site that originated the content.
Generally, we can differentiate between two major scenarios for issues related to duplicate content:
Within-your-domain-duplicate-content, i.e. identical content which (often unintentionally) appears in more than one place on your site
Cross-domain-duplicate-content, i.e. identical content of your site which appears (again, often unintentionally) on different external sites

How Google Addresses Duplicate Content Due To Scrapers [googlewebmastercentral.blogspot.com]

santapaws

3:49 pm on Jun 9, 2008 (gmt 0)

10+ Year Member

I'm not sure I'd agree with that. Why do scrapers still appear above the original site? Why do sites that have their content duplicated find they return to their old positions when they get the offending sites to remove the duplicated text? They may get it right a lot of the time, perhaps within parameters they find acceptable, but the above suggests they get it right almost all the time, and I'm not so sure that's accurate. Maybe I should qualify that and say there's a big problem, IMHO, with large chunks of copied content rather than site-for-site or page-for-page copying.

tedster

4:03 pm on Jun 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Yes, there's a good bit of happy face in the message - but given the amount of scraping that actually exists (have you looked at the amount of bot access that goes on in your server logs?), Google does get a big bunch of it right. But the situation is something like killing 99% of the mosquitos in a room and then hoping to take a rest. That 1% is still going to buzz in your ear!

As I read the comments, there are three main reasons why the original might be outranked by a scraper:

Unintentionally blocking googlebot from parts of your content
Changes to the content after it was scraped
The original is weakened in Google through some guidelines violations

That second one is not often talked about - but I assume it can make your page look like the derivative page instead of the original.

santapaws

4:48 pm on Jun 9, 2008 (gmt 0)

10+ Year Member

Here's my take on number 2: Google never forgets. It always knows what you had originally.

man in poland

5:23 pm on Jun 9, 2008 (gmt 0)

10+ Year Member

My real bugbear is scrapers getting content of mine from parts of my site that are blocked by robots.txt - Google naturally does not have this content associated with my domain, because it respects the robots.txt file - but the scraper will happily ignore the file and take all the content, presenting it as its own.

Maybe Google, for the purposes of duplicate content, should at least spider (if not index) ALL content so that it can at least be apprised of the original source?

maximillianos

6:00 pm on Jun 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Our biggest problem was (is) with scrapers that take our content and embed it on their site wrapped around their navigation, etc. They make a very good effort to masquerade as a legitimate site, which makes it very difficult for Google. Particularly when they are taking stories from our site on the same day we post them, and posting them on their site.

g1smd

6:26 pm on Jun 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Something I wrote last week also appeared on an Indian website the very next day, and the original article immediately dropped out of Google.

Once the Indian copy had been removed, it took several days for the original to re-appear; helped along in part by the original having another 25% extra content now added to it.

Jane_Doe

6:42 pm on Jun 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Why do scrapers still appear above the original site?

The scrapers may be higher because your site has spam penalties for reasons other than having duplicate content. Sometimes just cleaning up anything dodgy on a site without changing the content will get it to rise above the scrapers.

Something I wrote last week also appeared on an Indian website the very next day, and the original article immediately dropped out of Google.

Exactly. I don't agree that Google is really very good at identifying the originator of content. I've had original stuff I wrote that had been online for years drop out sometimes when someone copies it. When anyone with an authority site scrapes content from one of my newer and/or lower ranking sites, they seem to get away with it just fine, at least algorithmically.

confuscius

6:50 pm on Jun 9, 2008 (gmt 0)

10+ Year Member

From the article "... a webmaster has no influence on third parties that scrape and redistribute content without the webmaster's consent."

Can someone point me at the email or other means that Google used to get my consent before it scraped and distributed my content?

Pot, Kettle, Black and just more fog!

The bit that made me laugh was "... you might have the case of someone scraping your content to put it on a different site, often to try to monetize it." - so I would not find a single Google ad on any of those sites as the means of monetization, would I?

santapaws

7:11 pm on Jun 9, 2008 (gmt 0)

10+ Year Member

The scrapers may be higher because your site has spam penalties for reasons other than having duplicate content. Sometimes just cleaning up anything dodgy on a site without changing the content will get it to rise above the scrapers.

I hear this a lot. You're saying there's a logic to an algo that places a site that writes original content below a site that just uses everyone else's content? Isn't a scraper spam by default?

pageoneresults

7:29 pm on Jun 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

This simply leads to further processes with the intent of determining the original source of the content—something Google is quite good at, as in most cases the original content can be correctly identified, resulting in no negative effects for the site that originated the content.

Based on that quote, Google can determine this in seconds? And there is no "ripple" effect for the targeted domain? That is not what I've seen in my findings. It "takes a while" for Google to figure out the authoritative source. During that time, the scraper is above the original resource. If the scraper has more PR than the target domain, I think the scraper wins in the short run. That's the part that really sucks! It could be months before all that gets figured out.

And if someone is mounting a coordinated attack using scrapers against the target domain, I think the time factors are extended quite a bit. Just my "Tin Hat" theory. ;)

outland88

7:56 pm on Jun 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Again Google, with their PR machine, more or less sends the message not to worry about people stealing your content. You have absolutely no worries. But at the same time, without saying a word, they’re protecting the stolen content. It’s not particularly the stolen content that outranks you; it’s the stolen content you never peer at. It may sit 10-50 positions below where you rank. Plus many smaller search engines are wall-to-wall scrapers now.

Sharp minds are going to greet you with a lot of skepticism if the stolen content that you claim isn’t hurting anybody is uniquely linked to your profits. It’s like a politician defending a company accused of wrongdoing while at the same time having an interest in the company with a half million shares of stock. Whether real or imagined, the suggestion of improprieties will exist.

Bottom line is “Google loves stolen content.” In a digital age even exposure of stolen content with Adsense on it for short durations of time, even minutes, enriches Google.

CainIV

9:07 pm on Jun 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Unfortunately it's simply not that easy. With the ease of scraping, it is just as likely that someone can find your content and then point a few links to it, or simply have it cached quicker than yours. In the case where they strip the article of links, webmasters are left where they have always been around this issue - wondering :)

Samizdata

9:09 pm on Jun 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Google can determine this in seconds?

Sven Naumann says nothing about the timescale involved and merely claims that Google's technology is "quite good" in "the majority of cases" and "usually works very well".

Soft soap, indeed.

Jane_Doe

9:18 pm on Jun 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I hear this a lot. You're saying there's a logic to an algo that places a site that writes original content below a site that just uses everyone else's content? Isn't a scraper spam by default?

If your site has spam penalties, perhaps due to something like dodgy link practices, it may easily rank lower than a scraper site, even if you have lots of original content.

The scraper sites with your content ranking above your site may be there because of unrelated spam penalties on your site and may not be the cause of your lower ranking.

The scrapers may be both a cause and an effect, or they may be an effect but not a cause, or some variation in between.

Most of the time, if your site is older and has decent page rank, in my experience the scrapers are an effect but not a cause. Usually the duplicate penalties come in when non-scraper, decent ranking sites copy your entire content and your original content gets flagged as the duplicate (which I think happens more than the Google engineers may realize).

koan

12:13 am on Jun 10, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

My real bugbear is scrapers getting content of mine from parts of my site that are blocked by robots.txt

Easy, set up a bot trap and put it in your robots.txt.
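
A minimal sketch of the bot-trap idea, assuming a hypothetical trap path `/bot-trap/` that is disallowed in robots.txt and linked invisibly from your pages. Crawlers that honour robots.txt never request it, so any client that does can be flagged. The path, the sample requests, and the function names are made up for illustration; the robots.txt handling uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the trap directory is off limits to every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bot-trap/
"""

TRAP_PREFIX = "/bot-trap/"  # assumed path, linked invisibly from real pages


def compliant_crawler_may_fetch(path, user_agent="*"):
    """Return True if a crawler that honours robots.txt may request path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


def find_trap_visitors(requests):
    """Given (ip, path) pairs from an access log, return IPs that hit the trap."""
    return {ip for ip, path in requests if path.startswith(TRAP_PREFIX)}


# Made-up sample requests, for illustration only.
sample_requests = [
    ("192.0.2.10", "/articles/widgets.html"),
    ("198.51.100.7", "/bot-trap/index.html"),
    ("192.0.2.10", "/robots.txt"),
]

blocked = find_trap_visitors(sample_requests)
```

What you do with the flagged addresses (deny them at the server, serve them junk) is a separate decision; the sketch only shows how the trap separates compliant crawlers from the rest.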

What bothers me with Google is how many hoops you have to jump through to get them to close down a page in Blogger or Blogspot. And those are sites with usually no means to contact the blogger.

PS: Note to other social and blog sites out there: no, I won't register on your stupid web site just to be able to contact one of your infringing bloggers or users. Hosting copyright infringers is your problem and you should be easily contacted, and you should act quickly. And I'm also looking at you, Yahoo! Answers. Take some freaking responsibility.

JS_Harris

12:33 am on Jun 10, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

What about images?

No mention is made of PARTIAL duplicate content - such as with an image found elsewhere (used with permission) with your personal commentary added. Or perhaps you write an article about a tree and the tree image is from a free stock resource site so it's on 200 other sites too. The image(s) is duplicate, the text isn't, does that make the page duplicate?

ken_b

1:06 am on Jun 10, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

>>images...

I'd be happy if they could just send me a Google Alert when they came across one of my photos somewhere other than my site.

But I don't think that'll happen any time soon.

I just a few minutes ago stumbled across one of my images being used on another site without my previous knowledge.

That said, someone mentioned that a copy might outperform the original if the original changed after the copy was made, if I understood correctly.

That bothers me because, while I sometimes change my pages as I learn more about the widget on the page, more often I add other content to a page. It sounds like that could hurt my rankings, which I wouldn't especially care to have happen.

youfoundjake

3:39 am on Jun 10, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I know that Matt Cutts touched on this briefly in Feb, [mattcutts.com...]
In it he mentioned what came up in today's post: linking back to your own site if you syndicate content, and format issues within your own site (different versions of the same page, printable, PDF).

Whitey

4:49 am on Jun 10, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I don't think this fully addresses the whole duplicate content issue.

Good scrapers can wreak havoc on the original site [and I'm not going to reveal how here - as folks with these intentions may read these threads] - but anyone faced with this knows it exists and what I'm talking about.

The issues go still further on the same principles of legitimate syndicated content. There are good instances where the syndicated content can be presented in a manner that is better for users or varied.

So I make the point that Google's algo may not have sufficient intelligence to recognise this, if all it does is recognise the original document [indeed if it can - 100% of the time]. Maybe the non-original content sometimes deserves to rank higher.

[edited by: Whitey at 4:52 am (utc) on June 10, 2008]

Rosalind

5:26 pm on Jun 10, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

The scraping issue is still a problem, particularly when it comes to partial excerpts, and I don't think Google has a handle on this. I suspect we will see a fully AI bot before this issue is resolved. There's some complete and utter nonsense floating around in the SERPs, but how is Google going to filter it out without a human?

A grammar filter might be a start, because good content does tend to have more correct paragraphs, but then where does that leave your navigation and list pages?
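
One rough way to think about the partial-excerpt problem is word shingling: break each text into overlapping runs of words and measure how many of the excerpt's runs also occur in the original. The sketch below only illustrates that general technique and is not a description of how Google detects duplicates; the function names, the shingle size of 3, and the sample strings are all assumptions.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Text shorter than one shingle: treat the whole thing as a single unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def containment(excerpt, original, size=3):
    """Fraction of the excerpt's shingles that also appear in the original."""
    excerpt_shingles = shingles(excerpt, size)
    if not excerpt_shingles:
        return 0.0
    shared = excerpt_shingles & shingles(original, size)
    return len(shared) / len(excerpt_shingles)


# Made-up sample texts, for illustration only.
original = "The blue widget is sturdy, cheap to make, and easy to repair at home."
scraped = "Buy now! The blue widget is sturdy, cheap to make, and easy to repair."
unrelated = "Our holiday cottage has a lovely view of the harbour."

scraped_score = containment(scraped, original)      # most shingles are shared
unrelated_score = containment(unrelated, original)  # no shingles are shared
```

A high containment score shows that text was copied, but says nothing about which page came first, which is the harder question raised in this thread.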

webfoo

12:38 am on Jun 12, 2008 (gmt 0)

10+ Year Member

Scraped pages usually have AdSense on them, so it makes sense (literally) for Google to favor them. Whichever page is highest in search rankings, Google would favor - the higher in the rankings, the more clicks it gets. The more clicks, the more AdSense ads are displayed, the more money Google makes.

Google is probably in the scraping industry just so they can make themselves MORE money! As if they don't have enough already ...

jcmiras

5:48 am on Jun 12, 2008 (gmt 0)

10+ Year Member

In addition to this, some scrapers are clever enough to slightly modify the title of the article to make it appear different from the original.

notsosmart

8:34 pm on Jun 13, 2008 (gmt 0)

10+ Year Member

In addition to this, some scrapers are clever enough to slightly modify the title of the article to make it appear different from the original.

I'm not sure if I'd go as far as to call this "tactic" clever (it seems obvious to me), but it is definitely annoying.

It only adds to the difficulty of combating scraping.