Chilling Effects DMCA Archive Deletes Self From Google

Forum Moderators: not2easy

bill

1:20 am on Jan 13, 2015 (gmt 0)

https://www.techdirt.com/articles/20150112/06545529675/chilling-effects-chilling-effects-as-dmca-archive-deletes-self-google.shtml [techdirt.com]

Chilling Effects On Chilling Effects As DMCA Archive Deletes Self From Google

Over the weekend, TorrentFreak noted that the website Chilling Effects had apparently removed itself from Google's search index after too many people complained.
This week, however, we were no longer able to do so. The Chilling Effects team decided to remove its entire domain from all search engines, including its homepage and other informational and educational resources.

ken_b

1:27 am on Jan 13, 2015 (gmt 0)

Seems to be a bit of confusion about this. From the same article:

Meanwhile, Chilling Effects founder, Wendy Seltzer, seems to insist that this was an implementation mistake and that the team never meant to remove the whole domain:

rogerd

1:35 am on Jan 13, 2015 (gmt 0)

I've seen this happen more than once... a development site has indexing blocked, and then gets pushed live without anyone noticing. (If the second comment in the article is correct.)

lucy24

2:54 am on Jan 13, 2015 (gmt 0)

<meta name="ROBOTS" content="NOODP" />

I took my best guess about the domain name and then looked at some random pages.

Psst! ChillingEffects! HTML doesn't require a closing slash.

lammert

9:45 am on Jan 13, 2015 (gmt 0)

Looking at the robots.txt, I wouldn't call this an implementation mistake by some developers. Everything except the /pages subtree, where the about page resides, is deliberately blocked for all bots. It's easy to reverse, but they haven't done so yet, so they probably want it this way.

Does anyone know what the Google-Legal-Removals bot is doing, BTW? Is that bot used to scrape the DMCA notices from the Chilling Effects site?

User-agent: Google-Legal-Removals
Disallow:

User-agent: Googlebot
Noindex: /
Allow: /pages

User-agent: *
Disallow: /
Allow: /pages

not2easy

3:35 pm on Jan 13, 2015 (gmt 0)

No need to guess at the domain name, it is linked to from here in the Charter: [webmasterworld.com...]

Idly wondering if their content is being "blocked" from any other crawlers?

lucy24

7:31 pm on Jan 13, 2015 (gmt 0)

User-agent: Googlebot
Noindex: /
Allow: /pages

wtf? Does Allow: also mean "override a Noindex: directive"? Does anyone other than Google use this?

Everything except the /pages subtree

... and that's only if the crawler understands the Allow: formulation. They don't have to; so far there's no robots.txt 2.0 standard.

When a robots.txt file mentions user-agents other than yourself or * doesn't it tend to mean that everyone sees the same file?
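
To illustrate the point that a crawler doesn't have to treat Allow: the way Google does, here is a minimal sketch using Python's standard urllib.robotparser, which applies the first matching rule in file order rather than the most specific one. The robots.txt text is the one quoted in this thread; the agent name SomeOtherBot and the path /notice/1 are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted earlier in this thread.
ROBOTS_TXT = """\
User-agent: Google-Legal-Removals
Disallow:

User-agent: Googlebot
Noindex: /
Allow: /pages

User-agent: *
Disallow: /
Allow: /pages
"""


def fetch_decisions(robots_txt, agent, paths):
    """Return {path: bool} as decided by Python's stdlib parser.

    This parser walks a group's rules in file order and stops at the
    first one whose path is a prefix of the requested path.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}


# "SomeOtherBot" and "/notice/1" are hypothetical names for this example.
decisions = fetch_decisions(ROBOTS_TXT, "SomeOtherBot", ["/pages/about", "/notice/1"])
print(decisions)
```

With this parser, the generic group's "Disallow: /" comes first and matches every path, so the later "Allow: /pages" never gets a chance. A crawler built this way would see the whole site as blocked, /pages included.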

aakk9999

3:51 am on Jan 14, 2015 (gmt 0)

User-agent: Googlebot
Noindex: /
Allow: /pages

wtf? Does Allow: also mean "override a Noindex: directive"? Does anyone other than Google use this?

I have used "Allow: some-url-pattern" before. It is Google specific: Block URLs with robots.txt [support.google.com]

Noindex: /

Never heard of this within robots.txt. As far as I know, noindex is a directive that is declared only in meta robots or is returned within X-Robots-Tag as part of HTTP response headers.

So if they were to block everything apart from /pages then I would expect their robots.txt to look like this instead:

User-agent: Googlebot
Disallow: /
Allow: /pages

But I would not expect to see the Noindex: / syntax. Have I missed something new?
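
For comparison, the two places mentioned above where noindex is normally declared (the meta robots tag and the X-Robots-Tag response header) can be checked with a short sketch like the one below. The function and class names are my own, and the sample header value is made up; only the meta tag is the one lucy24 quoted from the site. It ignores the user-agent-prefixed form of X-Robots-Tag for simplicity.

```python
from html.parser import HTMLParser


def _split_directives(value):
    """Split a comma-separated directive list into lowercase tokens."""
    return {token.strip().lower() for token in value.split(",") if token.strip()}


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= _split_directives(attr.get("content", ""))


def robots_directives(headers, html):
    """Union of directives found in X-Robots-Tag headers and meta robots tags."""
    found = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            found |= _split_directives(value)
    parser = RobotsMetaParser()
    parser.feed(html)
    return found | parser.directives


# Hypothetical response: the header value is invented for illustration.
sample = robots_directives(
    {"X-Robots-Tag": "noindex"},
    '<html><head><meta name="ROBOTS" content="NOODP" /></head></html>',
)
print("noindex" in sample)
```

Unlike a robots.txt rule, both of these require the crawler to actually fetch the page, so they do nothing for URLs that robots.txt already blocks.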

not2easy

4:09 am on Jan 14, 2015 (gmt 0)

The Noindex: / directive in a robots.txt is meaningless - unless it is a recent secret change not mentioned anywhere - it does nothing. Most of that file would only confuse Googlebot imho.

Official standards haven't been updated since 1997 (for HTML 4.01) but Google and Bing both recognize and follow the Allow: directive.

I have read that they compare the length of the path that follows each Disallow: and Allow: to decide which rule to follow. Sort of iffy.
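
The length comparison can be sketched roughly as follows, assuming the behavior Google documents: the matching rule with the longest path wins, and on a tie the less restrictive (Allow) rule is used. This is a simplified illustration, not Google's implementation; it handles plain path prefixes only and ignores the * and $ wildcards. The function name and the /notice/1 path are made up.

```python
def is_allowed(path, rules):
    """Decide access by longest matching path prefix.

    rules is a list of (directive, pattern) pairs for one user-agent
    group, where directive is "allow" or "disallow". A path that no
    rule matches is allowed by default.
    """
    best_length = -1
    allowed = True
    for directive, pattern in rules:
        # An empty pattern (e.g. "Disallow:") matches nothing.
        if not pattern or not path.startswith(pattern):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(pattern) > best_length
        tie_favoring_allow = len(pattern) == best_length and is_allow
        if longer or tie_favoring_allow:
            best_length = len(pattern)
            allowed = is_allow
    return allowed


# The group aakk9999 expected to see for Googlebot.
googlebot_rules = [("disallow", "/"), ("allow", "/pages")]

print(is_allowed("/pages/about", googlebot_rules))
print(is_allowed("/notice/1", googlebot_rules))
```

Under this rule the order of the lines doesn't matter: "/pages" is longer than "/", so anything under /pages is crawlable and everything else is blocked.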

phranque

5:38 am on Jan 14, 2015 (gmt 0)

at one time google recognized the Noindex: robots.txt directive as an undocumented feature.

lammert

7:47 am on Jan 14, 2015 (gmt 0)

Noindex: validates in Google's robots.txt checker, so it should be OK.