X-Robots-Tag - controlling Googlebot via HTTP headers - Google Search and SEO forum at WebmasterWorld - WebmasterWorld

Forum Moderators: Robert Charlton & goodroi


X-Robots-Tag - controlling Googlebot via HTTP headers

encyclo

2:12 pm on Jul 28, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

From the recent Google Blog posting Robots Exclusion Protocol: now with even more flexibility [googleblog.blogspot.com], Google have announced the availability of the unavailable_after meta element, which enables you to give an expiry date to your pages. (See the thread Google Plans a New Meta Tag - "unavailable_after" [webmasterworld.com] for more information.)

However, there is a second, more interesting, announcement in the same entry: the ability to control Googlebot behavior via an HTTP header, the X-Robots-Tag header, rather than on-page meta elements.

We've extended our support for META tags so they can now be associated with any file. Simply add any supported META tag to a new X-Robots-Tag directive in the HTTP Header used to serve the file.

As mentioned in the post, this is very useful for non-HTML content such as PDF, Word or plain text [webmasterworld.com] files, where you cannot insert meta elements. You can also reduce clutter in the document itself, as well as control indexing via the server configuration rather than editing the files.

One caveat not mentioned by Google is that only Googlebot supports this syntax - unless the other search engines decide to follow suit - so you will still need meta elements for Yahoo or MSN. Also, how long do you reckon we'll have to wait until the first case of a hacked server being modified to send a noindex HTTP header with every request?

Inspired

3:24 am on Jul 29, 2007 (gmt 0)

10+ Year Member

Yes, I would definitely be concerned about the ease with which a website on a compromised server could be destroyed.

Key_Master

3:30 am on Jul 29, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

THANK YOU!

engine

3:02 pm on Jul 30, 2007 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

This follows on from our earlier post on the matter.
[webmasterworld.com...]

mcavic

7:44 pm on Jul 30, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

wait until the first case of a hacked server being modified to send a noindex HTTP header with every request

If your server is hacked, search engine placement is the least of your worries.

mikomido

7:53 pm on Jul 30, 2007 (gmt 0)

Am I right in thinking that this is basically a "Is-robot: true" or "Is-robot: 1" HTTP header? So we no longer have to sniff the User-agent string and guess whether it's a bot or a human using a Web browser?

jeffgroovy

9:46 pm on Jul 30, 2007 (gmt 0)

10+ Year Member

If your server is hacked, search engine placement is the least of your worries.

LOL, no doubt! Last time my main Unix server was hacked I didn't have any search engine placement worries; in fact, I didn't have any websites left on it at all. Thank goodness for my backup dedicated hosting: the downtime was minimal.

If someone has unauthorized access to your website, there are already plenty of ways they can break down your business without any need for a new meta tag: someone can already put a nofollow tag and get you out of the serps if they have access to your server.

encyclo

12:10 am on Jul 31, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

someone can already put a nofollow tag and get you out of the serps if they have access to your server

The comment about hackers was merely an aside and not the main part of my post in any way, but I'll just reply to this: the HTTP header is much more unobtrusive, and therefore much harder to detect, than actually modifying the pages themselves or changing the robots.txt (something which has been reported as occurring in the past in order to remove a site from the index).

Am I right in thinking that this is basically a "Is-robot: true" or "Is-robot: 1" HTTP header? So we no longer have to sniff the User-agent string and guess whether it's a bot or a human using a Web browser?

This is not anything sent by the bot itself, so it doesn't help in identifying Googlebot. It is an HTTP header that you add to your server's response to a GET request, and it offers similar functionality to the more commonly seen robots meta elements. You can add the HTTP header via a server-side scripting language (PHP, etc.) or via the server configuration (Apache httpd.conf, IIS...).
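Since the header lives in the response rather than in the page, the only way to spot it is to look at the response headers. Here is a rough Python sketch of such a check. The function name parse_x_robots_tag is made up for illustration, and the comma split is naive: a date-valued directive that itself contains a comma would be split incorrectly.

```python
def parse_x_robots_tag(headers):
    """Return the set of directives found in X-Robots-Tag response headers.

    `headers` is a list of (name, value) pairs, for example
    list(urllib.request.urlopen(url).headers.items()).
    """
    directives = set()
    for name, value in headers:
        # HTTP header names are case-insensitive.
        if name.lower() != "x-robots-tag":
            continue
        # Naive split: assumes no directive value contains a comma.
        for part in value.split(","):
            part = part.strip().lower()
            if part:
                directives.add(part)
    return directives


# Hypothetical response headers for a PDF file.
sample = [
    ("Content-Type", "application/pdf"),
    ("X-Robots-Tag", "noindex, noarchive"),
]
print(sorted(parse_x_robots_tag(sample)))  # ['noarchive', 'noindex']
```

Running something like this against a handful of your own URLs now and then would show an unexpected noindex that never appears in the page source.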

ogletree

1:23 pm on Jul 31, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

This is great but nobody is saying how you would do such a thing. How do you modify your http header on IIS and Apache?

zCat

1:50 pm on Jul 31, 2007 (gmt 0)

10+ Year Member

How do you modify your http header on IIS and Apache?

That's something you'd usually handle at application level, e.g. in PHP / ASP / whatever.
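As a rough illustration of the application-level approach, here is a minimal Python WSGI sketch (the same idea applies in PHP or ASP). The names add_x_robots_tag, NOINDEX_PATHS and demo_app are invented for this example, and the list of file extensions is only an assumption about what you might want to keep out of the index.

```python
import re

# Hypothetical rule: responses for these file types get the header.
NOINDEX_PATHS = re.compile(r"\.(gif|jpe?g|png|pdf)$", re.IGNORECASE)


def add_x_robots_tag(app, pattern=NOINDEX_PATHS, value="noindex"):
    """Wrap a WSGI app so responses for matching paths carry X-Robots-Tag."""

    def wrapped(environ, start_response):
        def custom_start_response(status, headers, exc_info=None):
            # Only add the header when the requested path matches the rule.
            if pattern.search(environ.get("PATH_INFO", "")):
                headers = list(headers) + [("X-Robots-Tag", value)]
            return start_response(status, headers, exc_info)

        return app(environ, custom_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Stand-in application that serves a fixed body for any path."""
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b"demo"]


# The wrapped app can be served by any WSGI server.
application = add_x_robots_tag(demo_app)
```

The wrapper leaves the application untouched, so the rule can be changed in one place without editing the code that produces each response.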

Key_Master

9:00 pm on Jul 31, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Here's a simple example for Apache that you can include in your .htaccess file to keep Googlebot (and hopefully others in time) from indexing image files. With some modification it can be used to control robot access to other files or file types:

# Requires mod_headers. Matches .gif, .jpg, .jpeg and .png files.
<Files ~ "\.(gif|jpe?g|png)$">
Header append X-Robots-Tag "noindex"
</Files>

The X-Robots-Tag directive is a small step towards making robots.txt obsolete.

ogletree

9:50 pm on Jul 31, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I don't really see the need for this. Why not just stick those files in their own directory and disallow it?

Key_Master

10:03 pm on Jul 31, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

If your only intention is to disallow access to a file or files then a robots.txt would work just fine.

However, you can't use noarchive, nofollow, nosnippet, or unavailable_after in a robots.txt file. The X-Robots-Tag header is a much more powerful tool. It allows us to use these directives without needing to edit files. It also allows us to use them for media files, PDF files, etc., that can't have meta tag directives inserted in them. It can also be used for user-agent/IP delivery.
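To make the "different directives for different file types" idea concrete, here is a small Python sketch that picks a header value per file extension. The POLICY table and the function name x_robots_tag_for are hypothetical, the directive names are the ones mentioned above, and joining several directives with commas is my understanding of the syntax Google accepts.

```python
import os

# Hypothetical per-extension policy, for illustration only.
POLICY = {
    ".pdf": ["noarchive", "nosnippet"],
    ".doc": ["noindex"],
    ".jpg": ["noindex"],
}


def x_robots_tag_for(path, policy=POLICY):
    """Return the X-Robots-Tag value for a path, or None if no rule applies."""
    ext = os.path.splitext(path)[1].lower()
    directives = policy.get(ext)
    # Several directives go into one comma-separated header value.
    return ", ".join(directives) if directives else None


print(x_robots_tag_for("/files/Report.PDF"))  # noarchive, nosnippet
print(x_robots_tag_for("/index.html"))        # None
```

None of this can be expressed in robots.txt, which can only allow or disallow crawling of a path.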