Tricky robots.txt issue

Only want the home page crawled.

lgn1

12:47 am on Jun 24, 2010 (gmt 0)

Due to duplicate content issues, we only want the home page crawled,

e.g. www.example.com

What would the robots.txt file look like just to index the home page URI, with no file extensions and no other files?

I have found no information on how to do this.

tangor

1:05 am on Jun 24, 2010 (gmt 0)

Does your home page have links to other pages? If so, consider all of those indexed, too... and if you don't want that, then noindex,nofollow those links, then prepare to include in robots.txt every folder/file on your site...

Given the goal above, this does not seem to be a robots.txt issue...

jdMorgan

2:29 am on Jun 24, 2010 (gmt 0)

> then prepare to include in robots.txt every folder/file on your site...

That's not really necessary. The universally-supported original Standard for Robot Exclusion defines only the "Disallow" directive, but because robots.txt uses prefix-matching, you could disallow fetching of all pages except the home page using 26 Disallows -- or possibly 36, or a few more than that.

So, 26 lines like

Disallow: /a
Disallow: /b
.
.
.
Disallow: /y
Disallow: /z

would disallow all resources whose URL-paths begin with a-z. And if you use numbers and a few non-encoded punctuation marks as initial characters, you could disallow those as well.
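As a rough way to check the prefix-matching idea above, here is a minimal Python sketch. It generates one Disallow line per initial character and tests a few URLs with the standard library's urllib.robotparser. The helper names build_robots_txt and is_allowed, the "User-agent: *" line, and the sample URLs are assumptions made up for illustration. Real crawlers may interpret the file somewhat differently from this parser.

```python
import string
from urllib.robotparser import RobotFileParser


def build_robots_txt(initial_chars=string.ascii_lowercase + string.digits):
    """Build robots.txt text with one Disallow per initial path character.

    Assumption: lowercase letters plus digits (36 rules). Paths that start
    with an uppercase letter or punctuation would need their own lines.
    """
    lines = ["User-agent: *"]
    lines += [f"Disallow: /{ch}" for ch in initial_chars]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the stdlib parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots_txt = build_robots_txt()
    for url in (
        "http://www.example.com/",            # home page, no rule matches "/"
        "http://www.example.com/about.html",  # matched by the "/a" prefix
        "http://www.example.com/_private",    # "_" is not covered by any rule
    ):
        print(url, is_allowed(robots_txt, url))
```

With this parser, only URL-paths whose first character has its own Disallow line are blocked. The bare "/" stays fetchable, and so does anything starting with a character you left out, which is why the count can creep past 36.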

Jim

tangor

3:57 am on Jun 24, 2010 (gmt 0)

Jim, as always, you remain a treasure of knowledge. I still maintain that if the index page is crawled, whatever is linked from that page (which makes the index page WORK) will appear in the SERPs. We know Google gets EVERYTHING, even if disallowed, and Yahoo and Bing do the same.

phranque

9:57 am on Jun 24, 2010 (gmt 0)

note that robots.txt is intended to exclude robots from crawling a page; it does not control indexing.
if a url is discovered through internal linking from your home page or an inbound link, that url can still be indexed without the content being crawled.
if you prefer that neither the url nor the content be indexed, you must allow crawling of that content so that the robot can see either:
- the robots noindex meta tag you have placed in the <head> of your HTML document.
- the X-Robots-Tag HTTP header with a noindex value that is returned with the requested resource.
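To make the second option concrete, here is a minimal Python sketch that adds an X-Robots-Tag: noindex header to every response except the home page, using the standard library's http.server. The function name robots_headers, the handler class, and the placeholder page body are assumptions for illustration only. A real site would normally set this header in its web server or application configuration instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def robots_headers(path):
    """Return extra response headers for a request path.

    Assumption: only the exact path "/" counts as the home page, so
    variants such as "/?ref=1" or "/index.html" also receive noindex.
    """
    if path == "/":
        return {}
    return {"X-Robots-Tag": "noindex"}


class NoindexHandler(BaseHTTPRequestHandler):
    """Serve a placeholder page and attach noindex to non-home URLs."""

    def do_GET(self):
        body = b"<html><head><title>demo</title></head><body>demo</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        for name, value in robots_headers(self.path).items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Hypothetical local test server; address and port are arbitrary.
    HTTPServer(("127.0.0.1", 8000), NoindexHandler).serve_forever()
```

For this to have any effect, the affected URLs must not be disallowed in robots.txt, since a robot that never fetches the page never sees the header.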

> to index the home page URI, with no file extensions and no other files.

nothing in the robots exclusion protocol will canonicalize your home page url or rewrite to an extensionless url, if that's what you are asking.