Forum Moderators: phranque

403 index forbidden

better way?

         

santapaws

12:59 pm on Sep 25, 2011 (gmt 0)

10+ Year Member



I wonder what's the best way to handle search engine requests for directories that don't have an index file, and thus the engine gets served a 403. Is there a better way to handle this than just forbidding indexes via Options in .htaccess?
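
For reference, the setup described in the question probably looks something like the sketch below. This assumes Apache with an .htaccess file that is allowed to override Options; the exact directive the poster uses is not shown in the thread.

```apache
# Assumed .htaccess in the site root (requires AllowOverride Options).
# With auto-generated listings switched off, a request for a directory
# that has no index document is answered with 403 Forbidden.
Options -Indexes
```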

lucy24

8:09 pm on Sep 25, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



403 is tricky because you get it two different ways. For ordinary humans, it's rarely anything other than a "Nope, no index here" error. For robots, it's "Get thee hence and never darken this doorstep again!"

Would you agree that humans are more important than robots? If so, they have to get a user-friendly 403 page that politely points them to alternatives. Some people redirect all 403s and/or 404s to their overall index page. (This individual user personally can't stand sites that do this.)
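
A minimal sketch of a friendlier 403 page, assuming Apache and a hypothetical /errors/403.html that links to the real content:

```apache
# /errors/403.html is a made-up path; point it at your own page.
# A local path keeps the 403 status code. A full http:// URL here
# would make Apache send a redirect instead of the 403.
ErrorDocument 403 /errors/403.html
```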

In the case of robots, they had no business asking for the page in the first place. It may have been part of a mechanical search: that is, if there is an a/b/c/d/e/index.html they may automatically ask for a/b/c/d/, a/b/c/ and so on. But it is forbidden, so 403 is the right error to give. If they ask for a/b/c/d/index.html by name, they'll probably get a 404. But all of this really is the robot's problem, not yours ;)

If you're getting a lot of requests for some particular nonexistent Index, it may be worth the trouble to find out where they're coming from and deal with them individually.

Oh yes and... I have a couple of directories whose front page happens not to be named index.html. If requests come in for those directories, either as directory alone or with appended index.html, I redirect them individually to the correct page. That's only fair.

phranque

10:55 pm on Sep 25, 2011 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



the response should be appropriate to the request.
if you wish to expose the directory to requests, you should either specify a default directory index document or configure the server to auto-generate a directory listing.
if the default directory index document doesn't exist in a specific directory, only you can say whether the correct response to that request should be 403/Forbidden, 404/Not Found or 410/Gone (or 200/OK with the auto-index).
if the requested directory actually exists and is used in a valid url structure i would suggest 403 is probably the proper response for that request.
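
The two ways of exposing a directory mentioned above could be sketched like this on Apache. The file names are only examples, and you would normally pick one option or the other for a given directory.

```apache
# Option 1: name the default directory index document(s),
# tried in the order listed.
DirectoryIndex index.html index.php

# Option 2: let the server auto-generate a listing (mod_autoindex)
# when no index document is found.
Options +Indexes
```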

for example, you may serve all your images as /images/example.png
however you don't have a default directory index document in /images/ nor are you interested in exposing the complete list of image files in that directory.
403 is the best response for that request.

"In the case of robots, they had no business asking for the page in the first place. It may have been part of a mechanical search: that is, if there is an a/b/c/d/e/index.html they may automatically ask for a/b/c/d/, a/b/c/ and so on."

i would argue that a directory structure in a url should, in the name of url discovery, be hackable by any visitor whether human or bot.
if your architecture does not support requests for content in /a/b/c/d/ or /a/b/c/ then i would also argue that you are probably keyword-stuffing your url path.

"I have a couple of directories whose front page happens not to be named index.html."

a request for a directory should end in a trailing slash or be 301 redirected to the trailing-slash url.
otherwise it is simply a request for a resource (file) within that url path (directory).
you are actually describing a directory without an index.
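
On Apache, the trailing-slash redirect is handled by mod_dir. A sketch, assuming mod_dir is loaded (the setting below is already the default, so it is shown only to make the behaviour explicit):

```apache
# With DirectorySlash On, a request for /widgets (an existing directory,
# no trailing slash) is redirected to /widgets/ before any index lookup.
DirectorySlash On
```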

lucy24

12:35 am on Sep 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"if your architecture does not support requests for content in /a/b/c/d/ or /a/b/c/ then i would also argue that you are probably keyword-stuffing your url path."

You can get it legitimately in directories that contain gazillions of images. Rather than leave them all flopping around loose, you group them into subdirectories. So you will have directories that contain nothing but more subdirectories, but they serve the useful purpose of keeping the webmaster from going insane when looking for a file ;) I've also got a couple of auto-indexed directories hiding inside a non-indexed, non-roboted directory. It's for the benefit of humans. (It makes sense if you know the full context.)

"a request for a directory should end in a trailing slash or be 301 redirected to the trailing-slash url.
otherwise it is simply a request for a resource (file) within that url path (directory).
you are actually describing a directory without an index."

We may not be understanding each other perfectly. For example: for arcane historical reasons I have a file called /widgets/foobar.html. The "foobar" file is functionally the index file: it contains links to everything else in its directory, and my internal links from other directories point to it.

If people ask for /widgets (with or without slash) or /widgets/index.html they will get hit in the face with a wholly undeserved 403 or 404. That's where the htaccess comes in:

RewriteRule widgets/?(index\.html?)?$ /widgets/foobar.html [R=301,L]
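
In context, that rule might sit in the root .htaccess roughly as sketched below. The RewriteEngine line and the leading ^ anchor are assumptions added here, not part of the rule as posted; the anchor keeps the pattern from also matching paths that merely end in "widgets/", such as /old/widgets/.

```apache
# Assumes mod_rewrite is enabled and this .htaccess is in the site root,
# where the path is matched without its leading slash.
RewriteEngine On

# Matches "widgets", "widgets/", "widgets/index.htm" and "widgets/index.html",
# then sends a permanent redirect to the page that acts as the index.
# "widgets/foobar.html" itself does not match, so there is no redirect loop.
RewriteRule ^widgets/?(index\.html?)?$ /widgets/foobar.html [R=301,L]
```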

Right now I've got two of them. One of the two will eventually pick up a proper index; the other won't.

Hm. Come to think of it, there's a third one that should get the same treatment. It's either that or rename the splash screen-- but then I'd have to redirect all those ### robots. Naah.

And then there are the e-books, which have their own directory structure for other historical reasons. Sigh.