Articles in own directory not getting indexed

         

Fortune Hunter

3:47 pm on Apr 23, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not sure if this is the right spot for this post or not. I have a few articles I have written for my site. For ease of management I have placed all the pages those articles are on into a separate folder called "Articles".

I discovered when checking my Google site maps for two different sites that are set up the same way that when Google tries to access that directory it gets a 403 error. That obviously means none of these articles are being indexed.

I really don't want to dump those articles in the main directory because it is a pain to manage them in there with all the other pages. However not having them indexed is also an issue. I have to believe if Google bots are getting this error when they try to access that directory and index them that other search engines are as well.

I am not sure why this would happen in the first place. The directory is not write protected and it only has HTML and PDF docs inside, and other folders I put things in don't seem to have the same problem. I am looking for some advice on how to handle this.

Fortune Hunter

2:23 am on Apr 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I can't be the only one this has ever happened to.

coopster

2:10 pm on Apr 26, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



when Google tries to access that directory it gets a 403 error

I must be missing something so forgive me if I am stating something obvious. Why is Google getting a 403?

purplecape

9:10 pm on Apr 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yeah, I couldn't understand the original post either, and I've never had this experience.

One thing does occur to me: Are these articles accessible via links, or are you just telling Google to index them via the site map? If they are in a directory and NOT linked to from anywhere else, Google bot may not have permission to browse the directory. Hence, the 403.

Fortune Hunter

2:44 am on Apr 27, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have set up an XML site map and submitted it via Google's webmaster tools. In there it gave me a warning and when I viewed the warning it said it had received a 403 Forbidden error when trying to index the files in the article directory. It gave this exact same error on two different sites that just happen to have an articles directory set up the same way.

The articles themselves are inside the folder on HTML pages, but there is also an HTML page outside the folder called Articles.htm. If you view it, it lists the names of the articles, and each name serves as a link to the first page of that article in the folder. As far as I can tell there should be no reason why the Google bots can't simply follow the links into the folder and index all the pages. They are all connected, and there is nothing special about the folder that should prevent them access.

coopster

5:58 pm on Apr 28, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Well, yes, there must be, otherwise the pages would not be returning a 403 Forbidden. Can you access the resource(s) via a browser without having to authenticate (login)?

A 403 error code means the user agent cannot access the requested resource. It may mean the wrong username and/or password were sent in the request, or the permission settings forbid access to the resource, or perhaps even that no default directory index page is present. The Apache directive DirectoryIndex defines the default index page name(s).

If you are not requiring authentication or if the page is not a cgi script or something that requires special permission settings for access to the resource, then maybe you are looking at an index error. Are you certain Google isn't telling you that it is receiving that error for a missing index page? Perhaps it can see the other resources, but for some reason it is also attempting to find an index page in a directory where an index page does not exist?
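To make the missing-index case concrete, here is a rough Python model of how a server like Apache typically answers a request for a bare directory URL. It is only a sketch under stated assumptions: the function name, its parameters, and the file names are made up for illustration, and a real server applies many more rules (permissions, authentication, rewrites) than this.

```python
def resolve_directory_request(files, directory_index=("index.html",), auto_index=False):
    """Return the status a simplified server gives for a bare directory URL.

    files           -- names of the files present in the directory (hypothetical)
    directory_index -- index names to look for, in order (like DirectoryIndex)
    auto_index      -- True if automatic directory listings are switched on
    """
    # If any configured index page exists, it is served for the directory URL.
    for name in directory_index:
        if name in files:
            return 200
    # No index page: a generated listing is served only if listings are enabled.
    if auto_index:
        return 200
    # No index page and no listing allowed: the request is refused.
    return 403


articles = {"article1.htm", "article2.htm"}
print(resolve_directory_request(articles, ("index.htm", "index.html")))
print(resolve_directory_request(articles | {"index.htm"}, ("index.htm", "index.html")))
```

The first call models a folder with articles but no index page; the second models the same folder after an index.htm has been added. Requests for the individual article files do not go through this lookup at all, which is why the articles themselves can be fine while the directory URL is refused.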

Fortune Hunter

9:29 pm on Apr 28, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



but for some reason it is also attempting to find an index page in a directory where an index page does not exist?

I suppose this is possible, but my question would be why look for an index page in the directory at all? The article page on the outside is pointing to the exact page in the folder where the file is at so I can't see any reason why it should even need one.

The directory is not password protected, and it does not have any type of CGI script or any other script associated with it.

coopster

9:46 pm on Apr 28, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Do you have the directory that contains the articles in your sitemap?
http://www.example.com/articles/

If so, Google is going to try and follow that link and since there is no directory index file in there you get the 403 forbidden. It must be following the link from somewhere, you'll have to chase it down. I would begin by reviewing your sitemap XML file.
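If the sitemap is long, a small script can do the reviewing. The following Python sketch lists every sitemap entry that points at a directory rather than a file. It assumes the sitemap uses the standard sitemaps.org namespace; the function name and the sample XML are invented for the example, and the site root is deliberately skipped because a home page normally has an index.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def directory_style_urls(sitemap_xml):
    """Return sitemap URLs whose path ends in '/' (excluding the site root)."""
    root = ET.fromstring(sitemap_xml)
    flagged = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        path = urlparse(url).path
        # A trailing slash means the URL names a directory, not a file.
        if path.endswith("/") and path != "/":
            flagged.append(url)
    return flagged


# Hypothetical sitemap, for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/articles/</loc></url>
  <url><loc>http://www.example.com/articles/article1.htm</loc></url>
</urlset>"""

print(directory_style_urls(SAMPLE_SITEMAP))
```

Each URL it reports is worth opening in a browser: if there is no index page behind it, that entry is a likely source of the 403.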

Fortune Hunter

7:27 pm on Apr 29, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If so, Google is going to try and follow that link and since there is no directory index file in there you get the 403 forbidden.

Ah, we might be getting somewhere. Yes, the directory is in my site map. If I understand your comment correctly, if I am going to have it in my XML site map then I have to have an index page in that directory for Google to crawl to. Is that correct?

Here are two possible solutions, which would you recommend...

1. I move the articles page that is currently outside the directory into it and rename it index.htm. Since it will then be in the actual articles directory, it won't cause a problem.

2. I put another page in the directory called index.htm and have the articles page point to that page and just have that page be a listing of the article titles the same as the page on the outside is.

If I do number two will Google penalize me for having the exact same page both inside and outside the directory, but named differently?

It never dawned on me that having it on the site map with no index page in the directory would cause this issue.

coopster

9:02 pm on May 1, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Well, I'm not saying that is the issue exactly. I'm asking if you have the directory itself listed without any filenames as a link in your sitemap xml. For example, you should have these ...
http://www.example.com/articles/article1.htm 
http://www.example.com/articles/article2.htm
http://www.example.com/articles/article3.htm
http://www.example.com/articles/article4.htm
http://www.example.com/articles/article5.htm

but if you do not have an index in the directory then you should not have this in the sitemap xml:
http://www.example.com/articles/

Take it out if it is there. Then Google won't try to follow the link and put it in their index.

Fortune Hunter

12:06 am on May 2, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



but if you do not have an index in the directory then you should not have this in the sitemap xml:

http://www.example.com/articles/

This is exactly how it was set up, including not having the index.htm inside the folder, which is probably why I was getting the error. It never dawned on me this would cause the problem because I thought it [Google] would simply find the Articles.htm page outside the directory and follow the links on it into the folder. It did not do that; it used the XML site map instead, which was set up like you said.

I changed it by putting the Articles.htm inside the directory and changed its name to index.htm. I updated the site map and re-submitted it to Google. It appears it *might* have solved the problem. I am cautiously optimistic as it seems the errors have disappeared but I want to give it a few days to make sure it solved the issue and it doesn't come back.
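For anyone who wants to re-check without waiting on the webmaster tools report, here is a minimal Python sketch that requests each URL and reports the ones that do not come back with 200. The function names and the stand-in status table are my own for illustration. Note that it only catches HTTP error responses, not network failures, and that a server can treat a script differently from Googlebot, so a clean result here is a hint rather than proof.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx answers are raised as exceptions; the code is what we want.
        return err.code


def failing_urls(urls, fetch=http_status):
    """Map each URL that does not return 200 to the status it returned."""
    statuses = {url: fetch(url) for url in urls}
    return {url: code for url, code in statuses.items() if code != 200}


# Stand-in for a real server so the example runs without network access.
pretend_statuses = {
    "http://www.example.com/articles/": 403,
    "http://www.example.com/articles/article1.htm": 200,
}
print(failing_urls(pretend_statuses, fetch=pretend_statuses.get))
```

To test a live site, call failing_urls with the list of URLs from the sitemap and leave out the fetch argument so the real HEAD requests are made.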