Do SE spiders follow subsequent folders

Forum Moderators: phranque

myrrh

6:36 pm on Jul 10, 2011 (gmt 0)

If I have folders in the root folder, but no links from pages within the root folder to pages within those lower level folders, will search engine spiders go to pages within those folders anyway?

g1smd

6:42 pm on Jul 10, 2011 (gmt 0)

For a search engine to discover a URL, there must be a link to it from somewhere the search engine has already discovered.

That said, search engines do sometimes discover unlinked URLs by other methods, so you can never be sure.

Is this a question about making sure they DO find it, or is it about what steps to take to ensure that they do NOT find it?

brotherhood of LAN

6:42 pm on Jul 10, 2011 (gmt 0)

No, but there are other ways they can find the pages:

- A sitemap
- Referrer strings: when someone clicks a link on a page within the folder, that page's URL ends up as the referrer in the log files of the site you link to. Some of those logs are publicly accessible, and spiders then pick up your URL from them.
- Browsers/add-ons logging which pages visitors view in the folders

If you want to prevent public access to the folders, it's best to explicitly add authentication or otherwise block viewing.
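
As a rough illustration of the authentication option, here is a minimal sketch of HTTP Basic authentication in a .htaccess file placed inside the folder. It assumes Apache with .htaccess overrides enabled (AllowOverride AuthConfig). The realm name and the password file path are hypothetical placeholders, and the password file would need to be created separately (for example with the htpasswd utility).

```apache
# Require a login for everything in this folder.
AuthType Basic
# Hypothetical realm name shown in the browser's login prompt.
AuthName "Work in progress"
# Hypothetical path; keep the password file outside the web root.
AuthUserFile /path/to/.htpasswd
Require valid-user
```

A spider that has no credentials gets a 401 response and cannot read the pages, whether or not it has learned the URL.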

myrrh

7:08 pm on Jul 10, 2011 (gmt 0)

The reason I ask is that I am restructuring a website, and all the current pages are in folders inside an intermediate folder (which I'll call "abc") in the root folder. I am eliminating that unnecessary folder "abc" and intend to put all the folders directly in the root folder.

So, in the meantime I am creating folders in the root folder that are mostly duplicates (albeit with updated code using style sheets instead of tables for layout) of the currently used folders.

When I finish the new coding, I'll eliminate folder "abc" and change the include that controls the nav menu to make the newly built pages active. I plan on using 301s when I make the change.
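
One way such a 301 could be written, as a minimal sketch only: it assumes Apache with mod_alias, a .htaccess file in the root folder (AllowOverride FileInfo), and that every path under "abc" keeps the same name once moved to the root. The hostname www.example.com is a placeholder, not the real site.

```apache
# Permanently redirect /abc/anything to /anything.
# www.example.com is a hypothetical hostname; substitute the real one.
RedirectMatch 301 ^/abc/(.*)$ http://www.example.com/$1
```

With this rule, a request for /abc/widgets/page.html would be answered with a 301 pointing at /widgets/page.html. If any folder or file names change during the move, those would need their own rules.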

My concern about the spiders was that in the meantime, there are pairs of pages on the site with duplicate content.

lucy24

9:07 pm on Jul 10, 2011 (gmt 0)

Unfortunately there is no "in the meantime". As noted in lots of threads in lots of places, once a search engine knows that an url exists, it will keep looking for it forever. But if you do a 301 redirect the search engine should recognize that it's all the same page with no duplication.

tangor

9:12 pm on Jul 10, 2011 (gmt 0)

If on Apache, look at IndexIgnore ... might be of use.
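
For context, IndexIgnore only hides entries from Apache's auto-generated directory listings; it does not block requests for the files themselves. A minimal sketch, assuming mod_autoindex is in use and .htaccess overrides are allowed (AllowOverride Indexes):

```apache
# Hide every entry from the auto-generated listing of this directory.
# The files can still be fetched by anyone who knows their URLs.
IndexIgnore *
```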

myrrh

9:53 pm on Jul 10, 2011 (gmt 0)

...once a search engine knows that an url exists, it will keep looking for it forever.

Yes, that's true. But that is not my question. My question is: do the spiders find pages that have no links to them but are in folders in the root folder.

g1smd

10:45 pm on Jul 10, 2011 (gmt 0)

They shouldn't but they often do, as they get leaked signals from so many sources.

lucy24

10:57 pm on Jul 10, 2011 (gmt 0)

do the spiders find pages that have no links to them but are in folders in the root folder

Not yet. But remember it isn't only internal links. If anyone anywhere in the universe mentions the page-- or it shows up in a publicly viewable log-- g### will know about it. So it is a good idea to exclude robots from a directory even if there should be no way for them to know of its existence.* Do this at least a week before you actually create the directories, because google outsources its robots.txt reading and it can take an amazingly long time for every last googlebot to get the message.

And make sure the directories are flagged as
Options -Indexes
if that isn't already your default. (It probably is.)

* Yes, this is recursive. The mere fact of mentioning something in robots.txt means all robots now do know it exists-- but if this results in spidering something that was previously unvisited, that's a pretty good reason to lock out the robot at the gate :)

matrix_jan

11:17 pm on Jul 10, 2011 (gmt 0)

I've seen Google find and index a gallery's n.html pages (n being a number that Google just made up) that were not linked from any page. Those were ghost pages: no image, just a hole in the engine, which should have returned a 404 instead of a ghost...

I ended up with more than 600 "Not found" pages in Webmaster Tools. After the fix it's down to around 200. A good but painful experience :)

lucy24

1:08 am on Jul 11, 2011 (gmt 0)

I've seen google find and index gallery's n.html (n is a number that google just made up)

Do pages called galleryn.html really exist for some value of n? If so, they are probably using the standard spammer technique of trying all numbers up to the highest given number of digits. ("Lucy1, Lucy2 ... Lucy23... Lucy 99".) Even a robot can figure that out ;)

matrix_jan

1:25 am on Jul 11, 2011 (gmt 0)

Yep, the gallery images are shown in n.html pages, where n runs from 1 to m. But that wasn't enough for G: it went far above m, sometimes by hundreds.