Forum Moderators: Robert Charlton & goodroi
I assume what will now happen is that the feed pages will eventually go supplemental because there are no links to them from anywhere. If they were real pages I would delete them but in this case I can't, as there is nothing to delete. Should I be doing something more? What ultimately happens to a page that is no longer linked to from anywhere but is never actually deleted?
I would prefer to unlink them but I read somewhere that orphaned pages are ill-advised, and as I can't delete them I have made a new page that only links to all those RSS Feed 'pages', purely for them not to be orphaned.
It would still be interesting to know what eventually happens to pages indexed by Google but that become orphaned.
They cannot be deleted because there is nothing to delete.
...orphaned pages are ill-advised, and as I can't delete them...
...uh... er... what? :)
I don't have a clue about xml and rss and all this but...
Aren't RSS feeds XML files?
Delete them. ( heh... what's YOUR favourite button on the keyboard? )
If there's no PAGE to delete (I'm imagining these are URLs that trigger some kind of server response, or pages generated dynamically on the fly... correct me if I'm wrong), how would unlinking this "trigger" lead to an orphaned PAGE?
I don't get it... maybe i'm a bit slow.
Whatever you want deindexed so badly that supplemental results and gradual dropout won't do (which is what happens to orphaned pages, veeeerry slowly), you just have to take care of the URLs that currently return a 200 response. Meaning: when G requests the same URLs again, even though the links to them are gone from your site (which it will do for a long time, btw, although less and less frequently), the reply must NOT be a status 200. Otherwise, if someone links to them for fun, they'd be back anytime.
So there has to be a way to serve a 404 for these URLs, otherwise they will not be deindexed. (Oh, btw, why do you want to deindex them in the first place? Not that I didn't do the same with our phpBB2.)
I've also read that too many orphaned pages can lead G to think you're not maintaining your site well enough, but I've yet to see this cause anything other than trash in the index... for site: searches. Which no one sees.
(Uh... not that we have a single orphaned URL on our site, but indexes we shoved into images-only directories got picked up without a single link to them... must be the sitemap at G, so do pay attention to that too ;)
If I delete the links, what then? (in the long term)
IF it wasn't linked to by anyone that is, and that's including scrapers, others displaying your feed and so on... also it may come back anytime at a whim during data refreshes, rollbacks, etc.
But...
www.domain.com/345/feed/
You say this URL dynamically generates an XML feed, right?
It will generate it whenever G tries to access it.
It doesn't really matter if there's no link to it, the URL is already recorded at G.
But unless this is a rewritten URL...
there has to be a file, which has the script that generates it. Probably the index that defaults in this directory. If there's nothing else in there i'd delete it altogether although this might be completely wrong ;)
But at least it'd return a 404.
Also there has to be a setting in the CMS that would let you turn the rss feeds off. Then removing the link or not, it'd again... return a 404.
If it's a rewritten URL...
hmm...
then i have no idea ;)
Do the links point to a directory like this?
Do you know what file defaults in there?
...
Ah but anyway... the point is that orphaned files/URLs are unlikely to hurt unless there are a lot of them, and as long as no one links to them. But if you want to get rid of them, you'll need to remove all links and get a 404 response for the URL. And make sure there's no sitemap, residual links, archives, or whatnot that would keep G assuming it should be there.
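For what it's worth, here is a rough sketch of the "get a 404 response for the URL" idea, written as a tiny Python WSGI app. It is only an illustration: the FEED_PATH pattern is my guess at URLs shaped like /345/feed/, and the app itself is hypothetical, not anything WordPress or this particular CMS actually ships.

```python
import re

# Assumption: retired feed URLs look like /345/feed/ or /345/feed
FEED_PATH = re.compile(r"^/\d+/feed/?$")

def app(environ, start_response):
    """Minimal WSGI app: answer 404 for retired feed URLs, 200 otherwise."""
    path = environ.get("PATH_INFO", "/")
    if FEED_PATH.match(path):
        # The URL still "exists" in the CMS, but we refuse to serve it.
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not Found\n"]
    # Stand-in for whatever the site normally serves.
    start_response("200 OK", [("Content-Type", "text/html; charset=UTF-8")])
    return [b"<p>regular page</p>\n"]
```

The point is just that the decision is made on the URL pattern, so it works even when there is no physical file to delete.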
If I remove the links, the URLs will eventually go supplemental in Google but will not return a 404 because, as I said, there is nothing to delete and there is no way to prevent the CMS from continuing to serve up the content.
...just ignore the second half.
... btw you could remove the links, and once you see the URLs becoming supplemental, go to Google's URL removal page and... have the URLs removed. That ends up the same as doing nothing, though, as it only keeps the URLs from being shown or crawled for half a year. Then they might appear again.
But from experience, if a URL is valid and has been indexed by G at any time, GBot will come back periodically to check it out and index it again, but if no links point to it, it will drop out over and over again. So it's going to be a supplemental showing some old, old cache dates, when it's in the index at all.
Orphaned pages are alive and well in G's database :P
Doing cameos every once in a while in the index.
edit: ...yeah what tedster said :D
You could put up a disallow for all /feed/ directories.
I just don't like the "pages disallowed by robots.txt" column in webmaster tools.
You can also use robots.txt to disallow Googlebot from those URLs.
Well yes, but they are already indexed. What I'd like to know is what happens to an indexed URL that is 'unlinked' then goes into the supplemental index but can never actually be deleted. It becomes a 'particular type' of supplemental URL (to paraphrase g1smd) that presumably will remain so, or maybe it does eventually drop out of the index.
If you really want the URLs removed from the Google index, place the proper disallow rule in robots.txt and then use the Google automated url removal tool. There's one choice in the tool that forces a new fetch of your robots.txt and then takes action pretty quickly - in just a couple days.
But even without using the removal tool, robots.txt alone will handle it eventually - even if the URL was previously indexed.
[edited by: tedster at 5:00 pm (utc) on Dec. 29, 2006]
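A quick way to sanity-check a disallow rule for all the /feed/ directories before putting it in robots.txt is to test it against a few paths. The sketch below assumes a rule like "Disallow: /*/feed/" and a simplified reading of the wildcard matching Google documents ('*' matches any run of characters, a trailing '$' anchors the end). It is not a full robots.txt parser, and the function names are made up for the example.

```python
import re

def rule_to_regex(rule):
    """Turn a robots.txt path rule with '*' and optional trailing '$' into a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))

def is_disallowed(path, disallow_rules):
    """True if any non-empty Disallow rule matches the start of the path."""
    return any(rule_to_regex(r).match(path) for r in disallow_rules if r)

# Assumed rule, as it would appear under "User-agent: Googlebot"
RULES = ["/*/feed/"]

print(is_disallowed("/345/feed/", RULES))  # feed URL with trailing slash
print(is_disallowed("/345/feed", RULES))   # same URL without the slash
print(is_disallowed("/345/", RULES))       # the post itself
```

Note that the rule as written only covers the trailing-slash form, so if the feeds also answer without the slash, a second rule would be needed.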
I would prefer to unlink them but I read somewhere that orphaned pages are ill-advised, and as I can't delete them I have made a new page that only links to all those RSS Feed 'pages', purely for them not to be orphaned.
And by doing so, you've provided an entry point for Googlebot to continue to index those orphan URIs which is what you don't want.
I have recently amended the .htaccess file to remove trailing slashes on all URLs (as WordPress pages work either way, with or without).
That concerns me. /file and /file/ are two different locations. Did you 301 the /file/ to /file? If not, you will most likely have some issues to deal with from a dup content standpoint.
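To make the 301 question concrete, here is a small sketch of the decision the server has to make: /file/ should answer with a permanent redirect to /file, while /file and the site root are served as they are. The function name is hypothetical and query strings are ignored, so treat it as an outline of the logic rather than a replacement for the actual .htaccess rule.

```python
def redirect_for(path):
    """Return (301, new_path) if the path has a trailing slash, else None."""
    # The root "/" has nothing to strip, so leave it alone.
    if len(path) > 1 and path.endswith("/"):
        return 301, path.rstrip("/") or "/"
    return None

for p in ("/file/", "/file", "/"):
    print(p, "->", redirect_for(p))
```

If both forms answer 200 instead, the two URLs stay separate locations with the same content, which is the duplicate content worry.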
<added>
I've never checked server headers on feed pages. What type of server headers are being returned when you check that URI...
www.example.com/345/feed/
With .htaccess I have ensured that URLs with a trailing slash redirect to ones without. A feature of WordPress is that all URLs work with or without, and even though there are no internal links to those with a trailing slash, for some time Googlebot has insisted on crawling a few of them with the trailing slash added. Strangely, the site: command lists one or two with the trailing slash (not the ones it insisted on crawling) but the SERPS always feature those without.
The RSS feeds return HTTP/1.1 200 OK.
[edited by: Patrick_Taylor at 5:24 pm (utc) on Dec. 29, 2006]
A URL without a trailing / is often assumed to be a filename. However, if it has no extension (like .html or .jpg), you really do need to make sure that the correct MIME type is specified in the HTTP header for each type of file returned.
If the MIME type is missing, IE makes a guess as to what the content is, by examining the first few bytes of the file, but other browsers often just fail to display the content, or display garbage.
I am aware of a site where all images are shown like JFIF:&\4b&3;d7s[37skq6;@~5w3^")@;4fj.,( etc in Mozilla because of an incorrect or missing MIME type when the content is served.
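As a rough illustration of why the header matters for extension-less URLs: a guess based on the filename has nothing to go on, so the only fallback is peeking at the first bytes, which is roughly what the sniffing described above amounts to. The sketch uses Python's standard mimetypes module; the two sniffing rules are simplified examples of my own, not what any particular browser does.

```python
import mimetypes

def guess_from_path(path):
    """Guess a MIME type from the extension alone; None if there is none."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime

def sniff_first_bytes(data):
    """Very rough content sniffing on the leading bytes (illustrative only)."""
    if data.startswith(b"\xff\xd8\xff"):      # JPEG start-of-image marker
        return "image/jpeg"
    if data.lstrip().startswith(b"<?xml"):    # XML declaration, e.g. a feed
        return "text/xml"
    return None

print(guess_from_path("/images/photo.jpg"))  # extension gives it away
print(guess_from_path("/345/feed"))          # no extension, no guess
print(sniff_first_bytes(b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"))
```

Sending an explicit Content-Type header avoids relying on either kind of guess.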
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head profile="http://gmpg.org/xfn/11">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />