Forum Moderators: Robert Charlton & goodroi


5 years later, Googlebot is still crawling my old no-HTTPS URLs

410s? robots.txt?

         

guarriman3

3:27 pm on Feb 23, 2024 (gmt 0)

10+ Year Member Top Contributors Of The Month



In 2019, I removed my old HTTP site (300k URLs) and consolidated some content. That is:
http://example.com/photos/my-product-name (no HTTPS) -301-> https://example.com/product/my-product-name#photos (HTTPS and consolidated)

http://example.com/product/my-product-name (no HTTPS) -301-> https://example.com/product/my-product (HTTPS and consolidated)


However, five years later, Googlebot keeps crawling the original no-HTTPS URLs, and it reports other no-HTTPS URLs as the 'Referring page':
http://example.com/product/my-product-name
http://example.com/category/my-category-name
http://example.com/product/another-product-name


Some facts:
  • I've checked the canonical tag of 'https://example.com/product/my-product-name' and it's OK.
  • None of the no-HTTPS URLs are in a Sitemap file. http://example.com/sitemap.xml redirects to https://example.com/sitemap.xml.
  • The 301 redirects of 'http://example.com/photos/my-product-name' and 'http://example.com/category/my-category-name' work OK. They redirect to the desired new URLs.
  • The internal linking is OK; everything links to HTTPS.

Two questions:
  • After 5 years, should I return 410s for the old no-HTTPS URLs to stop Googlebot from crawling them?
  • Some of the old URLs were never indexed (they had 'noindex'). Could I even add them to robots.txt?
    tangor

    9:38 pm on Feb 23, 2024 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    I wouldn't worry about it. The pages are not there, there's nothing to "index" or "rank."

    Search engines (all of them) re-check old URLs from time to time and have done so since day one.

    NOTE: even if you serve a 410, they will ask for the same URL again.

    View it as G's time wasted crawling on their side versus YOUR TIME spent setting up the 410s.

    lucy24

    12:53 am on Feb 24, 2024 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    How often are they crawling? Every now and then, or as often as when the pages existed at the original URL? If it's only now and then, think no more about it; G### never forgets a URL. Sometimes you'll see an upsurge in requests when some old shopping list rises to the top for a few weeks. (Actually I see this more often with Bing, which either has a longer memory or is a slower learner, but G### will do it too.)
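One way to put a number on "how often" is to count Googlebot requests per month in the access log of the plain-HTTP host. The sketch below is only illustrative. It assumes the common "combined" log format, assumes the log it reads covers HTTP (port 80) requests only, and trusts the user-agent string, which can be spoofed. The function name and the sample lines are made up for the example.

```python
import re
from collections import Counter

# Assumed "combined" access-log format; adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>\d{2})/(?P<month>\w{3})/(?P<year>\d{4})[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits_per_month(lines):
    """Count requests whose user agent mentions Googlebot, keyed by 'YYYY-Mon'."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # Lines that do not parse, or come from other clients, are skipped.
        if match and "Googlebot" in match.group("agent"):
            counts[f"{match.group('year')}-{match.group('month')}"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines standing in for a real HTTP-only access log.
    sample = [
        '66.249.66.1 - - [23/Feb/2024:15:27:00 +0000] '
        '"GET /photos/my-product-name HTTP/1.1" 301 0 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    ]
    for month, hits in googlebot_hits_per_month(sample).items():
        print(month, hits)
```

If the monthly counts are a small fraction of what they were when the pages lived at the old URLs, that fits the "every now and then" case described above.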

    The option of changing the 301s to 410s is a judgement call only you can make. Do you ever get humans requesting those old URLs? If not, go ahead and 410 them. I assume the old URLs fit some pattern, so you can do the whole thing in just a line or two, which will streamline your htaccess-or-equivalent. Google, unlike some search engines, does recognize the difference between 404 and 410, at least to the extent that it stops requesting 410s sooner. I did something analogous some years ago when I split one site into two. For a few years I scrupulously redirected old-site URLs to the same page on the new site, including single-step redirects when things were rearranged on the new site. But eventually I said, Ah, ### it, and just returned a 410 for all moved directories. There's a site-specific 410 page to cover the remote (but non-zero) occurrence of a human request.
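To illustrate the "line or two" idea, here is a minimal sketch of the decision itself, written in Python rather than as an htaccess rule. Everything in it is an assumption for illustration: the pattern is guessed from the example URLs in this thread, and the function name is hypothetical. It returns 410 only for plain-HTTP requests that match the retired URL families and leaves everything else (including the HTTP sitemap redirect) to the existing handling.

```python
import re

# Hypothetical pattern for the retired URL families; replace with the site's real ones.
RETIRED_HTTP_PATHS = re.compile(r"^/(photos|product|category)/[^/]+/?$")


def status_for(scheme, path):
    """Return 410 for retired HTTP-only URLs, or None to fall through to normal handling."""
    if scheme == "http" and RETIRED_HTTP_PATHS.match(path):
        return 410
    return None


if __name__ == "__main__":
    for scheme, path in [
        ("http", "/photos/my-product-name"),
        ("https", "/product/my-product-name"),
        ("http", "/sitemap.xml"),
    ]:
        print(scheme, path, status_for(scheme, path))
```

Keeping the rule to one pattern makes it easy to check against a list of old URLs before switching the 301s off, and it leaves room for a custom 410 page for the occasional human visitor.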

    Your main concern should be making sure any and all humans can find your pages. If it's a commercial site, as suggested by the /category/ and /product/, it doesn't seem awfully likely that someone will come looking for a five-year-old URL.