Forum Moderators: Robert Charlton & goodroi


Thousands of "Indexed, though blocked by robots.txt"

Should I be worried?

         

guarriman3

10:17 am on Jul 24, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi,

Due to crawl budget issues and thin content on my website (800,000+ URLs), I decided to remove one section from the Google index.

1) I added "noindex" metatag within the HTML code of all the pages of such section
2) I waited 14 days
3) I added "Disallow: /section/" within my 'robots.txt' file

However, on the very day I modified 'robots.txt', the number of pages under "Indexed, though blocked by robots.txt" (in Google Search Console > Coverage) jumped to 15k URLs.

I thought that, after I inserted the 'noindex', Google would be aware that I had decided to exclude that section from the index. Should I be worried about this high number of "Indexed, though blocked by robots.txt" URLs? Is Google penalizing my site?

Any similar experience is welcome. Thank you.

not2easy

1:28 pm on Jul 24, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



When you say
I added "noindex" metatag within the HTML code of all the pages of such section
I'll assume that the meta tag was added in the <head></head> part of your html. It can take Google a lot longer than 2 weeks to digest changes. It is best not to block crawling so they can see that it remains noindexed if your noindexed content is remaining on your site.
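To illustrate why blocking crawling gets in the way, here is a minimal Python sketch using the standard library's robots.txt parser. The example.com URLs are placeholders, and the parser only approximates how Googlebot reads the rule, but it shows the point: once "Disallow: /section/" is in place, a compliant crawler may not fetch those pages at all, so it never gets to read the noindex tag inside them.

```python
from urllib.robotparser import RobotFileParser

# robots.txt with the rule described in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /section/
"""


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` is allowed to fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Placeholder URLs, not the poster's real site.
print(can_crawl(ROBOTS_TXT, "https://example.com/section/page-1.html"))
print(can_crawl(ROBOTS_TXT, "https://example.com/other/page.html"))
```

The first call reports the /section/ page as not crawlable, which is exactly the state in which an already-indexed URL can show up as "Indexed, though blocked by robots.txt".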

If you are planning to noindex this content it is best to remove it as it has the same effect over the long term as the noindex tag does. Before making such changes, I hope you have researched whether those pages receive any traffic or have inbound links. If they do have human traffic or inbound links, you would lose whatever benefit they bring to your site.

When you are removing content, it has been useful in my experience to let Google know it is going away with the unavailable_after meta tag:
<meta name="GOOGLEBOT" content="unavailable_after: 24 Jul 2021 10:00:00 UTC" /> 
along with the noindex change. You might want to add a "noarchive" element to that noindex meta tag as well.
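As a rough sketch of how those tags could be generated together, assuming Python is available in your build or templating step: the function name removal_meta_tags is made up for illustration, and the date format simply mirrors the example above.

```python
from datetime import datetime, timezone


def removal_meta_tags(when: datetime) -> list:
    """Build the robots meta tags for a page that is going away at `when`.

    Hypothetical helper: the name and the exact tag layout are assumptions.
    """
    # Same date layout as the example tag, always expressed in UTC.
    stamp = when.astimezone(timezone.utc).strftime("%d %b %Y %H:%M:%S UTC")
    return [
        '<meta name="robots" content="noindex, noarchive">',
        f'<meta name="googlebot" content="unavailable_after: {stamp}">',
    ]


for tag in removal_meta_tags(datetime(2021, 7, 24, 10, 0, tzinfo=timezone.utc)):
    print(tag)
```

Both tags belong in the <head> of each page being retired.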

They explain the what, how and why here: https://developers.google.com/search/docs/advanced/robots/robots_meta_tag

guarriman3

11:03 am on Jul 26, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi @not2easy, thank you for your kind answer.

I'll assume that the meta tag was added in the <head></head> part of your html.

Correct. I added "<meta name='robots' content='noindex'>" in the 'head' section.
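For anyone who wants to double-check that the tag is really being served, here is a small Python sketch, using only the standard library, that parses a page's HTML and reports whether a robots or googlebot meta tag contains noindex. The class and function names are my own, and it is only a sanity check: it does not look at the X-Robots-Tag HTTP header or at whether the tag sits inside <head>.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in robots/googlebot meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            # Directives are comma-separated, e.g. "noindex, noarchive".
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML carries a noindex robots directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = "<html><head><meta name='robots' content='noindex'></head><body></body></html>"
print(has_noindex(page))
```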

It can take Google a lot longer than 2 weeks to digest changes. It is best not to block crawling so they can see that it remains noindexed if your noindexed content is remaining on your site.

I had read somewhere that 2 weeks was the optimal waiting time before blocking the crawling bots. Obviously, I was wrong and I should not have blocked them.

If you are planning to noindex this content it is best to remove it as it has the same effect over the long term as the noindex tag does.

The fact is that these URLs account for 0.5% of my website's traffic, but they make up 3% of its total URLs.

For now, I would like to keep the information (not remove the content) because it's useful for 0.5% of my users. But I do not want Google to waste crawl budget on these URLs, and I would rather boost other URLs that are more interesting for users and for me (in economic terms). These URLs have not received any external links.

Thank you for the tips on removing the content, but I'm afraid that's not my case. My apologies, I should have expressed myself better from the beginning.

Perhaps the best option is to let Google crawl these URLs again (by removing the Disallow rule from robots.txt), and to check every month whether they are still indexed. As soon as they are no longer indexed, I would block them again in robots.txt.
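The monthly check could be reduced to something like the following Python sketch. It assumes I build the index status by hand (for example, from a Search Console export) as a mapping of URL to "still indexed?"; the function names and the sample URLs are hypothetical.

```python
from typing import Dict, List


def still_indexed(index_status: Dict[str, bool]) -> List[str]:
    """Return the URLs that are still reported as indexed, sorted."""
    return sorted(url for url, indexed in index_status.items() if indexed)


def safe_to_block(index_status: Dict[str, bool]) -> bool:
    """Only re-add the Disallow rule once no URL in the section is indexed."""
    return not still_indexed(index_status)


# Hypothetical monthly snapshot of the section.
snapshot = {
    "https://example.com/section/a.html": False,
    "https://example.com/section/b.html": True,
}
print(still_indexed(snapshot))
print(safe_to_block(snapshot))
```

As long as any URL is still listed, the section stays crawlable so Google can keep seeing the noindex.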