
Stopping crawl of /feed/

Gemini23

4:27 pm on Jan 13, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Can anyone advise? I have several thousand crawled-but-not-indexed URLs, including a lot of /feed/ URLs, which makes it difficult or impossible to find the few URLs that might slip through the net and not be indexed.

There are about 1000 products, which get updated weekly, with new ones added and old ones dropping off.

1. Is there any way to see which products (WooCommerce) are not indexed?
2. The main point of this post: how do I stop Google and other search engines from crawling URLs with /feed/ in them?

Thanks

not2easy

5:29 pm on Jan 13, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I would use robots.txt. Google follows the directives of robots.txt. A single additional line:
Disallow: /feed/
is all it takes. If you do not currently have a robots.txt file, you will need to start with
User-agent: *
before that Disallow line.
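
A quick way to sanity-check a rule like this before uploading it is Python's standard urllib.robotparser. The sketch below is only an illustration: example.com and the /shop/ path are placeholders, the helper name is made up, and this parser does plain prefix matching, which may not mirror everything Googlebot does.

```python
from urllib.robotparser import RobotFileParser

# The two lines suggested above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /feed/
"""


def is_allowed(url: str, agent: str = "*") -> bool:
    """Return True if the rules above let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host, not the poster's site.
    for url in ("https://example.com/feed/", "https://example.com/shop/"):
        print(url, "allowed" if is_allowed(url) else "blocked")
```

Running it against a handful of real URLs from the site should show whether the rule catches what you expect.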

Gemini23

5:46 pm on Jan 13, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks - I thought Google had stopped following the robots.txt directive? (2019)

Or was that only to do with indexing?

lucy24

5:59 pm on Jan 13, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



robots.txt has nothing to do with indexing, only crawling. A URL can be indexed though not crawled (you’ll see the line about “This site’s robots.txt” blahblah in the search results), or it can be crawled and then not indexed (either by your choice or Google's).

phranque

11:54 pm on Jan 13, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Gemini23 wrote: "I thought Google had stopped following the robots.txt directive? (2019) Or was that only to do with indexing?"

lucy24 wrote: "robots.txt has nothing to do with indexing, only crawling."

Google's announcement: "In the interest of maintaining a healthy ecosystem and preparing for potential future open source releases, we're retiring all code that handles unsupported and unpublished rules (such as noindex) on September 1, 2019."

source: A note on unsupported rules in robots.txt [developers.google.com]

lucy24

1:13 am on Jan 14, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Good grief. I had no idea putting noindex in robots.txt--as opposed to an in-page meta, or a global X-Robots-Tag--was even a thing.

File under: Today I Learned...

Gemini23

12:41 pm on Jan 14, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the above.

To conclude (and to clarify):
1. The noindex rule is no longer supported in robots.txt (Google, 2019).
2. To prevent crawling: Disallow: /feed/ or Disallow: */feed/ will (or should) prevent search engines from crawling URLs with /feed/ in them. (I had thought that this had been stopped along with noindex, but it seems I was wrong.)

Is that correct?
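
As a hedged illustration of how the two patterns in point 2 differ, here is a minimal sketch of path matching as Google's robots.txt documentation describes it: '*' stands for any run of characters, a trailing '$' anchors the end of the URL, and anything else is a prefix match from the start of the path. The function name and the sample paths are made up for the example, and this is not Google's actual parser.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check a robots.txt path pattern against a URL path.

    '*' matches any run of characters, a trailing '$' anchors the end,
    and everything else is a prefix match from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match only matches at the start of the path, giving prefix semantics.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    # Hypothetical paths; a real shop's feed URLs may look different.
    paths = ["/feed/", "/product/widget/feed/", "/product/widget/"]
    for pattern in ("/feed/", "*/feed/"):
        for path in paths:
            verdict = "blocked" if rule_matches(pattern, path) else "allowed"
            print(f"Disallow: {pattern:8} {path:24} {verdict}")
```

Under these assumptions, Disallow: /feed/ only covers paths that begin with /feed/, while Disallow: */feed/ also covers feeds nested under other paths, such as a per-product feed.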

lammert

12:52 pm on Jan 14, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You might have success by setting the X-Robots-Tag [developers.google.com] header on all responses for /feed/ URLs. This is something you can configure server-side.
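
To show the logic only, here is a minimal sketch of that idea as a Python WSGI wrapper. A WooCommerce site runs on PHP, so in practice the equivalent rule would live in the web server or application configuration; every name here (add_noindex_for_feeds, demo_app, headers_for) is hypothetical.

```python
def add_noindex_for_feeds(app):
    """Wrap a WSGI app so responses for feed URLs carry X-Robots-Tag."""
    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            # Only touch responses whose path contains /feed/.
            if "/feed/" in environ.get("PATH_INFO", ""):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return wrapped


def demo_app(environ, start_response):
    # Stand-in for the real site; always returns a tiny plain-text body.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


def headers_for(path):
    """Run the wrapped demo app once and return its response headers."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["headers"] = dict(headers)

    list(add_noindex_for_feeds(demo_app)({"PATH_INFO": path}, start_response))
    return captured["headers"]


if __name__ == "__main__":
    print(headers_for("/product/widget/feed/"))
    print(headers_for("/product/widget/"))
```

One thing to keep in mind: a crawler can only see this header if it is allowed to fetch the URL, so a noindex header and a robots.txt Disallow on the same path work against each other.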