Forum Moderators: Robert Charlton & goodroi

Google indexing issue with robots.txt

         

tihami

4:08 pm on Jan 19, 2017 (gmt 0)



Hi,
I am using the Magento developer platform.
When I submit the sitemap, the error below is shown.

Warnings
Url blocked by robots.txt.
Sitemap contains urls which are blocked by robots.txt.


My robots.txt is:

# Google Image Crawler Setup
User-agent: Googlebot-Image
Disallow:

# Crawlers Setup
User-agent: *

# Directories
Disallow: /404/
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /downloader/
Disallow: /errors/
Disallow: /includes/
#Disallow: /js/
#Disallow: /lib/
Disallow: /magento/
#Disallow: /media/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /scripts/
Disallow: /shell/
Disallow: /skin/
Disallow: /stats/
Disallow: /var/

# Paths (clean URLs)
Disallow: /index.php/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalogsearch/
#Disallow: /checkout/
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
Disallow: /catalog/product/gallery/

# Files
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt

# Paths (no clean URLs)
#Disallow: /*.js$
#Disallow: /*.css$
Disallow: /*.php$
Disallow: /*?SID=

The warning is not going away, and I need assistance with this issue.

How can we fix it?

tihami

jakebohall

4:58 pm on Jan 19, 2017 (gmt 0)

10+ Year Member



It would seem that you are including URLs in your sitemap that are then blocked by robots.txt. You don't want to include URLs in your sitemap if you are then going to tell Google or other bots that they cannot crawl the page.

You need to go through whatever code is generating your sitemap and remove any URLs that match the patterns in your robots.txt file. You might also want to consider linking to your sitemap from your robots.txt file, just as a best practice :)
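If it helps, here is a rough Python sketch of that check: it takes a list of sitemap URLs and a list of Disallow patterns and reports which URLs would be blocked. The sample URLs and the function names (`rule_to_regex`, `blocked_urls`) are made up for illustration, and the wildcard handling (`*` as "any characters", a trailing `$` as "end of URL") is my assumption about how Google reads these patterns. It ignores Allow lines and per-bot sections, so treat it as a first pass only. As far as I know Python's built-in `urllib.robotparser` only does plain prefix matching, which is why the patterns are converted by hand here.

```python
import re
from urllib.parse import urlsplit


def rule_to_regex(pattern: str) -> re.Pattern:
    """Turn a robots.txt path pattern into a regex.

    Assumes Google-style wildcards: '*' matches any run of characters
    and a trailing '$' anchors the match at the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' was.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def blocked_urls(sitemap_urls, disallow_patterns):
    """Return (url, pattern) pairs for sitemap URLs hit by a Disallow pattern."""
    # An empty 'Disallow:' blocks nothing, so empty patterns are skipped.
    rules = [(p, rule_to_regex(p)) for p in disallow_patterns if p]
    hits = []
    for url in sitemap_urls:
        parts = urlsplit(url)
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        for pattern, regex in rules:
            # re.match anchors at the start, like a robots.txt prefix rule.
            if regex.match(target):
                hits.append((url, pattern))
                break
    return hits


# A few of the patterns from the robots.txt posted above.
disallow_patterns = ["/index.php/", "/catalog/product/view/", "/*.php$", "/*?SID="]

# Hypothetical sitemap entries, for illustration only.
sitemap_urls = [
    "https://www.example.com/index.php/shoes.html",
    "https://www.example.com/shoes.html",
    "https://www.example.com/cart.php",
    "https://www.example.com/shoes.html?SID=abc",
]

for url, pattern in blocked_urls(sitemap_urls, disallow_patterns):
    print(f"{url}  blocked by  Disallow: {pattern}")
```

Any URL this prints should either come out of the sitemap or have its Disallow line relaxed, depending on whether you actually want that page crawled.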

not2easy

5:52 pm on Jan 19, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Log in to your Google Search Console (GSC) account and use the "Fetch as Google" tool to see which files are blocked. Start with the site's home page. Often it is a minor issue, like some third-party script. They show that error even when it is only their own ad script that is blocked.

As jakebohall suggested, it is a best practice to add a link to your sitemap at the end of your robots.txt file like so:
Sitemap: https://www.example.com/sitemap.xml


Another best practice is to "allow" Google in that robots.txt file with an empty Disallow line at the top, right after the User-agent: * line, like this:
User-agent: *
Disallow:

This says that "nothing is disallowed except for what is on the following list".

aristotle

6:54 pm on Jan 19, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



#Disallow: /*.js$
#Disallow: /*.css$


Did you create those lines yourself, or did you get them from somewhere?

lucy24

8:52 pm on Jan 19, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



# Crawlers Setup
User-agent: *

# Directories

Is that a verbatim, line-for-line quote of your robots.txt? Blank lines in robots.txt (unlike in htaccess) have syntactic meaning: “I’m done with this section; now a new section begins.”
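
One way to see the blank-line effect for yourself is with Python's standard-library `urllib.robotparser`, which reads a blank line as the end of a section. This is only a sketch of how one strict parser behaves. Other parsers, possibly including Google's, may be more forgiving about blank lines, so take it as a reason to tidy the file rather than as proof of what Googlebot does. The helper name `is_allowed` and the test URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Same layout as the posted file: a blank line right after User-agent.
WITH_BLANK = """\
# Crawlers Setup
User-agent: *

# Directories
Disallow: /app/
"""

# Identical rules, but with the blank line removed.
WITHOUT_BLANK = """\
# Crawlers Setup
User-agent: *
# Directories
Disallow: /app/
"""

# Hypothetical URL under /app/, used only for this demonstration.
url = "https://www.example.com/app/etc/local.xml"

print("with blank line, allowed:   ", is_allowed(WITH_BLANK, url))
print("without blank line, allowed:", is_allowed(WITHOUT_BLANK, url))
```

With this parser, the blank line cuts the Disallow rules off from their User-agent line, so they are never applied. Removing the blank line attaches them to the `*` section again.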