Webpages not being crawled by Screamingfrog app

Forum Moderators: phranque

born2run

3:38 pm on Jun 23, 2015 (gmt 0)

Hi, I just made an archive page of my older articles. These are displayed on a paginated page so one can browse through the entire archive.

However, when I use the Screaming Frog SEO tool to crawl my site, the archive section is not crawled at all.

What could be the reason? Thanks

aakk9999

4:17 pm on Jun 23, 2015 (gmt 0)

There could be a few:
- pages blocked by robots.txt and Screaming Frog honouring robots.txt
- something blocking Screaming Frog user agent

Have you checked your server log files? Can you see requests from Screaming Frog? If so, what does the response say?
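As a rough sketch of what to look for in the logs: the snippet below filters access-log lines by user agent and prints the status code the server returned for each request. It assumes Apache's "combined" log format and that the crawler's user agent contains the text "Screaming Frog"; the sample lines, IP addresses, paths, and version number are made up for illustration.

```python
import re

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def crawler_hits(lines, agent_substring="Screaming Frog"):
    """Return (request, status) pairs for lines whose user agent contains agent_substring."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and agent_substring.lower() in m.group("agent").lower():
            hits.append((m.group("request"), int(m.group("status"))))
    return hits

# Hypothetical sample lines, not taken from a real log.
sample = [
    '203.0.113.5 - - [23/Jun/2015:15:40:01 +0000] "GET /archive/ HTTP/1.1" 200 5120 "-" "Screaming Frog SEO Spider/3.3"',
    '203.0.113.5 - - [23/Jun/2015:15:40:02 +0000] "GET /archive/page/7/ HTTP/1.1" 403 312 "-" "Screaming Frog SEO Spider/3.3"',
    '198.51.100.9 - - [23/Jun/2015:15:40:03 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

for request, status in crawler_hits(sample):
    print(status, request)
```

If no archive URLs show up at all, the crawler never requested them (robots.txt or missing links); if they show up with a 403 or similar, something on the server is refusing that user agent.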

born2run

4:21 pm on Jun 23, 2015 (gmt 0)

Are these the access-log files in Apache? What should I search for in that file? Thx

engine

4:32 pm on Jun 23, 2015 (gmt 0)

robots.txt is the first file to look at. If you have one, it should be in the site root, and it should list which paths crawlers are disallowed from fetching.

For example, this blocks everything for every crawler:

User-agent: *
Disallow: /
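
One way to test a rule set without running a full crawl is Python's standard-library robots.txt parser. This is only a sketch: the `is_allowed` helper, the second rule set, and the example paths are hypothetical, and the standard-library parser does plain prefix matching, so it will not necessarily agree with a crawler that also supports wildcards.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, user_agent, path):
    """Parse robots.txt text and report whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# The rule set from the post above: everything is blocked.
block_all = """User-agent: *
Disallow: /
"""

# Hypothetical rule set that blocks only an archive section.
block_archive = """User-agent: *
Disallow: /archive/
"""

agent = "Screaming Frog SEO Spider"
print(is_allowed(block_all, agent, "/archive/page/2/"))      # False
print(is_allowed(block_archive, agent, "/archive/page/2/"))  # False
print(is_allowed(block_archive, agent, "/blog/some-post/"))  # True
```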

aakk9999

4:47 pm on Jun 23, 2015 (gmt 0)

Can you crawl other parts of your site successfully? If so, then it is either a robots.txt block or the URLs for your archive section are not constructed in a way that robots.txt can understand.

lucy24

4:53 pm on Jun 23, 2015 (gmt 0)

Emphasis mine:

The Screaming Frog SEO Spider is robots.txt compliant. It obeys robots.txt in the same way as Google.

It will check robots.txt of the (sub) domain and follow (allow/disallow) directives specifically for the Screaming Frog SEO Spider user-agent, if not Googlebot and then ALL robots. It will follow any directives for Googlebot currently as default. Hence, if certain pages or areas of the site are disallowed for Googlebot, the spider will not crawl them either. The tool supports URL matching of file values (wildcards * / $) just like Googlebot.

My, how familiar that sounds.
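
For anyone unsure what the `*` and `$` matching in that quote means in practice, here is a minimal sketch of the Googlebot-style convention: `*` matches any run of characters, a trailing `$` anchors the end of the URL, and everything else is a prefix match from the start of the path. It is not Screaming Frog's actual code, and the function names and example paths are made up.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with * and a trailing $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "match anything".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def pattern_matches(pattern, path):
    """True if the rule pattern applies to the given URL path."""
    return robots_pattern_to_regex(pattern).match(path) is not None

print(pattern_matches("/archive/*", "/archive/page/7/"))           # True
print(pattern_matches("/*.pdf$", "/files/report.pdf"))             # True
print(pattern_matches("/*.pdf$", "/files/report.pdf?download=1"))  # False
print(pattern_matches("/archive", "/blog/archive"))                # False
```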

Edit:

The user-agent switcher has inbuilt preset user agents for Googlebot, Bingbot, Yahoo! Slurp, various browsers and more. This feature also has a custom user-agent setting which allows you to specify your own user agent.

I don't perfectly understand why this is necessary or even desirable. Wouldn't misrepresenting your UA just make you more likely to be blocked outright?

aakk9999

7:05 pm on Jun 23, 2015 (gmt 0)

There is an option in Screaming Frog to ignore robots.txt exclusion - perhaps worth trying?

born2run

2:09 am on Jun 24, 2015 (gmt 0)

I shall check the robots file today and report back here.

born2run

2:15 am on Jun 24, 2015 (gmt 0)

Yeah, I checked the robots file and it looks OK.

born2run

3:12 am on Jun 24, 2015 (gmt 0)

Yeah, this is puzzling. It (the SEO Spider) only goes up to 6 pages into the archive (the articles are paginated). Then it doesn't index any of the other articles within it. What could be the reason? Thanks!

lucy24

3:16 am on Jun 24, 2015 (gmt 0)

Does "6 pages" mean that it has to follow five links? Or does the first page have direct links to each of the other pages? I'm thinking there might be a setting for recursion depth.
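
To show why the link structure matters, here is a small sketch of a breadth-first crawl with a depth limit. It is an illustration of the idea only, not how Screaming Frog works internally; `max_depth` is a hypothetical parameter name and the URLs are invented.

```python
from collections import deque

def crawl_depths(links, start, max_depth=None):
    """Breadth-first walk of a link graph; returns {page: clicks from start}.

    Pages already at max_depth are recorded but their links are not followed.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if max_depth is not None and depths[page] >= max_depth:
            continue
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

first = "/archive/page/1/"

# Case 1: each archive page links only to the next one (a chain of 8 pages).
chain = {f"/archive/page/{i}/": [f"/archive/page/{i + 1}/"] for i in range(1, 8)}

# Case 2: the first page links directly to every other page.
hub = {first: [f"/archive/page/{i}/" for i in range(2, 9)]}

print(len(crawl_depths(chain, first, max_depth=5)))  # 6
print(len(crawl_depths(hub, first, max_depth=5)))    # 8
```

With a limit of five links, the chained layout stops at the sixth page, while the layout with direct links from the first page reaches all eight.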

born2run

4:03 am on Jun 24, 2015 (gmt 0)

Yeah, the first page has direct links to the other pages. Where is the recursion depth setting? Thanks

born2run

10:01 am on Jun 24, 2015 (gmt 0)

Well, I included the wildcard path for that archive page and it crawled them fine. I guess it takes time to reach 100%, so I'll have to wait.