Some questions about ranking erosion and crawl budget

Forum Moderators: Robert Charlton & goodroi

guarriman3

3:04 pm on Feb 3, 2020 (gmt 0)

My website has 2,000,000+ pages, and in June 2019 their SERP rankings started to decline gradually. Traffic is now about 50% of what it was last year.

I've been analyzing the web statistics, and around 150,000 of the webpages had no visits in the last five months. I've read about the concept of "Crawl Budget Optimisation", and I'd like to know your opinion:

Can the ranking-drop be related to the Crawl Budget?

The ratio between "URLs in my Sitemap" and "Pages crawled per day" (of Google Search Console) is 50 (2,000,000/40,000). Should I lower it?
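
To put that ratio in concrete terms, here is a minimal Python sketch of the arithmetic. It assumes, purely for illustration, that Googlebot fetched every sitemap URL exactly once at a constant daily rate before revisiting any of them, which is not how it behaves in practice. The function name is hypothetical.

```python
# Figures taken from the post; the "one even pass" model is a simplifying assumption.
sitemap_urls = 2_000_000
pages_crawled_per_day = 40_000


def days_for_full_pass(total_urls: int, crawled_per_day: int) -> float:
    """Days needed to fetch every URL once at a constant daily crawl rate."""
    return total_urls / crawled_per_day


ratio = days_for_full_pass(sitemap_urls, pages_crawled_per_day)
print(f"One full pass over the sitemap would take about {ratio:.0f} days")
```

Read this way, the ratio is a rough lower bound on how long a change to a given page might wait before Googlebot sees it.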

Is it worth inserting a "<meta name=robots content=noindex>" tag in each one of the pages with no visits?

Any recommended tool to handle/automate the server logs?

Thank you.

goodroi

12:25 pm on Feb 4, 2020 (gmt 0)

Can crawling & ranking be related? Yes
Should I make the sitemap ratio lower? If you want better rankings you don't want to make it harder for Google to crawl your site.
Is it worth adding noindex to unvisited pages? Wrong way of looking at it. You want to cull the duplicates & weak pages, which are not necessarily the same as the unvisited pages.

Some people treat this as a 2-dimensional situation, but really it's more of a 3-dimensional one with many different aspects to address. Boost the backlinks and Google will want to crawl you more. Enhancing the content to make it unique, valuable & fresh will make Google more interested. Focusing on the technical side to boost speed & slim down file size will stretch your crawl budget much farther. The best approach is usually a multi-faceted one; at least that has been my experience.

guarriman3

5:01 pm on Feb 4, 2020 (gmt 0)

Hi @goodroi,

Thank you very much for your concise answer. The thing is, I reached that number of pages (2M+) by multiplying the URLs generated from a database of just 50,000 items.

For each one of the items, I created several different URLs: photos of the item, opinions of the item, how to buy the item, etc. Ten URLs per item, giving 500,000 different URLs on the same domain. Some of the URLs have plenty of information (because there are opinions, photos, etc.), but 80% of them have little data, which turns them into pages with very little information.

Additionally, I created an AMP version for each URL, and two different languages for the information. So I have around 2,000,000 URLs in the same website. And, as far as I'm starting to understand, I've got a problem with the crawl rate of this website of such great size.
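
The way the URL count multiplies can be sketched in a few lines of Python. The figures come from the description above; treating AMP and language as independent factors of two, and applying the 80% share to the 500,000 base URLs, are assumptions made to reproduce the "around 2,000,000" total.

```python
items = 50_000
urls_per_item = 10   # photos, opinions, how to buy, etc.
amp_variants = 2     # assumed: regular page + AMP version
languages = 2        # two languages, as described

base_urls = items * urls_per_item
total_urls = base_urls * amp_variants * languages

thin_percent = 80    # share of base URLs said to have little data
thin_base_urls = base_urls * thin_percent // 100

print(f"base URLs:      {base_urls:,}")
print(f"total URLs:     {total_urls:,}")
print(f"thin base URLs: {thin_base_urls:,}")
```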

As you mention, I should address the problem in several dimensions (backlinks, speed, etc.). But, obviously, I'd like to focus now on the content dimension.

You mention the possibility of "culling the duplicate & weak pages". Do you mean creating a list of those pages and then inserting a 'noindex' on them? Or listing them in 'robots.txt'? Are there any criteria for considering a page duplicate or weak?

Initially, I'm considering removing some of the "artificial" URLs (photos, opinions) and including that data in the main page of the item. My original purpose was to rank well for the search "photos of nameoftheitem", but it may be a good idea to include the photos at the bottom of each item's page.

And I'm also considering keeping only the AMP versions of the main item pages, to cut a big share of the URLs that Googlebot has to crawl.

aristotle

1:44 am on Feb 5, 2020 (gmt 0)

You may have it backwards.
The googlebot crawling rate for a page partly depends on how much traffic google search sends to that page. So

More google traffic --> higher crawling rate

Less google traffic --> lower crawling rate

So the traffic determines the crawling rate.

For the site overall:
Falling traffic --> less crawling

So you may be looking at it backwards

lucy24

5:12 am on Feb 5, 2020 (gmt 0)

Is it worth inserting a "<meta name=robots content=noindex>" tag in each one of the pages with no visits?

It would hardly make a difference, since the pages aren't getting visited anyway. (I'm assuming you meant no human visitors. Search engines always visit sooner or later, at intervals of much less than six months.)

This leads to the question of: What are these unvisited pages for? Do they, for example, hold extremely rare and specialized content that would be of value to one human every three years? If so, you certainly don't want to noindex the page, or you lose that one human who might otherwise become devoted to your site after discovering that you are the only person who could help them.

For me, noindex most often means that I don't want this page to be an entry page: Its content could be useful and appropriate to someone who is already on the site, following internal links, but the chances are more-or-less zero that it is what some random searcher, coming from the outside, was looking for.

lammert

6:20 am on Feb 5, 2020 (gmt 0)

(...) but 80% of them have little data, which turns them into pages with very little information.

It doesn't only turn them into pages with very little information. It also probably turns them into pages of which Google has no clue what they are about and for which keywords they should rank. You shouldn't expect any traffic for them from the search engines, simply because they almost never show in any SERPs. Given this, you shouldn't be surprised that 150,000 pages, or 7.5% of the total, didn't receive any traffic. You should be surprised that the majority of the 80% group did.

guarriman3

8:44 pm on Feb 5, 2020 (gmt 0)

You may have it backwards.
The googlebot crawling rate for a page partly depends on how much traffic google search sends to that page.

Thank you very much for your answer.

The thing is, my website was penalized in mid-November, but the crawling rate (if we measure it through the "Pages crawled per day" report of Search Console) has not decreased.

[i.imgur.com...] (the period shown starts on Nov 1, 2019)

As you can see, over the past few months, the crawling rate has not only not declined after the traffic penalization, but has increased in some periods.

So, my question would be: how do I decrease the crawling rate and make Googlebot crawl just high-quality content? Should I remove the vast majority of AMP pages, leaving just the AMP versions of the main pages of the items?

If so, you certainly don't want to noindex the page, or you lose that one human who might otherwise become devoted to your site after discovering that you are the only person who could help them.

Thank you, very interesting thoughts.

In fact, when I created the 'additional' URLs for each item, I was trying to help users find useful information about "photos of the-item" in an easy way. I wasn't trying to cheat Google, but it seems the search engine considers that I'm trying to deceive it.

So, the big question would be: could I help users to locate "photos of the-item" if I place the photos at the bottom of the main "the-item" page, instead of on a separate URL? Would Google rank it for the items with higher popularity that are currently well positioned?

In other words, I'd try to 'merge' some information, moving it from the 'separate' URLs to the main one of each item.

Before:
- main item URL, with general info and linking to:
--- URL with photos of the item
--- URL with opinions of the item

After:
- main item URL, with general info and including photos+opinions of the item

How could I perform it? With a table of contents plus H2 titles?

pages of which Google has no clue what they are about and for which keywords they should rank.

Thank you. Yes, I'm considering the idea of 'merging' all the pages with low information, so that Google sees fewer URLs, but of higher quality. Does that make sense to you?

tangor

6:57 am on Feb 6, 2020 (gmt 0)

K.I.S.S. ... has many meanings, some of them rude ... but really means Don't Do Something Stupid. Two (or more) urls to the same content is NOT a good signal to any search engine, much less g ...

Improve the content on your landing page that contains ALL the elements you wish to "rank for" ... one URL is crawled faster than multiples of the same ...

Unless I am missing something I'd keep it as simple as possible! G, despite their size and resources, has LIMITED ability to surf the entire web ... and when they encounter a site that is unusual may or may not get the job done.

DDSS (don't do something stupid!)

LESS IS MORE.

Eye on the Prize.

Just the Facts, Ma'am

Lean Machine...

Kendo

7:52 am on Feb 6, 2020 (gmt 0)

I am curious... 2 million pages? What is the website about?

RareBit

9:13 am on Feb 6, 2020 (gmt 0)

I always look at it as if I only have a finite amount of 'link juice' ... and the more pages I have, the thinner it gets spread throughout my site. Reduce your URLs correctly (via redirects/canonicals etc.) and you increase the ranking power of what's left.
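
As a rough illustration of reducing URLs via redirects, here is a minimal Python sketch that maps thin sub-pages back to their main item page, so that each old URL could answer with a 301 to that target. The URL scheme (`/item/<slug>/photos/` and so on) is hypothetical; the real site's paths are not given in this thread.

```python
import re
from typing import Optional

# Hypothetical URL scheme: /item/<slug>/ is the main page, while
# /item/<slug>/photos/ and /item/<slug>/opinions/ are the thin sub-pages.
SUBPAGE = re.compile(r"^/item/(?P<slug>[a-z0-9-]+)/(?:photos|opinions)/?$")


def redirect_target(path: str) -> Optional[str]:
    """Return the main item URL a thin sub-page should 301 to, or None."""
    match = SUBPAGE.match(path)
    if match is None:
        return None  # not a sub-page: serve it normally
    return f"/item/{match.group('slug')}/"


for path in ["/item/blue-widget/photos/", "/item/blue-widget/", "/about/"]:
    print(path, "->", redirect_target(path))
```

The same mapping could feed a rel=canonical tag instead of a redirect if the sub-pages need to stay reachable for visitors.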

JorgeV

10:20 am on Feb 6, 2020 (gmt 0)

Hello-

Just wanted to add that crawl budget is not only about the number of pages a robot will visit in a given time frame, but also the amount of time it will consume. So be sure that your pages are rendering and transferring fast enough. In the "old" GSC interface, you have a page about "exploration", and you'll see the average time it takes for Googlebot to fetch a page.

I recently managed to speed up my pages by (only) 10% (which is a matter of a couple of tens of ms), and as a result, Googlebot is visiting 10% more pages too.

guarriman3

1:56 pm on Feb 7, 2020 (gmt 0)

crawl budget is not only about the number of pages a robot will visit in a given time frame, but also the amount of time it will consume.

A valuable input.

In fact, I started to use CloudFlare some months ago, and the value of "Time Spent Downloading a page" increased automatically from 150 ms to 700 ms. It's a huge rise. And, after reading your comments, I'm realizing that it may be one factor for the SEO ranking erosion.
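
A back-of-the-envelope way to see why this matters, as a Python sketch. It assumes a simplified model in which Googlebot spends a fixed amount of fetch time on the site, so the number of pages it can fetch scales inversely with the average fetch time. That model is an assumption for illustration, not documented Google behavior, and the function name is hypothetical.

```python
def relative_crawl_capacity(old_fetch_ms: float, new_fetch_ms: float) -> float:
    """Pages fetched per unit of time after a change, relative to before.

    Assumes a fixed time budget, so capacity is inversely proportional
    to the average fetch time.
    """
    if old_fetch_ms <= 0 or new_fetch_ms <= 0:
        raise ValueError("fetch times must be positive")
    return old_fetch_ms / new_fetch_ms


# A 10% faster page (e.g. 100 ms down to 90 ms) under this model:
print(f"{relative_crawl_capacity(100, 90):.2f}x the pages")

# The 150 ms to 700 ms increase reported above:
print(f"{relative_crawl_capacity(150, 700):.2f}x the pages")
```

Under that assumption, a 10% speed-up gives roughly 11% more pages, and going from 150 ms to 700 ms would leave room for only about a fifth of them.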

I haven't been able to reduce this value of "Time Spent Downloading a page". I've improved the HTML code, and currently virtually all of the pages of my website score 90+ in Google PageSpeed Insights. But the value of "Time Spent Downloading a page" is still around 700 ms. It makes no sense.

I've contacted CloudFlare's staff, but they do not offer me any solution.

I've been searching for solutions in this forum (https://www.webmasterworld.com/google/4967189.htm), and I see that there are no feasible ones. I understand that there is a longer path, but the time increase is huge.

levo

4:27 pm on Feb 7, 2020 (gmt 0)

You can make your pages cacheable, or you can try first byte optimizers like Argo.

150ms to 700ms is huge and not normal. Did you recently switch to https? My average is 55ms (with Fastly).

guarriman3

8:25 am on Feb 8, 2020 (gmt 0)

You can make your pages cacheable, or you can try first byte optimizers like Argo.
150ms to 700ms is huge and not normal. Did you recently switch to https? My average is 55ms (with Fastly).

Thank you very much for your answer. Some facts:

My HTML code is cached. I use PHP, but the server creates cached versions of the HTML, which are served to the visitors.

I've just done some performance tests with the same HTML code on two different domains, the first behind Cloudflare and the second without Cloudflare, but on the same server. I've done the tests with ByteCheck+Pingdom+GTMetrix. Behind CF: the waiting time averages 350 ms. Without CF: the waiting time is just 70 ms.

I started to use Cloudflare after a DDoS attack on my server. I'm not sure if Cloudflare is the best option for me; I just need DDoS protection and, honestly, I do not know very well how to configure Cloudflare to optimize the performance. I checked Auto Minify for JS+CSS+HTML; I activated Brotli, Enhanced HTTP/2 Prioritization, and Mirage; I did not activate Rocket Loader; the Caching Level is Standard; the Browser Cache TTL is 1 year; the 'Always Online' option is activated.

I have SSL with Cloudflare (the SSL/TLS encryption mode is Full). Initially, I was also using https before using Cloudflare, and the SSL certificates are also installed in the origin server. I'm not sure if I must remove them from the origin server.

The Argo option you suggest sounds good. But it is a usage-based product with a cost for me, and I'm not sure if I can enhance the speed by using other options.

I'm aware that these technical issues may be out of the scope of this forum. But I'd like to be sure that, apart from improving my content quality, the incoming links, etc., I can improve my Crawl Budget by improving the value of "Time Spent Downloading a page" within the Google Search Console.

Any tip will be welcome. Thank you :-)

levo

1:06 pm on Feb 8, 2020 (gmt 0)

- Cloudflare doesn't cache pages by default. If your pages are cacheable - not customized per visitor - you can enable caching for dynamic pages. Be careful, Cloudflare could also cache private pages and serve them to other visitors. And purging is a pain with Cloudflare.

- Make sure your static files include etag or last modified headers, and your server/cloudflare responds 304 to conditional requests. You can also generate etags for your dynamic pages.

- If you have control over your server, check if you have keep-alive enabled. Cloudflare supports up to 900 seconds, but anything over 5 seconds requires advanced configs.
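
The ETag point in the list above can be sketched in Python as follows. This is a minimal illustration of answering a conditional request with 304, assuming an ETag derived from a hash of the page body; the function names are hypothetical and the sketch is not tied to any particular server or to Cloudflare's own behavior.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Build a quoted ETag from a hash of the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def _normalize(tag: str) -> str:
    """Strip whitespace and a weak-validator prefix (W/) for comparison."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET with an optional If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match is not None:
        candidates = [_normalize(t) for t in if_none_match.split(",")]
        if "*" in candidates or etag in candidates:
            return 304, headers, b""  # client copy is current: send no body
    return 200, headers, body


page = b"<html><body>Blue widget</body></html>"
first_status, first_headers, _ = respond(page, None)
second_status, _, second_body = respond(page, first_headers["ETag"])
print(first_status, second_status, len(second_body))
```

The first request gets the full page plus an ETag; a repeat request that sends the same tag back gets an empty 304, which is the saving being described.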

Your latency problems are probably due to connection delays between Cloudflare and your origin server. Caching or keeping the connection alive would mitigate the problem.

BTW, Googlebot crawl rate settings page is hidden but still alive, [google.com...]

guarriman3

10:50 am on Feb 9, 2020 (gmt 0)

Cloudflare doesn't cache pages by default. If your pages are cacheable - not customized per visitor - you can enable caching for dynamic pages. Be careful, Cloudflare could also cache private pages and serve them to other visitors. And purging is a pain with Cloudflare.

The information I provide is entirely public, with no private data, and not customized for each visitor. I've got the data stored in a MySQL DB, and I serve it through PHP. Some of the HTML is cached (with PHP-specific libraries), but I'm interested in knowing how to cache it with CloudFlare in order to help Googlebot save time and optimize crawl budget.

I've got the 'Pro Plan' in CloudFlare, and within the "Caching" section, I've got: Caching Level = Standard; Browser Cache TTL = 1 year; Always Online = On. Is it enough to cache pages? Should I do something else?

Make sure your static files include etag or last modified headers, and your server/cloudflare responds 304 to conditional requests. You can also generate etags for your dynamic pages.

Honestly, I've never explored the possibilities of ETag or the "meta http-equiv=last-modified" for the information I serve. The information I provide is very static, and it's updated every six months, so I guess this will be very useful not to waste resources. Would using "meta http-equiv=Expires" be useful as well?

If you have control over your server, check if you have keep-alive enabled. Cloudflare supports up to 900 seconds, but anything over 5 seconds requires advanced configs.

I've just updated the .htaccess file of my Apache origin server to include:

<IfModule mod_headers.c>
 Header set Connection keep-alive
</IfModule>

BTW, Googlebot crawl rate settings page is hidden but still alive

Thank you very much for this tip (and for the rest of the answer). It's very valuable! Should I reduce the crawling rate in order to save crawl budget?

michaeldhayes

4:26 pm on Feb 10, 2020 (gmt 0)

The thing is, I reached that number of pages (2M+) by multiplying the URLs generated from a database of just 50,000 items.

Immediate red flag: by spinning up 2 million pages based on combinations of the same content items, you open yourself up to thin content issues. I work on a directory and we face the same issues (i.e. company descriptions used throughout the site).

For each one of the items, I created several different URLs: photos of the item, opinions of the item, how to buy the item, etc. Ten URLs per item, giving 500,000 different URLs on the same domain. Some of the URLs have plenty of information (because there are opinions, photos, etc.), but 80% of them have little data, which turns them into pages with very little information.

I think you are answering your own question here, and you know which pages need to be culled.

I will add a little bit of my own opinion here, i.e. Every page you expose to search should:

Do your pages meet those criteria?

My suspicion is "photos of the item" pages don't, since there might not be any textual content on those pages (not to mention that people searching for product images end up on Google image search, and you can optimize for that by having images on the product pages themselves).

Opinions of the item? Maybe, "product name reviews" could very well be a decent target. However you need to balance that with the need to use that unique content on more valuable pages.

Use your own judgment, but I believe following those 3 criteria will lead you in the right direction.

Additionally, I created an AMP version for each URL, and two different languages for the information. So I have around 2,000,000 URLs in the same website. And, as far as I'm starting to understand, I've got a problem with the crawl rate of this website of such great size.

I wouldn't be concerned with AMP and hreflang versions of URLs because they should be getting effectively canonicalized to your original URLs. Google doesn't count these as separate, but simply versions that they'll serve to the appropriate visitors (i.e. AMP for mobile, hreflang for other languages). Assuming this implementation is done correctly, don't throw those URLs out.

levo

8:11 pm on Feb 10, 2020 (gmt 0)

There's no one-size-fits-all answer. Cloudflare has a 'Custom Caching via Cloudflare Page Rules' help page that explains the pros and cons.

guarriman3

9:12 pm on Feb 10, 2020 (gmt 0)

by spinning up 2 million pages based on combinations of the same content items, you open yourself up to thin content issues

Thank you very much for your answer. Yes, I suspect that I must cope with thin content issues, which --as far as I'm starting to understand-- are usually linked with the crawl budget.

My suspicion is "photos of the item" pages don't meet those criteria

You're completely right. In fact, the problems started with those pages, which contain just the photos and a few words. Until May 2019, the website had a pretty good SEO ranking, but in that month Google sent me an email warning that AdSense ads could not be displayed on those pages with photos. I removed the AdSense ads, but the website started to be penalized, with decreasing rankings.

Somehow, the penalization of AdSense triggered the penalization in the search engine.

The thing is, many visitors search "photos of name-of-the-item", and I return good information for it. They like browsing the photos I show. But I'm considering the following:
- remove the pages of the photos from the sitemaps
- add them to robots.txt
- just try to rank for the search "photos of name-of-the-item" by using H2 and rich snippets within the main URL of the item, trying to get users to visit the link with the photos.

Meanwhile, while I try to improve other parameters (richer content, better page speed), some questions:
- which tool do you recommend for culling other weak pages on the website if I don't trust my own judgment? Screaming Frog? The analysis of the Apache logs through Python?
- is it a good idea to cull specific pages, or is it better to work with a group of them?

Kendo

2:53 am on Feb 11, 2020 (gmt 0)

I wouldn't be concerned with AMP

I'm not going to bother even though I was alarmed that AMP is now recommended for mobile pages.

My pages are being delivered between 1.5 to 1.8 seconds on computers, but Google gives them a mobile rating of 2/10.

What is wrong with their mobile browser?

tangor

6:05 am on Feb 11, 2020 (gmt 0)

@michaeldhayes ... Welcome to Webmasterworld! Always love those who jump right in!

Odd thought ... check your logs for value v visitor and see what is actually going on. This is a primary insight, though for many it is a secondary thought if numbers from analytics are the driving force.

You need both.

michaeldhayes

7:56 pm on Feb 13, 2020 (gmt 0)

Somehow, the penalization of AdSense triggered the penalization in the search engine.

This is likely more correlation than causation. Googlebot sees the new pages, and they set off both AdSense warnings and ranking algorithm issues.

The thing is, many visitors search "photos of name-of-the-item", and I return good information for it. They like browsing the photos I show. But I'm considering the following:
- remove the pages of the photos from the sitemaps
- add them to robots.txt
- just try to rank for the search "photos of name-of-the-item" by using H2 and rich snippets within the main URL of the item, trying to get users to visit the link with the photos.

Ding ding ding, you have a winner with #3. Just move the photos to the main URL and make sure they have alt text (and also descriptive filenames if possible).
For the remaining URLs, you should either noindex via robots meta tag or completely remove and have them 404 or 410.

Removing URLs from the sitemap won't remove them from the site or the index. The sitemap is just a single means for discovering URLs, not a comprehensive list or a set of directives for how they should be treated.
(However, once URLs are gone, sure, remove them from the sitemap as part of standard housekeeping).

Meanwhile, while I try to improve other parameters (richer content, better page speed), some questions:
- which tool do you recommend for culling other weak pages on the website if I don't trust my own judgment? Screaming Frog? The analysis of the Apache logs through Python?

Screaming Frog SEO Spider is the easiest/cheapest option, and you can look at metrics like Word Count to get an idea of thin pages.
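
If you'd rather script a rough word-count check yourself, a minimal Python sketch could look like this. It uses only the standard library, counts visible words while skipping script and style contents, and flags pages below a threshold. The threshold of 150 words is a hypothetical value for illustration, and this is a stand-in for the idea, not how Screaming Frog computes its Word Count metric.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def word_count(html: str) -> int:
    """Count whitespace-separated words in the visible text of a page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())


THIN_THRESHOLD = 150  # hypothetical cut-off; tune it against known good pages


def is_thin(html: str, threshold: int = THIN_THRESHOLD) -> bool:
    """Flag a page whose visible word count falls below the threshold."""
    return word_count(html) < threshold


sample = (
    "<html><head><style>p{}</style></head><body>"
    "<h1>Blue widget</h1><p>Three photos.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(word_count(sample), is_thin(sample))
```

Word count alone won't catch near-duplicates, so treat it as a first filter for finding candidate groups of pages.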

Screaming Frog Log File Analyzer is amazing too, if you don't want to do your own analysis via python scripts.
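
For the do-it-yourself route, here is a minimal Python sketch of counting Googlebot requests per URL. It assumes the Apache "combined" log format and matches on the user-agent string only; user agents can be spoofed, so a real analysis would also verify the requesting IPs (for example via reverse DNS). The sample lines, paths and the `known_urls` set are invented for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Apache "combined" format: ip, identity, user, [time], "request", status, size,
# "referer", "user-agent".
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines: Iterable[str]) -> Counter:
    """Count requests per path whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return hits


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LINES = [
    f'66.249.66.1 - - [03/Feb/2020:15:04:00 +0000] "GET /item/blue-widget/ HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [03/Feb/2020:15:04:05 +0000] "GET /item/blue-widget/photos/ HTTP/1.1" 200 812 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.7 - - [03/Feb/2020:15:05:00 +0000] "GET /item/blue-widget/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    f'66.249.66.1 - - [04/Feb/2020:09:00:00 +0000] "GET /item/blue-widget/ HTTP/1.1" 304 - "-" "{GOOGLEBOT_UA}"',
]

hits = googlebot_hits(SAMPLE_LINES)

# URLs from the sitemap (hypothetical) that Googlebot never requested in the sample.
known_urls = {"/item/blue-widget/", "/item/blue-widget/photos/", "/item/red-widget/"}
uncrawled = sorted(known_urls - set(hits))

print(hits.most_common())
print(uncrawled)
```

With a real log you would pass an open file instead of `SAMPLE_LINES`, since the function accepts any iterable of lines.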

- is it a good idea to cull specific pages, or is it better to work with a group of them?

The most bang for your buck is to look at pages in bulk. In this case, it sounds like removing the photo URLs all at once is a good idea.

As you start to get down to more specific pages, it can make sense to make a judgment call on a per-URL basis (such as this product is too thin, not enough search volume, etc., let's throw it out). However, I'd start with the large portions first and see if that gets you where you want to go.

michaeldhayes

8:04 pm on Feb 13, 2020 (gmt 0)

@tangor Thanks buddy. I've been around for about a decade but just signed back up. Glad to see it's still alive and well!

Odd thought ... check your logs for value v visitor and see what is actually going on. This is a primary insight, though for many it is a secondary thought if numbers from analytics are the driving force.

Yup, log files are definitely an insight. I've found you generally circle the wagons around the same batch of pages whether you look at analytics or bot visits, but it can be good to corroborate.

Not sure what you mean by 'value v visitor', though, if you felt like elaborating.

michaeldhayes

8:06 pm on Feb 13, 2020 (gmt 0)

@Kendo

My pages are being delivered between 1.5 to 1.8 seconds on computers, but Google gives them a mobile rating of 2/10.

What is wrong with their mobile browser?

According to Search Console my site has 0 fast pages, and 70,000 slow ones. Whatever you say, Google. It's hard to take it seriously.

Kendo

11:01 pm on Feb 13, 2020 (gmt 0)

It's hard to take it seriously.

Especially when one gets the impression that they can be penalised for not adopting everything that they devise/propose as good design/tech.

If speed ratings are based on comparisons to using HTTP/2, then most of us will be in trouble, especially if running Windows Server prior to version 2016. And most of the compression and mobile support enhancement seems to be readily available for new Linux servers, but what about the rest of us?