Forum Moderators: Robert Charlton & goodroi
Is it worth inserting a "<meta name=robots content=noindex>" tag in each one of the pages with no visits?
It would hardly make a difference, since the pages aren't getting visited anyway. (I'm assuming you meant no human visitors. Search engines always visit sooner or later, at intervals of much less than six months.)
(...) but 80% of them have little data, which turns them into pages with very little information.
You may have it backwards.
The Googlebot crawl rate for a page partly depends on how much traffic Google search sends to that page.
If so, you certainly don't want to noindex the page, or you lose that one human who might otherwise become devoted to your site after discovering that you are the only person who could help them.
pages of which Google has no clue what they are about and for which keywords they should rank.
Crawl budget is not only about the number of pages a robot will visit in a given time frame, but also about the amount of time those visits consume.
You can make your pages cacheable, or you can try first byte optimizers like Argo.
150ms to 700ms is huge and not normal. Did you recently switch to https? My average is 55ms (with Fastly).
Cloudflare doesn't cache pages by default. If your pages are cacheable - not customized per visitor - you can enable caching for dynamic pages. Be careful, Cloudflare could also cache private pages and serve them to other visitors. And purging is a pain with Cloudflare.
Make sure your static files include etag or last modified headers, and your server/cloudflare responds 304 to conditional requests. You can also generate etags for your dynamic pages.
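As a rough illustration of the ETag advice above, here is a minimal Python sketch of generating an ETag for a dynamic page and answering a conditional request with 304. The function names `make_etag` and `respond` are made up for this example, and hashing the rendered body is only one possible way to derive the tag; a real site would wire this into whatever framework or server it already uses.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the rendered page body (hypothetical helper)."""
    return '"' + hashlib.sha256(body).hexdigest()[:32] + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET, honouring If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match is not None:
        # The header may carry several comma-separated tags.
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        # If-None-Match uses weak comparison, so drop any W/ prefix.
        candidates = [t[2:] if t.startswith("W/") else t for t in candidates]
        if "*" in candidates or etag in candidates:
            # Unchanged since the client's copy: send no body.
            return 304, headers, b""
    return 200, headers, body
```

The first request gets a 200 with an `ETag` header; when the crawler or browser sends that value back in `If-None-Match` and the page has not changed, the response is an empty 304, which saves transfer time. Note that in this sketch the page is still rendered before the tag is computed, so it saves bandwidth rather than server work.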
If you have control over your server, check whether you have keep-alive enabled. Cloudflare supports up to 900 seconds, but anything over 5 seconds requires advanced configs.
<IfModule mod_headers.c>
Header set Connection keep-alive
</IfModule>
BTW, the Googlebot crawl rate settings page is hidden but still alive.
The thing is that I ended up with this many pages (2M+) by multiplying the number of URLs, starting from a database of just 50,000 items.
For each one of the items, I created several different URLs: photos of the item, opinions of the item, how to buy the item, etc. Ten URLs per item, giving 500,000 different URLs for the same domain. Some of the URLs have plenty of information (because there are opinions, photos, etc.), but 80% of them have little data, which turns them into pages with very little information.
Additionally, I created an AMP version of each URL, and the information is available in two different languages. So I have around 2,000,000 URLs on the same website. And, as far as I'm starting to understand, a website of this size has given me a crawl rate problem.
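To make the multiplication explicit, the figures quoted above can be worked through in a few lines of Python. The only assumption added here is that the 80% "thin" share of the base URLs carries over unchanged to the AMP and language variants; the original post does not say so.

```python
items = 50_000          # items in the database
urls_per_item = 10      # photos, opinions, how to buy, etc.
base_urls = items * urls_per_item

amp_factor = 2          # each URL plus its AMP version
languages = 2           # two language versions of everything
total_urls = base_urls * amp_factor * languages

# Assumption: the 80% thin share applies equally to every variant.
thin_base_urls = base_urls * 80 // 100
thin_total_urls = thin_base_urls * amp_factor * languages

print(f"base URLs:  {base_urls:,}")
print(f"total URLs: {total_urls:,}")
print(f"thin URLs (under the assumption above): {thin_total_urls:,}")
```

So 50,000 items become 500,000 base URLs and 2,000,000 URLs in total, of which roughly 1,600,000 would be thin if the assumption holds.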
by spinning up 2 million pages based on combinations of the same content items, you open yourself up to thin content issues
My suspicion is "photos of the item" pages don't meet those criteria
Somehow, the penalization of AdSense triggered the penalization in the search engine.
The thing is that many visitors search for "photos of name-of-the-item", and I return good information for that. They like browsing the photos I show. But I'm considering:
- removing the photo pages from the sitemaps
- disallowing them in robots.txt
- just trying to rank for the search "photos of name-of-the-item" by using H2 and rich snippets within the main URL of the item, and getting users to follow the link to the photos from there.
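Before changing robots.txt on a site this size, it may help to test the rule offline. The sketch below uses Python's standard `urllib.robotparser` to check which URLs a rule would block. The `/photos/` path prefix and the `example.com` URLs are placeholders, not the poster's real URL structure.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: block everything under /photos/ for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /photos/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str, agent: str = "Googlebot") -> bool:
    """True if the rules above let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


for url in ("https://example.com/photos/item-1",
            "https://example.com/item-1"):
    print(url, "->", "crawlable" if is_crawlable(url) else "blocked")
```

Keep in mind that a robots.txt disallow stops crawling, not necessarily indexing, and that a crawler which cannot fetch a page also cannot see a noindex tag on it, so the two approaches do not combine well on the same URL.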
Meanwhile, while I try to improve other things (richer content, better page speed), some questions:
- which tool would you recommend for culling other weak pages on the website if I don't trust my own judgment? Screaming Frog? Analysing the Apache logs with Python?
- is it a good idea to cull specific pages, or is it better to work with groups of them?
Odd thought ... check your logs for value v visitor and see what is actually going on. This is a primary insight, though for many it is a secondary thought if numbers from analytics are the driving force.
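One way to act on the log suggestion is sketched below in Python, assuming the server writes Apache's combined log format. It counts hits per path separately for Googlebot and for everyone else, then lists the paths that were crawled but never visited by anyone else. The helper names are made up for this example, and matching "Googlebot" in the user-agent string is a simplification, since that string can be spoofed.

```python
import re
from collections import Counter

# Apache combined log format:
# ip ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_hits(lines):
    """Return (bot_hits, other_hits): per-path counters for Googlebot and the rest."""
    bot_hits, other_hits = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in combined format
        # Simplification: trust the user-agent string as-is.
        target = bot_hits if "Googlebot" in match["agent"] else other_hits
        target[match["path"]] += 1
    return bot_hits, other_hits


def crawled_but_unvisited(bot_hits, other_hits):
    """Paths Googlebot fetched that no other client requested."""
    return sorted(path for path in bot_hits if other_hits[path] == 0)
```

Usage would be along the lines of `bot, others = count_hits(open("access.log"))` followed by `crawled_but_unvisited(bot, others)`, where `access.log` stands for whatever your log file is called. Grouping the resulting paths by URL pattern (photos, opinions, and so on) would show whether whole page types are being crawled without bringing visitors, which speaks to the question of culling groups rather than single pages.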
It's hard to take it seriously.