Does Google cache websites or files?

wondering about googleusercontent hits, blocking them

SumGuy

1:56 pm on Sep 27, 2020 (gmt 0)

I haven't done a ton of research on this, but I have run across some old posts on stack exchange mentioning either a real or imaginary practice of google in caching web content in certain locales to improve access speed, and that hits from googleusercontent might be involved with that.

Whether that's true or not, it leads me to ask:

a) does google cache web content for _routine_ serving if it thinks it's doing the user a favor?

b) are there meta tags or commands, possibly in the hosts file, that I can use to absolutely turn off any and all google caching?

And whether or not googleusercontent is involved with caching, is there any downside at all in blocking all IPs assigned to, or reverse-resolving to, googleusercontent hosts?

phranque

11:32 pm on Sep 27, 2020 (gmt 0)

... a real or imaginary practice of google in caching web content in certain locales to improve access speed, ...

since you mention "googleusercontent", the "cache" you are referring to is likely the cached version of your content that google keeps and links to in the search results unless you have specified a <meta name="robots" content="noarchive"> element in the relevant document head.
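As a quick way to confirm that a page actually carries the noarchive directive mentioned above, here is a minimal sketch using only the Python standard library. The names `RobotsMetaParser` and `has_noarchive` are made up for this illustration, and it only inspects the HTML you hand it (it does not look at any HTTP headers).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> elements."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() == "robots":
            # content is a comma-separated list such as "noindex, noarchive".
            for item in attr.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def has_noarchive(html_text):
    """Return True if the document declares a robots noarchive directive."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noarchive" in parser.directives


page = '<html><head><meta name="robots" content="noarchive"></head></html>'
print(has_noarchive(page))
```

Feeding it the saved source of one of your own pages tells you whether the tag made it into the served document, which is worth checking if a template or plugin is supposed to add it.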

maintaining a cached version of your content in "certain locales" may be a reference to google's infrastructure which uses geographically diverse data centers.
this can "improve access speed" by reducing the latency of search requests.
you have no control over this.

there are other googleusercontent subdomains used for other purposes (such as translate), so be careful that you know what you are blocking.

martinibuster

5:22 pm on Sep 28, 2020 (gmt 0)

Incredibill used to say that some scrapers crawled cached pages in Google as a way to avoid getting banned by websites themselves.

There's documentation to support his contention, too.

[doc.scrapy.org...]

So if you're concerned about scrapers hitting your site via the Google cache that's linked from the SERPs you can use the noarchive meta tag.

SumGuy

1:33 pm on Sep 30, 2020 (gmt 0)

My concerns are:

1) Google serving my site from its cache (to humans) - I have no way to know it from my log files (presumably this would also happen to many other sites besides mine). Google serving my site files to other bots I have no concerns with.

2) Me blocking all access to large blocks of IPs associated with googleusercontent (based on past requests for garbage like wp-login and other obvious scraping or non-human browsing from those IPs) - is there a downside to that in terms of google's ranking of my site in its search results?

tangor

11:18 am on Oct 1, 2020 (gmt 0)

So if you're concerned about scrapers hitting your site via the Google cache that's linked from the SERPs you can use the noarchive meta tag.

Since incrediBill (and before) I have used this ... and guess what ...

It sometimes works.

Bots are bots ... even those we generally allow ... but cache is ... forever.

It's an honor system ... and lately I'm not sure all are honoring any declarations, or if one changes one's mind that the third party will DELETE the cached material.

YMMV

We live in interesting times!

SumGuy

2:15 pm on Oct 11, 2020 (gmt 0)

Regarding googleusercontent.com:

Google User Content CDN Used for Malware Hosting
[bleepingcomputer.com...]

That was 2 years ago. I've yet to come across anything indicating that these are google servers that presumably can be rented (like AWS) to run user-code vs just serving "content" (ie mostly images?) but unless I'm missing something I don't see how or why a "content delivery" server would be reaching out to request files on my site. And I'm seeing (and blocking) a ton of these requests on a daily basis. If one were to look for google's commercial offerings, such as renting servers, where would one find this info?

not2easy

2:44 pm on Oct 11, 2020 (gmt 0)

Maybe try searching on Google? Spammers come through here with links to "offers" and "downloads" showing google.drive and googleusercontent.com URLs frequently. Unfortunately people who are not very web savvy might think that URL makes it 'legit'.

This host https://cloud.google.com/solutions/web-hosting shows up for example.

SumGuy

3:14 pm on Oct 11, 2020 (gmt 0)

> Spammers come through here with links to "offers" and "downloads" showing google.drive and googleusercontent.com URLs frequently

Beyond just using it to store/host files, how does someone make use of googleusercontent servers to actively go and make requests to my website?

See also: [support.google.com...]

And from here: "https://www.reddit.com/r/SEO/comments/845242/my_website_had_45000_visits_from_a_state_i_dont/"

is this comment (2 years ago):

"all the visits are coming from googleusercontent.com, all from Detroit, East Lansing, Rockwood, and Dearborn, Michigan. They started hitting us in mid-January with about 4000 visits a week. Now it's 10x that. We are thinking about preventing people who hit our site 5x in less than a second, but that's all we got."

How does someone get granular info about hits from googleusercontent that tells them geo-information like that?

And see also: [searchengineland.com...]

I'm pretty sure that my progressive blocking of googleusercontent IP space comes because of me seeing hits to the ubiquitous wp-login.php and other non-existent files, not for what looks like attempts to cache my site files.

lucy24

4:21 pm on Oct 11, 2020 (gmt 0)

If I search raw logs for the string "webcache" I find requests for supporting files giving
webcache.googleusercontent.com
as referer. That is: all supporting files associated with a particular page.

Case A: the page itself was requested with plain google.com as referer, while the supporting files from the same IP-and-UA referenced

https://webcache.googleusercontent.com/

and-that's-all.
Case B: there doesn't seem to have been a page request at all. Supporting files give

https://webcache.googleusercontent.com/search?q=cache:mBTWPMwosrkJ:https://example.com/ebooks/title/+&cd=1&hl=en&ct=clnk&gl=us

as referer. (The obfuscated part is in fact the page the supporting files belong to.)
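For anyone wanting to pull the original page out of a Case B referer like the one above, here is a minimal Python sketch. The function name `cached_page_from_referer` is invented for the example, and it assumes the `q=cache:<id>:<url>` layout seen in that one referer; other layouts simply return None.

```python
from urllib.parse import urlsplit, parse_qs


def cached_page_from_referer(referer):
    """Return the page URL embedded in a webcache referer, or None."""
    parts = urlsplit(referer)
    if parts.hostname != "webcache.googleusercontent.com":
        return None
    # parse_qs decodes "+" to a space, hence the strip() further down.
    q_values = parse_qs(parts.query).get("q")
    if not q_values:
        return None  # Case A: bare referer with no query string
    pieces = q_values[0].split(":", 2)  # "cache", cache id, page URL
    if len(pieces) != 3 or pieces[0] != "cache":
        return None
    return pieces[2].strip()


ref = ("https://webcache.googleusercontent.com/search?q=cache:mBTWPMwosrkJ:"
       "https://example.com/ebooks/title/+&cd=1&hl=en&ct=clnk&gl=us")
print(cached_page_from_referer(ref))
```

Run over the referer field of each log line, this makes it easy to group supporting-file requests by the cached page they belong to.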

That looks like two entirely different patterns, but both from ordinary human IPs. The first of the two--the one that included a direct-to-human page request--was a fairly commonly visited page, although generally not from google. The other was more obscure. But why just these two?

SumGuy

4:49 pm on Oct 11, 2020 (gmt 0)

Do you see (and / or block) hits from googleusercontent IP space? (ie - the host IP rDNS is googleusercontent.com) ? Off the top of my head I believe a good chunk of 34/8 or 35/8 would be an example of that IP space.

lucy24

6:34 pm on Oct 11, 2020 (gmt 0)

Off the top of my head I believe a good chunk of 34/8 or 35/8 would be an example of that IP space.

I have some parts of both 34 and 35 marked as bad_range, meaning that it can be unset for certain distributed robots. But I don't think googleusercontent is from the specific ranges I block; at least I didn't see anyone getting a 403. The two requests I found in logs were both from ordinary human ISP ranges.

The overwhelming majority of requests involving the string "googleusercontent" are from translate.googleusercontent.com, which I don't block. (An obvious YMMV; the ones I see appear to be legitimate humans. That is, ahem, they may personally be right #*$!s, but I believe they're using the site for its intended purpose.)

Edit: Oops. I didn't realize that noun would be censored. Oh well.

SumGuy

9:58 pm on Oct 11, 2020 (gmt 0)

Looking at my http logs (ie anyone or anything hitting my site on port 80), looking only at log lines that contain "googleusercontent" anywhere on the line (which could mean the referrer or user-agent but turns out I'm only seeing this in the referrer), from 2015 to the present,

- about 450 log-lines in total
- in 152 of those lines the hit is from 66.249.x.y (ie - googlebot). x is usually 82 in this case.
- about 300 non-google log lines

If the requesting IP is not google, then:

- the referrer is always this: //webcache.googleusercontent.com/search
- only 2 hits to .html files, the rest are for .gif files
- about 75% of these hits are from 1 source - University of Colorado (2 years ago)

Looking back, there was someone at U-colorado that could not access our site, but we weren't blocking them.

When the requesting IP is google:

- as mentioned above the IP is 66.249.x.y
- the referrer is always //translate.googleusercontent.com/translate_p or _c
- the UA is a typical browser, never googlebot.
- the files requested are .html or .js or .pdf, never .gif
- the last time I see these hits is July 1 / 2019
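A breakdown like the one above can be reproduced with a short script. This is only a sketch: it assumes the logs are in Apache "combined" format (the post does not say which format is in use), the sample lines are made up for illustration, and 66.249.0.0/16 is used as the "google" range simply because that is how it is described above.

```python
import ipaddress
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed layout: Apache "combined" log format.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)
GOOGLE_RANGE = ipaddress.ip_network("66.249.0.0/16")


def source_label(ip_text):
    """Label a client address as inside or outside 66.249/16."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:  # the log may hold a hostname instead of an IP
        return "other"
    return "66.249/16" if ip in GOOGLE_RANGE else "other"


def tally_googleusercontent(lines):
    """Count googleusercontent referers by (source, referer host)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m or "googleusercontent" not in m["referer"]:
            continue
        host = urlsplit(m["referer"]).hostname or m["referer"]
        counts[(source_label(m["ip"]), host)] += 1
    return counts


# Made-up sample lines, not taken from any real log.
sample = [
    '203.0.113.7 - - [11/Oct/2020:10:00:00 +0000] "GET /img/a.gif HTTP/1.1" '
    '200 512 "https://webcache.googleusercontent.com/search?q=cache:x" "Mozilla/5.0"',
    '66.249.82.10 - - [01/Jul/2019:09:00:00 +0000] "GET /page.html HTTP/1.1" '
    '200 2048 "https://translate.googleusercontent.com/translate_c" "Mozilla/5.0"',
    '203.0.113.9 - - [11/Oct/2020:10:05:00 +0000] "GET / HTTP/1.1" '
    '200 100 "-" "Mozilla/5.0"',
]
for key, n in tally_googleusercontent(sample).items():
    print(key, n)
```

To run it on a real file, pass the open file object instead of `sample`; lines that do not match the assumed format are skipped rather than counted.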

Since I don't block 66.249/16, it would appear that I have nothing to worry about in terms of breaking google translate for any human, if indeed what I'm seeing is someone looking at my site through google translate. Why google chooses to associate their translate function with the googleusercontent domain is not clear to me.

So far today (approx 6 pm) I've rejected port-80 hits from 87 unique IPs. Only 4 are googleusercontent:

34.75.173.223
35.187.23.223
35.231.48.113
35.232.145.68

As opposed to 11 from Amazon (9 of which are coincidentally also from 34/8 and 35/8). Digital Ocean, Hetzner, Linode, OVH SAS, and China round out the top rejects.
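Since Amazon addresses also turn up in 34/8 and 35/8, membership in those blocks alone does not identify a googleusercontent host; the reverse-DNS name is the more specific test. Here is a rough Python sketch of both checks. The helper names are invented for the example, and `reverse_name` needs working DNS, so it is defined but not called here.

```python
import ipaddress
import socket

# The two broad blocks mentioned in the thread; they are shared with other
# providers, so a match here is only a first-pass filter.
BROAD_RANGES = [
    ipaddress.ip_network("34.0.0.0/8"),
    ipaddress.ip_network("35.0.0.0/8"),
]


def in_broad_ranges(ip_text):
    """True if the address falls inside 34/8 or 35/8."""
    ip = ipaddress.ip_address(ip_text)
    return any(ip in net for net in BROAD_RANGES)


def is_googleusercontent_host(hostname):
    """True if a reverse-DNS name is googleusercontent.com or a subdomain."""
    host = hostname.rstrip(".").lower()
    return host == "googleusercontent.com" or host.endswith(".googleusercontent.com")


def reverse_name(ip_text):
    """Return the rDNS name for an address, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip_text)[0]
    except OSError:
        return None


rejected = ["34.75.173.223", "35.187.23.223", "35.231.48.113", "35.232.145.68"]
for ip in rejected:
    print(ip, in_broad_ranges(ip))
```

A blocking rule would then combine the two, for example `name = reverse_name(ip)` followed by `name is not None and is_googleusercontent_host(name)`, rather than rejecting the whole /8.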