Forum Moderators: phranque

Anyone getting reports of existing pages not found by Google?

Kendo

8:14 pm on Jun 5, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Anyone else getting ridiculous reports about pages not being found by Google?

It has been going on for some time now. Every now and then Google spews out a list of pages that it cannot find on my site. I manage about 30 websites spread across 2 servers. Even though they are separate websites, each with a unique domain name, Google is looking for pages that do not exist on them. The most obvious example is /wp-admin/, which exists on only 1 of the 30 sites.

Now I do recall that a little while back Google started treating all aliases of a domain as the same site, which may seem clever to a complete idiot, but only one of my domains is using aliases.

I have been pondering how such a thing can happen and wondering if someone has scraped one site and somehow submitted it as all of my other sites. But that would be impossible, right? Or it could be that getting indexed is a lot easier than everyone thinks... if people are typing bad links into their address bar and Google is spidering those pages as if they exist, then we have a huge problem, and one that can damage everyone's site reputation. In the past I have found my test pages by searching for their content... pages not linked from anywhere on the web and only visited once or twice by members of my team.

Other common errors are about canonical links... http vs https. All sites are https and submitted as https. In fact most redirect to https. So why would Google be spidering http pages and then complaining about canonical mismatches?

Or are these just more examples of Google's "superior" but broken technology?

[edited by: not2easy at 2:59 pm (utc) on Jun 6, 2024]
[edit reason] split thread cleanup [/edit]

not2easy

3:40 pm on Jun 6, 2024 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



most redirect to https
If you do not have canonical 301 rewrite rules in place to ensure that only one version of each URL can be accessed, you will have some significant indexing problems.

Google sees these:
http://example.com
https://example.com
http://www.example.com
https://www.example.com
as four separate sites with identical content.
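
A rough way to test this is to list the four variants and note what each one returns. The sketch below is only an illustration: `domain_variants` and `non_canonical` are made-up helper names, and the status/Location pairs are assumed to come from a manual check (for example the response headers for each URL), not from anything the script fetches itself.

```python
def domain_variants(host):
    """Return the four scheme/www combinations for a host name."""
    bare = host[4:] if host.startswith("www.") else host
    return [
        f"{scheme}://{prefix}{bare}/"
        for scheme in ("http", "https")
        for prefix in ("", "www.")
    ]


def non_canonical(observed, canonical):
    """Return the URLs that are not handled the way a canonical setup would.

    observed maps each URL to a (status, location) pair recorded by hand;
    location is None when the response had no Location header.
    The canonical URL should answer 200; every other variant should
    answer 301 with Location pointing straight at the canonical URL.
    """
    problems = []
    for url, (status, location) in observed.items():
        if url == canonical:
            if status != 200:
                problems.append(url)
        elif status != 301 or location != canonical:
            problems.append(url)
    return problems
```

Anything returned by `non_canonical` is a variant that can still be reached directly or that redirects somewhere other than the one canonical URL.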

Do some testing, look at your logs and you'll know whether this is a problem for you.
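
For the log side, something like the following sketch could pull out the 404s that were requested under a Googlebot user agent. It assumes Apache-style combined log format (IIS logs are laid out differently), and `googlebot_404s` is a hypothetical helper name. Matching on the user-agent string alone can be fooled by anything that claims to be Googlebot, so treat the counts as a starting point only.

```python
import re
from collections import Counter

# Combined Log Format (assumed):
# ip ident user [time] "METHOD path protocol" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_404s(lines):
    """Count 404 paths requested with a user agent that mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        if match["status"] == "404" and "Googlebot" in match["agent"]:
            counts[match["path"]] += 1
    return counts
```

Feeding it the lines of an access log, e.g. `googlebot_404s(open("access.log"))` with whatever your log file is actually called, gives a count per missing path.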

Kendo

12:55 am on Jun 7, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



On the site in question I use https://example.com only... all links point to that and all pages declare that as their canonical, and have for years. There is no reason to index anything else. In fact if they did remove those links from their index like they have been claiming every month, then it would not be an ongoing annoyance.

My main concern is why are they looking for so many pages that never existed?

The site in question is 26 years old and it has never ever used WordPress yet I am getting reports about WP pages missing! Why would it think those pages exist? How did they get indexed in the first place?

not2easy

3:56 pm on Jun 7, 2024 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



So why would Google be spidering http pages and then complaining about canonical mismatches?

It is good to declare the canonical in your pages, but the canonical that matters for those http://example.com pages is the domain canonical in your .htaccess file, which ensures that the other versions of the domain cannot be accessed. With that redirect in place, Google would not be able to crawl the http://example.com pages.

See the Apache forum: [webmasterworld.com...] to learn more.

Kendo

5:45 am on Jun 8, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



the canonical that matters for those http://example.com pages is the domain canonical in your .htaccess file
On Windows server we use web.config to get the same results.

But my point is that there are no http links possible on the site in question. Google reports have been spewing out nonsense. Similar nonsense has been Gmail bouncing mail due to a lack of SPF records, yet they do exist. It delivers mail without hiccups and then every now and then starts bouncing it.

Where does it get non-existent links from?

lucy24

5:03 pm on Jun 8, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Although G### isn't the worst offender, legitimate robots will sometimes request http://example.com/page-that-never-existed-as-http. (In this respect I prefer the behavior of Russian search engines: once they see that a site has gone https, they will never again request anything but the root / as http.)

Kendo

9:58 pm on Jun 9, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I can understand why bots might be looking for "wp-admin", but why would Google be complaining about 404s when it cannot find them... sending a report about those pages no longer being indexed? How did they ever get indexed in the first place when WordPress has never been used on that site, not in the 26 years that it has existed?

I have always suspected, and often found it to be true, that while everyone believes it is difficult to get pages listed and that deliberate effort is required, Google will index absolutely every page that gets requested by their web browser. That means that a lot of bot traffic probing for avenues of exploit is getting added to our indexes... how is that good for SEO and reputation?

lucy24

6:11 am on Jun 10, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I recently stopped by GSC for some reason which now escapes me. While there, I found a flurry of reported 404s, which on closer inspection turned out to be legitimate URLs with /1000 (really) appended to them.

Google, I am not responsible for your computer’s hiccups, and it is not my problem if you’re following other people’s bad links.

I ignored the whole thing.

explorador

11:05 pm on Jun 18, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yesterday, one of my sites that hasn't changed in years got the report.

Supposedly, the page "this_website_is_really_awesome.html" does not exist, and that's true.

The page is actually .htm, not .html.

lucy24

3:59 pm on Jun 19, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This may be the same phenomenon as requesting "directory/subdir" when the URL is really "directory/subdir/"--except that the latter is handled by mod_dir so I don't need to deal with it--which G### does constantly. Sorry, G, but I am not going to change my URLs just because you enjoy requesting extensionless. (Especially when they are real, physical directories.)

If they make a habit of it, I guess you could put in a redirect. But then you worry that by acknowledging its existence, you’re “owning” the bad URL.
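
For sorting through a pile of reported 404s, a small sketch like this one could flag the ones that look like mangled versions of real URLs (missing trailing slash, .html for .htm, something like /1000 tacked on) as opposed to pure probes such as /wp-admin/. The function name `near_miss` and the three rules are only guesses at the patterns mentioned in this thread, not anything Google documents.

```python
def near_miss(requested, known_paths):
    """Guess which real path a 404 URL was probably meant to reach.

    Returns (real_path, reason), or None when nothing obvious matches.
    """
    known = set(known_paths)

    # Directory requested without its trailing slash.
    if requested + "/" in known:
        return requested + "/", "missing trailing slash"

    # .html requested where the file is .htm, or the reverse.
    if requested.endswith(".html") and requested[:-1] in known:
        return requested[:-1], "extension .html vs .htm"
    if requested.endswith(".htm") and requested + "l" in known:
        return requested + "l", "extension .htm vs .html"

    # Extra segment appended to a real URL, such as /1000.
    parent = requested.rstrip("/").rsplit("/", 1)[0]
    if parent:  # an empty parent would match the site root for every URL
        for candidate in (parent, parent + "/"):
            if candidate in known:
                return candidate, "extra trailing segment"

    return None
```

Whether a near miss deserves a redirect is a separate question; this only helps tell bad links to real pages apart from requests for pages that never existed in any form.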