Issues Encountered before and after Switch to HTTPS

Forum Moderators: phranque

guggi2000

10:02 am on Mar 20, 2017 (gmt 0)

System: The following 44 messages were cut out of thread at: https://www.webmasterworld.com/webmaster/4836604.htm [webmasterworld.com] by phranque - 10:34 pm on Mar 21, 2017 (utc -7)

Do you think this will trigger a yellow warning in some browsers after switching to HTTPS?

<META property="og:image" content="http://www.example.com/images/logo.png" />
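For context: as far as I know, browsers only flag mixed content for resources the page actually loads (images, scripts, stylesheets, frames), and the value of an og:image meta tag is read by social-media scrapers rather than fetched by the browser. A rough way to tell the two cases apart is sketched below. The class and function names are made up for illustration, and the list of "loading" attributes is deliberately partial.

```python
from html.parser import HTMLParser

# Tag -> attribute that makes a browser fetch a subresource.
# Partial list, for illustration only.
LOADING_ATTRS = {"img": "src", "script": "src", "iframe": "src"}


class InsecureReferenceFinder(HTMLParser):
    """Collects http:// URLs, split into loaded subresources and metadata."""

    def __init__(self):
        super().__init__()
        self.loaded = []    # candidates for mixed-content warnings
        self.metadata = []  # e.g. og:image; not fetched by the browser

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser lowercases tag and attribute names
        attr = LOADING_ATTRS.get(tag)
        if attr and (attrs.get(attr) or "").startswith("http://"):
            self.loaded.append(attrs[attr])
        # Stylesheets are loaded; other <link> types (canonical etc.) are not.
        if tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            if (attrs.get("href") or "").startswith("http://"):
                self.loaded.append(attrs["href"])
        if tag == "meta" and (attrs.get("content") or "").startswith("http://"):
            self.metadata.append(attrs["content"])


def find_insecure_references(html):
    finder = InsecureReferenceFinder()
    finder.feed(html)
    return finder.loaded, finder.metadata
```

Anything in the first list is worth fixing before the switch. The second list is worth updating for consistency, but it should not be what triggers a browser warning.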

guggi2000

7:47 am on Mar 21, 2017 (gmt 0)

This has nothing to do with turning off the 301. We turned on https, then did 301 and then added the GSC.

As mentioned, the error refers to a date prior to the property creation date and prior to SSL.

We turned off 301 just 5 hours ago because we are not sure what the error is. Simply leaving 301 on to a site that cannot be indexed due to an unreadable robots.txt is a very dangerous thing.

robzilla

8:19 am on Mar 21, 2017 (gmt 0)

Simply leaving 301 on to a site that cannot be indexed due to an unreadable robots.txt is a very dangerous thing.

Assuming something is wrong based solely on (out of date) Search Console data is perhaps more dangerous. And since when do you need robots.txt to get a site indexed? Like a sitemap and a Search Console account, it is optional. You've already given the explanation when you said you "blocked all 443 traffic before going live with the SSL." Apparently Google tried to access robots.txt via HTTPS during that time; that is, before you even had the Search Console account for the HTTPS site (which doesn't mean they don't have any data yet). Once they notice the redirects, they'll refetch your robots.txt file. You don't want to be going back and forth with 301 ("Permanent") redirects.

guggi2000

8:45 am on Mar 21, 2017 (gmt 0)

Assuming something is wrong based solely on out of date Search Console data

Maybe, but we don't know that. The question is whether GSC is just showing old data and Google knows that in the background the robots.txt is actually OK. I am sure that there is a lag, but it should update within 24 hours. I think GSC reflects the real state with a small delay.

And since when do you need robots.txt to get a site indexed

If a robots.txt request is not giving 200 (OK) or 404 (not exists) Google will not crawl your site.

Once they notice the redirects, they'll refetch your robots.txt file

In the robots.txt Tester in GSC you can update the file and ask for re-submission. I believe that redirecting to a site where crawling is blocked is deadly.

Based on previous stats 3% of our site is crawled every day. So 12 hours of redirect is about 1%-2% that was pointed to the https. We decided to take off the redirect and wait until robots.txt shows green or at least we see 1 page indexed in the new property.

Do you think we should redirect back to http or just leave as is for a few days and restart 301 to https when we see the robots.txt issue resolved?

robzilla

9:37 am on Mar 21, 2017 (gmt 0)

If a robots.txt request is not giving 200 (OK) or 404 (not exists) Google will not crawl your site.

Do you have a source for that? As far as I'm aware, the absence of a robots.txt file (regardless of status code) suggests to robots that there are no crawling limitations.

Maybe, but we don't know that

Have you checked your access logs for subsequent robots.txt requests? They will include the status code.
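A minimal sketch of that log check, assuming the server writes the common/combined Apache log format. The regular expression, the function name and the sample line in the usage note are illustrative, not taken from anyone's real logs.

```python
import re

# Matches the start of a common/combined log format line:
# ip ident user [time] "METHOD path PROTOCOL" status ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def robots_txt_requests(lines):
    """Return (time, ip, status) for every request to /robots.txt."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        # Ignore any query string when comparing the path.
        if m and m.group("path").split("?", 1)[0] == "/robots.txt":
            hits.append((m.group("time"), m.group("ip"), int(m.group("status"))))
    return hits
```

Feeding it a line such as `66.249.66.1 - - [21/Mar/2017:09:40:12 +0000] "GET /robots.txt HTTP/1.1" 200 124 "-" "..."` yields the timestamp, the client IP and the status 200, which is what you would compare against the date shown in Search Console.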

I would not redirect back to HTTP. I would make sure my implementation is correct, then redirect to HTTPS because I don't think there's a robots.txt issue. But if it helps you sleep at night, sure, wait a few days :-)

guggi2000

10:01 am on Mar 21, 2017 (gmt 0)

Robzilla, sleep is very important for your health :-)

access logs for subsequent robots.txt

Actually, that is a very good idea. We will use "robots.txt Tester", then press "Ask Google to update", "Submit" and then check if the GoogleBot arrives... The success message they provide is great but the promised timestamp update within that message does not work.

keyplyr

10:30 am on Mar 21, 2017 (gmt 0)

That was suggested 16 posts ago. Glad to see we're making progress :)

guggi2000

12:11 pm on Mar 21, 2017 (gmt 0)

That was suggested 16 posts ago

@keyplyr Funny, but no, it was not. Please read again carefully.

@robzilla I owe you an answer regarding the source of information. Here is an extract from "Crawl Errors report (websites)" on Google's support site:

If your robots.txt file exists but is unreachable (in other words, if it doesn't return a 200 or 404 HTTP status code), we'll postpone our crawl rather than risk crawling URLs that you do not want crawled.
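Read literally, that extract describes a three-way decision. The sketch below only restates the quoted sentence as code; it is not Google's implementation, and the function name and return strings are invented for illustration.

```python
def crawl_decision(robots_status):
    """Restate the quoted rule. robots_status is the HTTP status of the
    robots.txt fetch, or None if no response was received (e.g. a time-out)."""
    if robots_status == 200:
        return "crawl, obeying the rules in robots.txt"
    if robots_status == 404:
        return "crawl, no restrictions (file absent)"
    # Anything else counts as "unreachable" in the quoted wording.
    return "postpone crawl (file unreachable)"
```

Under this reading, a blocked port 443 (no response at all) falls into the last branch, the same as a 5xx error.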

lucy24

4:36 pm on Mar 21, 2017 (gmt 0)

As far as I'm aware, the absence of a robots.txt file (regardless of status code) suggests to robots that there are no crawling limitations.

If it doesn't return a 404 (or, theoretically, a 410), how would Google know it's absent? That's the point of distinguishing between "absent" and "unreachable".

:: idly wondering what would happen if you didn't have a robots.txt exemption (for example <Files> in Apache) and hence returned a 403 to requests from some quarters ::

guggi2000

7:35 pm on Mar 21, 2017 (gmt 0)

The dashboard of GSC and some other stats are not updated daily, so that error messages may stay for a while. We are also seeing that Google tried to access the robots.txt through https months ago. This info now appears in GSC as previous failed attempts...

guggi2000

9:36 pm on Mar 21, 2017 (gmt 0)

Our goal: to keep the https version as a 1-to-1 copy of the http.

Example:
A few years ago, we submitted a page (let's call it A) in the sitemap but Google chose to index another similar page instead (let's call it B). This happened for >100 A-B page pairs. We always said that if it is not broken we shouldn't fix it and let Google decide.

Now we have a dilemma: should we add sitemaps and risk "confusing" Google's preferred choice, or should we leave it without sitemaps and do the push through 301 only?

Any thoughts? Has anyone experience with moving to https without sitemaps?

lucy24

9:41 pm on Mar 21, 2017 (gmt 0)

We are also seeing that Google tried to access the robots.txt through https months ago. This info now appears in GSC as previous failed attempts...

That makes sense now that you say it. If a site isn't publicly linked as https, the only way for Google to know an https version exists is to try it. And, since they've never visited before (this is assuming a different protocol is equivalent to a different hostname, just like with-and-without www), obviously they have to start by requesting robots.txt. And then this information is stored for future reference. They don't know that your https didn't exist until five minutes ago; they only know that at some time in the past they tried https://example.com/robots.txt and got some type of error. If the server wasn't listening on the appropriate port at all, there would be no numerical response logged at your end; Google would just record a time-out error.

guggi2000

9:52 pm on Mar 21, 2017 (gmt 0)

@lucy24 Correct. And this is the reason we turned off the 443 port once we installed the SSL certificate for testing. We were afraid Google would check the https site, pick it up and start indexing before we were ready. And that's what happened.

robzilla

11:34 pm on Mar 21, 2017 (gmt 0)

Thanks for the reference [support.google.com], guggi2000, I wasn't aware of that.

If it doesn't return a 404 (or, theoretically, a 410), how would Google know it's absent? That's the point of distinguishing between "absent" and "unreachable".

But "unreachable" doesn't mean it exists, either. But apparently they'll assume it exists because, as you say, it doesn't return 404 or 410.

So... what's this thread about again? :-)

lucy24

12:35 am on Mar 22, 2017 (gmt 0)

Hypothesis: If you create a GSC account for your brand-new https site, adding it to (or changing from) an existing http site, it will immediately list a robots.txt error dating from some time in the recent past. So it's just another of the GSC/WMT non-errors that you can safely ignore 99% of the time.

Edit: I don't see any need for a sitemap. Once Google learns that the site is accessible by https--you could give them a nudge by requesting a few pages at GSC--they will run wild and re-crawl everything, same as they do with a new site or new directory. People who have made the change will probably see the flurry of search-engine activity in their logs. Or maybe they already have. I seem to have missed about five pages of this thread.

guggi2000

7:43 am on Mar 22, 2017 (gmt 0)

@Lucy24 I was also told by another source to ignore "GSC/WMT non-errors" and to focus on the URLs indexed under the sitemap tab. It's basically the only way to know with minimal delay and it's the only reason I have for a sitemap.

Do you think a partial sitemap can hurt? Just to check if Google can grab a few pages and let Google do the rest?

In a way I feel that if we give a full sitemap, Google will index the new URLs quicker, but that does not mean that the link juice from the old URLs is forwarded. If we do not give sitemaps, the URLs will be found more slowly and the 301 itself may trigger the indexing, not merely the sitemaps.

Does that make sense?

guggi2000

8:14 am on Mar 22, 2017 (gmt 0)

Change of 301 Rule:
We changed the 301 rewrite rule to allow access to robots.txt on the old HTTP and all the rest goes to HTTPS


RewriteCond %{HTTPS} !=on
RewriteRule !^robots\.txt$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

In your opinion is this correct?

We want to add also "sitemaps.xml" to stay on the http. Any idea how to achieve that?

phranque

9:41 am on Mar 22, 2017 (gmt 0)

In your opinion is this correct?

not enough information to answer that:
what did the ruleset look like before?
do you accept requests for non-canonical hostnames (IP address or www vs non-www)?
if so what does the canonical hostname look like?

do you have other RewriteRules?

We want to add also "sitemaps.xml" to stay on the http. Any idea how to achieve that?

!^(robots\.txt|sitemap\.xml)$

guggi2000

10:15 am on Mar 22, 2017 (gmt 0)

@phranque Thanks for the sitemap addition

The rule set before was standard, basically forwarding all requests...

RewriteCond %{HTTPS} !=on
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

And yes, we do have other rules that forward non-www to www, etc...
We tested with the sitemap addition and it seems to work. I think it is important not to forward robots.txt
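For anyone who wants to sanity-check the exemption pattern outside Apache, here is a small Python simulation of the rule. It assumes .htaccess (per-directory) context, where the leading slash is stripped before the pattern is matched; the function name is made up, and this is only a model of the rule, not a substitute for testing on the server.

```python
import re

# Same pattern as in the RewriteRule, without the leading "!" negation.
EXEMPT = re.compile(r"^(robots\.txt|sitemap\.xml)$")


def redirect_target(https_on, host, request_uri):
    """Return the 301 target for a request, or None if it is served as-is."""
    if https_on:
        return None  # RewriteCond %{HTTPS} !=on fails, so no redirect
    # Per-directory context: the pattern sees the path without the leading "/".
    path = request_uri[1:] if request_uri.startswith("/") else request_uri
    path = path.split("?", 1)[0]  # the pattern never sees the query string
    if EXEMPT.search(path):
        return None  # "!" negates the match, so these two files stay on HTTP
    return f"https://{host}{request_uri}"
```

Note that the anchors make the exemption exact: /robots.txt at the root stays on HTTP, while something like /dir/robots.txt is still redirected.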

robzilla

10:48 am on Mar 22, 2017 (gmt 0)

I think it is important not to forward robots.txt

Why? I forward everything. Never been an issue. I feel it makes more sense to drop HTTP entirely than to awkwardly combine the two.

Hypothesis: If you create a GSC account for your brand-new https site, adding it to (or changing from) an existing http site, it will immediately list a robots.txt error dating from some time in the recent past.

When you add a new site to the Search Console, one of the few data points directly available is the last robots.txt fetch (if there is one).

guggi2000

11:11 am on Mar 22, 2017 (gmt 0)

I think it is important not to forward robots.txt ... Why? I forwarded everything

A. Because sometimes they are not the same file
B. Make sure GSC does not drop http robots.txt because it is on another property (or domain). They should be smarter than that, but who knows.

guggi2000

11:12 am on Mar 22, 2017 (gmt 0)

By the way, why do I need to add the non-www as properties in GSC? I know Google recommends it, but why? It is, and has been, 301ed to the www for years...

Thanks

phranque

12:49 pm on Mar 22, 2017 (gmt 0)

By the way, why do I need to add the non-www as properties in GSC? I know Google recommends it, but why? It is, and has been, 301ed to the www for years...

- you want to make sure google isn't indexing paths on the non-www property.
- you want to make sure google isn't reporting errors on the non-www property that would prevent sending a proper response to a googlebot request.

phranque

12:59 pm on Mar 22, 2017 (gmt 0)

we do have other rules that forward non-www to www

what does your non-www redirect look like?
typically you can combine the protocol canonicalization and hostname canonicalization rewrites in one ruleset.
also you typically want to specify the canonical hostname on the substitution string of the RewriteRule to prevent unnecessary chained redirects.

something like this:

RewriteCond %{HTTPS} !=on [OR]
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
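The logic of that ruleset can be modelled in a few lines of Python to see why it avoids chained redirects. This is a sketch under the same assumptions as the example (canonical host www.example.com, per-directory context so the captured path has no leading slash); the function name is illustrative.

```python
import re

# Same pattern as the second RewriteCond, without the "!" negation.
# [NC] corresponds to case-insensitive matching.
CANONICAL_HOST = re.compile(r"^(www\.example\.com)?$", re.IGNORECASE)


def canonical_redirect(https_on, host, path):
    """Return the single 301 target, or None if the request is already canonical."""
    needs_https = not https_on                         # RewriteCond %{HTTPS} !=on [OR]
    wrong_host = CANONICAL_HOST.search(host) is None   # RewriteCond %{HTTP_HOST} !^(...)?$
    if needs_https or wrong_host:
        # Protocol and hostname are both fixed in one hop.
        return "https://www.example.com/" + path
    return None
```

A request for http://example.com/page.html goes straight to https://www.example.com/page.html in one redirect, instead of hopping through an intermediate URL first. An empty Host header also passes the hostname check because the group in the pattern is optional.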

[edited by: phranque at 4:21 am (utc) on Apr 1, 2017]

guggi2000

1:43 pm on Mar 22, 2017 (gmt 0)

@phranque Yes, it's a good idea to combine the condition. We haven't done it yet. In general the goal is to make the htaccess as short as possible for better performance.

Another question that I have raised earlier:
In the context of an https move:
a. Is there a risk in not providing a sitemap if the entire site's structure is good?
b. Is there a risk in providing a partial or full sitemap? Google may have decided to drop certain pages over the years (for good reasons), and we would be suggesting something that may confuse the search engine.

adamxcl

2:56 am on Mar 25, 2017 (gmt 0)

I just did the change to https early this month. I held off about as long as I could but I started to get the occasional email asking/complaining that my site wasn't secure. I tried to blow them off for a while but then an update to a service that I was using (that also partnered with Apple Pay) required https for the best presentation. It looked great on https sites but it looked like crap on http. So I decided, wtf. It was a stressful few days but the consoles in Firefox and Chrome became good friends. Each pointed out slightly different mixed content errors. So I kept plugging away for days until I got them both to say I was perfect.

Usually my season goes up as the year goes on. I really don't know the impact for certain but I will state a surprising fact. In the first 10 full days of https in March, I set a month's record for user signups. My record month is always the same later in the year. So I not only hit it 4-5 months early, I shattered my record in 10 days. So right now, I am thinking it gave more confidence to people providing name and email. Most other sites in my space are not https so I am standing out a bit for now.

Google immediately crawled a ton the first day, not much the second and then hammered me again on the third day. The charts in the Search Console showed a big decline in the old and a steep incline for https. 3 weeks in, things are still decent, going up and I am seeing some higher rankings than I have in years. That's great but I am not counting on it yet. Wait and see. But my previously great Bing rankings seem to have taken a dive. Like from page 2 to page 9 on some KWs. Like I said, it's early so who knows how things will settle out.

phranque

8:38 am on Mar 25, 2017 (gmt 0)

Is there a risk in not providing a sitemap if the entire site's structure is good?

you won't have access to the GSC tools to Manage sitemaps with the Sitemaps report:
https://support.google.com/webmasters/answer/183669 [support.google.com]

i've used a sitemaps index file [support.google.com] to submit several hundred thousand urls and the reports can be helpful.

guggi2000

5:37 am on Mar 31, 2017 (gmt 0)

System: The following 8 messages were spliced on to this thread from: https://www.webmasterworld.com/webmaster/4836604.htm [webmasterworld.com] by phranque - 4:04 am on Mar 31, 2017 (utc -7)

Is it normal to have Google crawl every page of our site once or twice per day? Let's say Total pages = 100,000... Pages crawled per day = 200,000 ?

[edited by: phranque at 11:07 am (utc) on Mar 31, 2017]

keyplyr

5:47 am on Mar 31, 2017 (gmt 0)

guggi2000 - that sounds abnormal. Are you sure it's really Googlebot? It's the most faked user agent ever.

Your stats report may say it's Googlebot but unless it comes from the Googlebot crawl IP range it is not authentic. Only way to be sure is to manually look at your raw server logs.

Googlebot will only crawl from this range:
66.249.64.0 - 66.249.95.255
66.249.64.0/19
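A quick way to test log entries against the range quoted above, using Python's standard ipaddress module. The range is simply the one given in this post and may not be complete or current; Google also documents a reverse-DNS check for verifying Googlebot, which this sketch does not perform. The names are illustrative.

```python
import ipaddress

# The range as quoted in this post; treat it as an assumption, not a full list.
QUOTED_RANGE = ipaddress.ip_network("66.249.64.0/19")


def in_quoted_googlebot_range(ip):
    """True if the address falls inside 66.249.64.0 - 66.249.95.255."""
    return ipaddress.ip_address(ip) in QUOTED_RANGE
```

Any log line claiming to be Googlebot whose client address fails this check deserves a closer look before it is counted as real crawl activity.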

More discussions about User Agents can be found here:
[webmasterworld.com...]

guggi2000

6:26 am on Mar 31, 2017 (gmt 0)

@keyplyr the stats are from Google Search Console.

keyplyr

6:50 am on Mar 31, 2017 (gmt 0)

You sure that number is specific to pages? More likely all files. Either that or you may have some redundant forwarding going on.
