apple-app-site-association

         

lucy24

6:27 pm on May 16, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I recognize the name, but can't find a discussion (and where but here could I have read it?):

After switching my personal site to https, I find recurring Googlebot requests for this trio of files:

/apple-app-site-association
/.well-known/apple-app-site-association
/.well-known/assetlinks.json

Well-known it may be to them, but I'm stumped.

The HTTPS may be a red herring and they'd do the same for any new site. I find a flurry of requests for /apple-app-site-association (only) in September-October 2015 across all sites. Or, at least, all sites known to GSC.

keyplyr

9:33 pm on May 16, 2017 (gmt 0)

Are you sure it is Googlebot? Or is the IP outside Googlebot's crawl ranges, for instance a Google proxy?

I get requests for those files as well, but not from Googlebot. These are usually fishing expeditions.

".well-known" is a file that many shared hosts (ah em) put in accounts to facilitate the switch to HTTPS. Googlebot, unless faked, should not be requesting it IMO.

lucy24

10:36 pm on May 16, 2017 (gmt 0)

Are you sure it is Googlebot

Sample from 2015:
66.249.67.180 - - [29/Sep/2015:15:30:32 -0700] "GET /apple-app-site-association HTTP/1.1" 404 1432 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 
although now that I look more closely, there were also some requests from--the plot thickens--the Applebot at 17.blahblah. Only back then; not now.

Sample from this month:
66.249.64.188 - - [14/May/2017:05:37:08 -0700] "GET /apple-app-site-association HTTP/1.1" 404 6256 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 
I'd call that the Googlebot, wouldn't you?
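
For anyone who wants to run the same check on their own logs, here is a minimal Python sketch that pulls the IP, path, status and user-agent out of a combined-format log line like the samples above. The names `LOG_PATTERN`, `parse_line` and `looks_like_googlebot` are made up for illustration, and the 66.249.64.0/19 network is an assumption used only as an example crawl range, not an authoritative list. A matching range and user-agent still isn't proof; a reverse DNS lookup would be the stronger test.

```python
import ipaddress
import re

# Combined Log Format:
# ip ident user [time] "method path protocol" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# ASSUMPTION: illustrative crawl range only; check the crawler's own
# published ranges before relying on this.
ASSUMED_CRAWL_RANGE = ipaddress.ip_network("66.249.64.0/19")


def parse_line(line):
    """Return the named fields of one log line, or None if it doesn't parse."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def looks_like_googlebot(entry):
    """True when both the IP range and the user-agent string agree."""
    in_range = ipaddress.ip_address(entry["ip"]) in ASSUMED_CRAWL_RANGE
    return in_range and "Googlebot" in entry["agent"]


sample = (
    '66.249.64.188 - - [14/May/2017:05:37:08 -0700] '
    '"GET /apple-app-site-association HTTP/1.1" 404 6256 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
entry = parse_line(sample)
print(entry["path"], entry["status"], looks_like_googlebot(entry))
```

A request that claims the Googlebot user-agent from an address outside the assumed range would come back False from `looks_like_googlebot`.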

I re-checked ".well-known". Nobody except the recent days' Googlebot. Besides, fakers would have been blocked two different ways; most of the time I only notice 404s.

".well-known" is a file that many shared hosts (ah em) put in accounts
Not mine, evidently, since Googlebot didn't find one. (Leading . in a directory name? Even apache dot org doesn't do that.)

Oh well. Back to futile endeavors to (a) make cat understand that I did not cause the rain, and cannot stop it, and (b) figure out what, exactly, Eliza Acton means by “udder”, and (c), last but not least, try to remember where I read an explanation of the site-association thing.

Edit:
I am morally certain I did not read it here first (Option B is That Other Forum, which is also unlikely) but here's the horse's mouth explanation [developer.apple.com]. It doesn't, of course, explain why either Google or the Applebot was looking for it at those precise times.

keyplyr

10:53 pm on May 16, 2017 (gmt 0)

If you are using Let's Encrypt, the .well-known file facilitates installation on your account in a shared server environment:
'Let's Encrypt' adds a .well_known/ folder to your site to authenticate the certificate.
source: help.dreamhost.com

Not saying you should have this file, just saying what it is and why a bot may look for it. Starting with a dot hides it of course. Why Googlebot is asking for it remains a mystery.

For a while now Googlebot has been picking up file paths from the dark web where unknown malevolent entities reside.

lucy24

12:27 am on May 17, 2017 (gmt 0)

Starting with a dot hides it

There's a prefs setting in Fetch to show or hide things with a leading dot. I just re-checked: there aren't separate options for files and directories. (Unlike, say, generic access controls that deny direct access to files with a leading dot, but say nothing about directories.) I've always gone with the "show" option, because how else would I see my htaccess files?

Maybe Let's Encrypt adds it temporarily and then takes it away again. I'm reasonably sure that if I can't see it, it doesn't exist.

Oh, and -- D'oh! -- the top level of my userspace has things like a .php directory which in fact starts with a dot. I've seen it dozens of times.

Further tangential: When Google first learned that this site could be reached by https, their full top-to-bottom crawl included at least two roboted-out directories. (It did not include a page I'd deleted only a week or so earlier: this was a start-from-scratch spidering, not https applied to familiar URLs.) They only requested things that they legitimately knew about, via links on other pages. Luckily, in approved belt-and-suspenders mode, every individual page in the /boilerplate/ directory--mostly things like error documents--has a noindex meta, so I need not worry that the content of my Legal page will suddenly show up in searches. And robotic requests for anything in /piwik/ are categorically blocked. Which, ahem, serves them right.

Edit: The Googlebot's first visit to the https site was precisely 1 minute and 10 seconds after my own first visit. I didn't add it to GSC until two days later, so how did they find out so fast? Do they talk to the certificate people?

keyplyr

12:36 am on May 17, 2017 (gmt 0)

If you are in fact using a Let's Encrypt SSL certificate, they say to add this to your htaccess file:
RewriteRule ^.well-known/(.*)$ - [L]
...to allow the .well-known/ directory to be installed, which (upon further reading) is necessary to update the cert's key every 90 days.

phranque

12:55 am on May 17, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



(Leading . in a directory name? Even apache dot org doesn't do that.)

this is a unix convention.
directory and file names starting with a dot are omitted from ls command results by default.
the -a argument overrides this default.

https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
In Unix, a leading period means the file or folder is normally hidden.
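
The same convention is easy to imitate in code. Below is a minimal Python sketch, assuming a throwaway temp directory and a hypothetical helper called `list_entries`, that hides dot-prefixed names unless asked, roughly the way ls does with and without -a. One difference: `os.listdir` never returns the `.` and `..` entries, whereas ls -a shows them.

```python
import os
import tempfile


def list_entries(directory, show_hidden=False):
    """Return sorted names, skipping dot-prefixed ones unless show_hidden is set."""
    names = sorted(os.listdir(directory))
    if show_hidden:
        return names
    return [name for name in names if not name.startswith(".")]


# Build a scratch directory containing hidden and visible files and folders.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, ".well-known"))
    os.mkdir(os.path.join(root, "images"))
    open(os.path.join(root, ".htaccess"), "w").close()
    open(os.path.join(root, "index.html"), "w").close()

    visible = list_entries(root)                        # like plain ls
    everything = list_entries(root, show_hidden=True)   # like ls -a

print(visible)
print(everything)
```

Note that the hiding applies to files and directories alike: it is purely a naming convention, not an access control.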

phranque

12:58 am on May 17, 2017 (gmt 0)

this is boilerplate for most httpd.conf files:
#
# The following lines prevent .htaccess and .htpasswd files from being viewed by web clients.
#
<FilesMatch "^\.ht">
Order allow,deny
Deny from all
Satisfy All
</FilesMatch>

lucy24

3:20 am on May 17, 2017 (gmt 0)

this is boilerplate for most httpd.conf files

Yes, that was my point: there's a generic block on files beginning with .ht (or, if you so choose, any . at all), but nothing equivalent for directories.
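
To make the file-versus-directory distinction concrete, here is a rough Python model of the idea, not Apache's actual implementation: the `^\.ht` pattern is tested against the final file name only, so a dot-directory higher up the path is never examined. The function name `blocked_by_filesmatch` and the sample paths are invented for illustration.

```python
import posixpath
import re

# Same pattern as the quoted <FilesMatch "^\.ht"> block.
DOT_HT = re.compile(r"^\.ht")


def blocked_by_filesmatch(url_path):
    """Simplified model: test only the last path component, as a file name."""
    filename = posixpath.basename(url_path)
    return bool(DOT_HT.search(filename))


for path in (
    "/.htaccess",
    "/.well-known/assetlinks.json",
    "/.well-known/.htpasswd",
    "/.hidden/index.html",
):
    print(path, blocked_by_filesmatch(path))
```

In this model only the two `.ht*` files are denied; anything else inside a dot-directory goes through unless some other rule covers it.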

they say to add this to your htaccess file:

RewriteRule ^.well-known/(.*)$ - [L]
Over my dead body ;) Now, maybe
RewriteRule ^\.well-known/ - [L]
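
The difference between the two patterns is just the unescaped dot, which in a regular expression matches any single character. A small Python sketch, assuming Python's `re` behaves like mod_rewrite's regex engine for these simple patterns, and using made-up test paths without the leading slash (as a rule in htaccess would see them):

```python
import re

# The rule as published: the leading dot is unescaped, and the capture is unused.
LOOSE = re.compile(r"^.well-known/(.*)$")
# The tightened version: literal dot, no pointless capture.
STRICT = re.compile(r"^\.well-known/")

candidates = [
    ".well-known/assetlinks.json",  # the intended target
    "xwell-known/anything",         # any character in place of the dot
    "well-known/file",              # no leading character at all
]

for path in candidates:
    print(path, bool(LOOSE.search(path)), bool(STRICT.search(path)))
```

The loose pattern also accepts `xwell-known/anything`, while the escaped one matches only the real directory name.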

...to allow the .well-known/ directory to be installed, which (upon further reading) is necessary to update the cert's key every 90 days.

I don't perfectly understand how the quoted RewriteRule--which simply says "If the request is for anything in the /.well-known/ directory, then take no further mod_rewrite action"--would have any effect on installing, let alone writing to, the directory. Especially since the rule's effect depends entirely on where exactly it is located. The intention seems to be to put it first, though they never say so--but that would still have no effect on actions taken by other mods. (And what on earth does this directory have to do with apple-app-site-association? Only Google introduced the directory motif; Apple looked for it at the root.)

Besides, I'm quite sure the config file categorically blocks all PUT requests (logged responses are variously 403 or 405). So they'd already need to override the site's ordinary access controls. Which the host, of course, is perfectly able to do whenever they feel like it.

:: detour to host's Information area, which is where the wiki bookmark now takes me ::

Hmm.

Nahh. I think if there's a problem I'll see a form letter. They did say that they start with a self-signed certificate while waiting for the real thing to propagate, so We Shall See. And in the meantime I could see that Firefox was absolutely ecstatic when I went into piwik by https instead of http :)

:: vague mental picture of Bob Ross painting happy little green padlocks ::

Back to Weirdest Reality Show Ever (on PBS).

keyplyr

3:27 am on May 17, 2017 (gmt 0)

I'm just passing along published information from the hosting company and the certificate company. If you have disagreements with how they are writing their code, I'm sure they'd be tickled to hear from you :)