facebook full crawl - Crawler, Spider, and User Agent ID forum at WebmasterWorld

facebook full crawl

lucy24

7:14 pm on Jun 21, 2020 (gmt 0)

File under: “Now what are they up to?” because I don’t remember ever seeing this before.

IP: the usual FB ranges
UA: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
robots.txt: YES, apparently compliant
Status: 301 leading to mixed 206 and 200
Requests: most of site, excluding roboted-out directories (but see below)
Headers: identical to everyday FB
Protocol: HTTP redirected to HTTPS

I didn’t do a methodical count, but I tend to think they crawled the entire site, in no particular order, meaning that they’d picked up a shopping list somewhere. Some subdirectories were skipped, by no discernible pattern. In a behavior I’ve come to associate with the DotBot, all requests were initially made with HTTP--including pages that were created after the site went HTTPS.
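
For anyone who does want a methodical count, here is a minimal sketch, assuming the server writes Apache-style Combined Log Format. The helper name `tally_statuses` and the sample lines (documentation-range IPs, invented paths and sizes) are made up for illustration and are not taken from the logs discussed here.

```python
import re
from collections import Counter

# Combined Log Format: IP, identity, user, [time], "request", status, size,
# "referer", "user agent". Other log formats would need a different pattern.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def tally_statuses(lines, ua_fragment="facebookexternalhit"):
    """Count response codes for requests whose UA contains ua_fragment."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and ua_fragment in m.group("ua"):
            counts[m.group("status")] += 1
    return counts

FB_UA = "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"

# Invented sample lines, for illustration only.
SAMPLE = [
    f'192.0.2.10 - - [21/Jun/2020:18:00:01 +0000] "GET /a.html HTTP/1.1" 301 245 "-" "{FB_UA}"',
    f'192.0.2.10 - - [21/Jun/2020:18:00:02 +0000] "GET /a.html HTTP/1.1" 206 8123 "-" "{FB_UA}"',
    f'192.0.2.10 - - [21/Jun/2020:18:00:03 +0000] "GET /b.html HTTP/1.1" 200 9000 "-" "{FB_UA}"',
    '198.51.100.7 - - [21/Jun/2020:18:00:04 +0000] "GET /a.html HTTP/1.1" 200 8123 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for status, n in sorted(tally_statuses(SAMPLE).items()):
        print(status, n)
```

Feeding it an open log file instead of `SAMPLE` (for example `tally_statuses(open("access.log"))`, with a hypothetical file name) would give the 301/206/200 split for the whole crawl.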

About 90% of the HTTPS requests got a 206 rather than a 200; if there's a pattern I couldn’t spot it. I thought it might be because of the Range header
Range: bytes=0-524287
but very few of my pages are larger than 512K--probably less than 10%, rather than the 90% implied by the number of 206 responses.
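
One possible reading of the numbers: under the HTTP range rules (RFC 7233), a server that honours a satisfiable `Range` header answers 206 even when the whole file fits inside the requested range, so page size alone need not decide between 200 and 206. The sketch below models that rule for a single `bytes=first-last` range. The function name `resolve_range` is hypothetical, it assumes a non-empty file, and a real server is free to ignore the header and send 200 instead, which may be what happened for the remaining requests.

```python
import re

def resolve_range(range_header, size):
    """Model a server that always honours a single satisfiable byte range.

    Returns (status, first_byte, last_byte). Assumes size > 0.
    """
    m = re.fullmatch(r"bytes=(\d+)-(\d*)", range_header.strip())
    if not m:
        # Forms not handled in this sketch: treat the header as ignored.
        return (200, 0, size - 1)
    first = int(m.group(1))
    if m.group(2) and int(m.group(2)) < first:
        # last < first is an invalid range; the header is ignored.
        return (200, 0, size - 1)
    if first >= size:
        # Range starts past the end of the file: not satisfiable.
        return (416, None, None)
    # A last-byte position beyond the end is clamped to the final byte.
    last = min(int(m.group(2)), size - 1) if m.group(2) else size - 1
    return (206, first, last)

if __name__ == "__main__":
    print(resolve_range("bytes=0-524287", 20_000))     # (206, 0, 19999)
    print(resolve_range("bytes=0-524287", 1_000_000))  # (206, 0, 524287)
```

In this model a 20,000-byte page still gets a 206; the response simply carries the whole file.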

There was one odd exception to the pages-only rule. They also requested a few images ... but only ones that are not displayed with the page by default, such as close-ups or enlargements. And not all of those.

Alternate heading: wtf?

engine

7:16 am on Jun 22, 2020 (gmt 0)

WebmasterWorld Administrator

Thanks. Is it FB or forged?

JorgeV

9:55 am on Jun 22, 2020 (gmt 0)

Hello,

Do you mean that Facebook is crawling all pages of a site? The only visits I get from the Facebook bot (with verified IP address) are for pages that have actually been shared on Facebook. I assume this is to gather the OG data, and maybe check for broken links.

Now, if Facebook is crawling "everything", I wonder what the purpose is.
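
A "verified IP address" check of the kind mentioned above can be sketched with Python's `ipaddress` module. The networks below are PLACEHOLDERS taken from the reserved documentation blocks, not Facebook's real allocations; the ranges the operator currently announces would have to be substituted. `CLAIMED_RANGES` and `in_claimed_ranges` are hypothetical names.

```python
import ipaddress

# PLACEHOLDER networks (reserved documentation ranges), NOT real Facebook
# ranges. Replace them with the ranges the bot operator actually publishes.
CLAIMED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

def in_claimed_ranges(ip, networks=CLAIMED_RANGES):
    """Return True if ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    # An IPv4 address tested against an IPv6 network (or vice versa)
    # simply evaluates to False.
    return any(addr in net for net in networks)

if __name__ == "__main__":
    for ip in ("192.0.2.55", "198.51.100.7", "2001:db8::1"):
        print(ip, in_claimed_ranges(ip))
```

A request carrying the facebookexternalhit UA from an address outside the published ranges would be a candidate forgery; one from inside them is at least coming from the operator's network.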

dstiles

10:06 am on Jun 22, 2020 (gmt 0)

Over the past couple of months I've noticed hits from FB to URLs that certainly do not appear on its pages: almost-obsolete sites and even test sites that no real person would be interested in. On the other hand, these were not full scans but what I'd consider probes. I doubt more than a couple of my Apache sites are even interesting to FB participants. I'm considering selective access for the FB bot because of this.

wilderness

1:06 pm on Jun 22, 2020 (gmt 0)

lucy,
Over the years I recall a few partial crawls from FB IPs (although I did not document them), and also some requests from FB IPs where the FB UA was absent.
I don't allow FB IPs, so they get denied. Some image requests (referers) continue for months; I have one that has been going on for a year or more. FWIW, nearly all FB users are one-dimensional: they'll ask other users a question whose answer is freely available in the SEs.

lucy24

4:43 pm on Jun 22, 2020 (gmt 0)

“Is it FB or forged?”

The only way it could be faked is if some FB techie is moonlighting as a botrunner, using their IPs, their UA, their exact headers.

After posting, I continued to look more closely at the list. There might be more to come, as the last batch of requests was only about an hour before I downloaded logs. There were no, zero, requests for removed or renamed files, and requests included pages created as recently as a week ago. But it wasn't a top-to-bottom, follow-all-links spidering. So one part of the head-scratcher is: how do they know where to go?

Edit: I don't want to do a full analysis until I’m sure they are finished, but I pulled yesterday’s full logs (i.e. through about 3AM this morning) from the HTTP site and found many more batches, including repeat requests and many more of those image close-ups. The latter are especially puzzling because the only way you’d know about them is by getting the page and looking for <a href> with an image target. They live in the same directories as the ordinary, displayed-with-the-page images, which were not requested. That's the opposite of the usual FB behavior, where they start by getting a page and all of its associated images.
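
To illustrate the point about `<a href>` with an image target: a crawler would have to fetch and parse the page to tell linked close-ups apart from images displayed with the page. A minimal sketch using the standard-library `html.parser` follows; the class name `ImageLinkFinder`, the extension list, and the sample markup are all assumptions made for the example, and nothing here is known to be what Facebook's crawler does.

```python
from html.parser import HTMLParser

# Assumed set of image extensions; adjust to whatever the site uses.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")

class ImageLinkFinder(HTMLParser):
    """Separate images linked via <a href> from images embedded via <img src>."""

    def __init__(self):
        super().__init__()
        self.linked = []    # targets of <a href="..."> that look like images
        self.embedded = []  # sources of <img src="..."> shown with the page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            href = attrs.get("href") or ""
            # Ignore any query string when checking the extension.
            if href.lower().split("?", 1)[0].endswith(IMAGE_EXTENSIONS):
                self.linked.append(href)
        elif tag == "img":
            src = attrs.get("src")
            if src:
                self.embedded.append(src)

# Invented sample markup: a thumbnail that links to a close-up.
PAGE = (
    '<p><a href="images/detail_large.jpg">'
    '<img src="images/detail.jpg" alt=""></a></p>'
    '<p><a href="other.html">next page</a></p>'
)

finder = ImageLinkFinder()
finder.feed(PAGE)

if __name__ == "__main__":
    print("linked only:", finder.linked)
    print("displayed:  ", finder.embedded)
```

The behavior described in this post corresponds to requesting the `linked` list while skipping the `embedded` one.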

dstiles

8:33 am on Jun 23, 2020 (gmt 0)

If FB is implementing a proper search engine then I'm probably in favour, depending on their motive. However, if they are merely scraping anything online for their own purposes, goodnight externalhit.

engine

8:48 am on Jun 23, 2020 (gmt 0)

WebmasterWorld Administrator

If FB is creating search for all, fair enough. I can see why FB might do it as it'll give it access to so much more new data.

I can imagine it following links and data from its own system. After all, it already tracks people surfing, whether a FB user or not.