At Home with the Robots: 2015 Edition
It's been two years. Time to see what the robots are up to.
2012 edition [webmasterworld.com]
2013 edition [webmasterworld.com]
I skipped last year because I moved sites in late December and the search engines were still in "what the ### is going on?!" mode.
The Good...
2. We Try Harder.
Why have I put #2 before #1? Because the bingbot was, once again, more active than the googlebot-- almost half again as many requests overall. But unlike past years, this was not because of a morbid appetite for robots.txt. In the entire month of January-- hold on to your hats-- the bingbot only requested robots.txt 60 (sixty) times. And even this figure is misleading. On a couple of days they read robots.txt up to 5 times, much like the bingbot of old. To make up for it, there were spells when they went over 48 hours without a single robots.txt request.
bingbot
IP ranges: 157.55, 207.46
Some formerly popular ranges seem to have disappeared: 65.52 was rare after January 2014 (a year ago); in the past year, 131.253 seems only to be used for WMT (site verification).
UA: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
bingbot mobile
This is a brand-new UA. It first showed up on this site on 15 January, halfway through the very month I was looking at.
IP ranges: 157.55, 207.46 (same as ordinary bingbot)
UA: Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; http://www.bing.com/bingbot.htm)
and
Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Can you spot the difference? Bing-- by any name-- has always had issues with double spaces.
Behavior: Unlike the more familiar mobile Googlebot, this new mobile bing gets all kinds of files-- images as well as pages. On the other hand, it is never used for robots.txt requests. I don't know how this would work if you wanted to set separate rules for the mobile bingbot, since there's nothing readily distinctive in its name.
So far, this UA is rare: only about 1/25 of all bing requests. I expect it to become more common, though.
Punch line: Yup, the bingbot is using an iPhone UA. Har de har har.
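For anyone who wants to count the mobile bingbot separately when going through logs, here is a minimal Python sketch. It assumes the only usable markers are the "bingbot/" and "iPhone" tokens visible in the UA strings quoted above; the function name and the labels it returns are made up for illustration, not anything Bing documents. It only helps with log analysis and does nothing for robots.txt, which is where the naming problem really bites.

```python
import re

def classify_bing_ua(ua: str) -> str:
    """Label a User-Agent using only tokens seen in the UAs quoted above.

    The labels ("bingbot", "bingbot-mobile", "other") are hypothetical.
    """
    # Collapse runs of whitespace so stray double spaces don't matter.
    ua = re.sub(r"\s+", " ", ua).strip()
    if "bingbot/" not in ua:
        return "other"
    # Assumption: the iPhone token is what sets the mobile crawler apart.
    return "bingbot-mobile" if "iPhone" in ua else "bingbot"
```

Feeding it each UA field from a log and tallying the labels would give the desktop-versus-mobile split for the month.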
msnbot-media
IP: 65.55 only, but its visits were so rare, I don't know if I can lay down a rule
UA: msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
Behavior: Same as two years ago. Robots.txt, one image file, that's it.
msnbot
Enjoying its retirement. I haven't set eyes on it since last August, and even then it primarily asked for the robots.txt-plus-sitemap combo.
Bing Preview
IP: 199.30, 65.55
UA: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
and
Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 BingPreview/1.0b
Behavior: The odd feature of these UAs is that they never request a page, and only occasionally a stylesheet. In general they request only the supporting files (images, fonts, scripts) belonging to that page-- which they give as referer, just like a human. From this you have to conclude that they're requesting the supporting files invoked by any given page the last time they crawled it, which may or may not be identical to what the page uses today.
The iPhone preview showed up at mid-month, at pretty exactly the same time as the mobile bingbot.
Plainclothes bingbot
IP: 65.55, 131.253, 157.56
UA: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)
Behavior: If I'd thought of it, I would have unblocked them to see if there was any change in behavior. But I didn't, so I can't. They requested assorted pages, each time accompanied by the supporting files that a human would have got from the 403 page. But they never requested the favicon (not explicitly linked from any page) the way a bona fide human would have done. Requests always included the with-script version of piwik (analytics), though I didn't let them have it.
It's been several years and I remain at a complete loss as to what the plainclothes bingbot is for. I doubt it's humans surfing on their own time in Redmond ... especially since one request this month came in at 4:30 AM, and two others on a Saturday.
bing site authorization
IP: 131.253.38.67 (always this exact IP)
UA: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 7.1; Trident/5.0)
This IP-UA combination is only used for requesting the /BingSiteAuth.xml file that goes with wmt.
1. We're Number One!
How does it go again? First in war, first in peace, last in the American L-- Whoops, no, I'm thinking of something else.
IP for all activities: 66.249.64-95
Two formerly well-known Google IPs seem to be on hiatus. I last saw 74.125 in May of 2013; 72.14 was last seen in July 2014, and then only for site verification (wmt).
Behavior: Not exactly new, but I only just noticed it. The Googlebot-- including mobiles and images-- follows all redirects within the second, almost like a human. The only exception is when they've already crawled the new URL within the past hour; then there's no repeat request.
I notice a fair number of requests for some-garbage-string.html, ending in a 404 response. I believe this is programmatically triggered any time a site yields an unexpected number of redirects; they're checking for Soft 404s.
Googlebot
UA: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Filetypes: html, css, js, pdf
Behavior: Unlike some past years, the googlebot never requested an image this month. It may not have stopped entirely, though; I found one request as recent as December 2014. Sometimes it does still send a referer when requesting .css or .js files.
New quirk: Late last year, the googlebot took it into its head that URLs in one directory contained a double slash, like /directory//subdir/pagename.html. I've never pinpointed the reason, but they're contentedly following redirects.
mobile Google
UA (see below): Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
and
SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
and
DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
Note the first form. Up until February of 2014, the UA for their most common mobile crawler specifically said "Googlebot-Mobile":
Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
After February of 2014, with a very brief overlap, they changed to the current UA string with "Googlebot/" instead of "Googlebot-Mobile/". At the same time, they jumped from iOS 4 to iOS 6.
The assorted mobiles together make up about 1/6 of all Google requests. Unlike mobile bing, these UAs only request pages. I find it interesting that they never request scripts or stylesheets. Do they look at my @media rules and conclude that everyone gets the same css?
Googlebot-Image
UA: Googlebot-Image/1.0
Behavior: Most requests get a 304 response. I don't know if this is a reflection of differing request headers or my own server's behavior. The same applies to pdfs, which are requested by the regular Googlebot.
blank.html
I should mention this here, because it seems to play a role in mobile image search. Requests come with the referer http://www.google.tld/blank.html. Current UAs are in no way limited to mobiles, though.
Google Favicon
UA: Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 Google favicon
This new UA featuring FF 6 showed up in November of 2013. I guess it's an improvement over the old one with the blank UA. Concurrently they changed from the original 74.125 IP to the same IP as all other Googlebots.
Behavior: No special treatment; that FF 6 UA gets them redirected straight to an old-browsers page. Fortunately this doesn't prevent them from getting the favicon, which after all was the real purpose of the visit. They did this a total of 15 times during the month. Don't know what they did with them all; do many sites change favicons every other day?
Google Preview
IP: 66.249 (same as crawl) and 64.233
UA: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/27.0.1453 Safari/537.36
The 64.233 IP appears to have been on an extended vacation
:: insert witticism about Matt Cutts here ::
It went away in January 2012 and didn't reappear until April 2014.
Behavior: As far as I can tell, there is no difference between the Preview triggered by a human search-- assuming this still exists?-- and the Preview you get in WMT.
Tip: In logs, you can easily tell which ones are WMT previews, because the package will include one redirected request for the front page. That's your domain-name-canonicalization redirect at work.
Google Leftovers
UA: various
Search me. The AppEngine is happily not as much in evidence as it used to be. I see some GoogleImageProxy and a couple of MSIE 8 visits, with Google IP but giving google.pl as referer. They're not Translate; those come with an X-Forwarded-For header. Pending solid information, I've been blocking any request from a Google IP that contains neither "Google" in the UA nor a Forwarded header.
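The blocking rule in that last sentence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "Google IP" is taken to mean the 66.249.64-95 crawl range (which works out to 66.249.64.0/19), "a Forwarded header" is taken to mean X-Forwarded-For, and the function and variable names are invented. A real setup would more likely live in the server configuration.

```python
import ipaddress

# Assumption: "Google IP" means 66.249.64.0 through 66.249.95.255,
# which is exactly the /19 below. Other Google ranges are not covered.
GOOGLE_CRAWL_NET = ipaddress.ip_network("66.249.64.0/19")

def should_block(ip: str, user_agent: str, headers: dict) -> bool:
    """Hypothetical helper: True for a request from the Google range that
    has neither "Google" in its UA nor an X-Forwarded-For header."""
    if ipaddress.ip_address(ip) not in GOOGLE_CRAWL_NET:
        return False  # the rule only concerns requests from the Google range
    has_google_ua = "Google" in user_agent  # case-sensitive, as written above
    # Header names are case-insensitive, so compare in lowercase.
    has_forwarded = any(name.lower() == "x-forwarded-for" for name in headers)
    return not (has_google_ua or has_forwarded)
```

The Googlebot, the favicon fetcher and GoogleImageProxy all pass because their UAs contain "Google"; Translate passes because of the forwarded header.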
3. Back in the USSR
Oops, er, I guess it's Russia. (Also Turkey. Yandex seems to be big there too.)
IP: 100.43.91.18, rarely 5.255, 37.140, 100.43.something-else, 178.154
UA: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Other IPs and UAs seem to have retired. They've been crawling from the identical IP (down to the last digit) since February 2013; I haven't seen their US range 199.21 since last June. This month, more than 98% of their visits were from the same IP.
The YandexImages UA hasn't been around since last fall; YandexFavicons last showed its face in November 2014, though it was always rare. The same YandexBot from the same IP now gets all filetypes. As with Google, image files (but not pdfs) tend to receive a 304.
Seznam
SeznamBot
IP: 77.75.73, 77.75.77
UA: Mozilla/5.0 (compatible; SeznamBot/3.2; +http://fulltext.sblog.cz/)
They seem to have changed UAs pretty exactly a year ago; it used to be
SeznamBot/3.0 (+http://fulltext.sblog.cz/)
Version 3.1 must be experimental, like Apache; I've never seen it.
Behavior: When they request images, it tends to be in large batches all at once. Unlike the Big Three, they don't immediately follow redirects.
Seznam Preview
IP: 77.75.77.123 (worth noting because, although I only saw them once this month, they used the identical IP two years ago)
UA: Mozilla/5.0 (compatible; Seznam screenshot-generator 2.1; +http://fulltext.sblog.cz/screenshot/)
exabot
IP: 178.255.215.77 (exactly: Exabot), 178.255.215.89 (exactly: BiggerBetter)
UA: Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
and
Mozilla/5.0 (compatible; Exabot/3.0 (BiggerBetter); +http://www.exabot.com/go/robot)
These two UAs, each with their own dedicated IP, have operated in tandem for at least two years. I think they're popular in France.
Mail.RU (their casing, not mine)
IP: 217.69.133
UA: Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)
Behavior: For several years I blocked them from all images due to unsavory behavior, very much like the worst kind of Ukrainian robot. This seems to have ended pretty suddenly in February of 2013, to be replaced with ordinary search-engine-like crawling. This month they didn't request any images at all. Like most search engines except Seznam, they follow redirects pretty promptly.
DuckDuckGo favicons
IP: 107.23.45.196
I list them here only because DDG is a respectable search engine. But this isn't quite respectable behavior; in fact I had no idea it was happening until this month, because normally I ignore 403s unless they're part of a botnet I'm tracking.
107.23.45.196 - - [02/Jan/2015:08:23:28 -0800] "GET / HTTP/1.1" 403 3357 "http://example.com/" "Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)"
... a total of 22 times in the course of the month. Unfortunately this got them a double-barreled lockout: first because of the IP (part of an AWS /14) and again because of the auto-referer. Perversely, I would have been happy to let them have the favicon-- same as with Google's faviconbot-- but they never asked.
It looks like they just started doing this in December 2014. I think the favicon is displayed with search results.
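If you want to dig requests like this out of your own logs, something along these lines might do. It's a rough Python sketch, assuming the Apache "combined" log format shown in the sample line and assuming that "auto-referer" means a request naming its own URL as referer. The regex, function names and the site_host parameter are all illustrative, and query strings are ignored for simplicity.

```python
import re
from urllib.parse import urlsplit

# Pattern for the "combined" log format used in the sample line above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line: str):
    """Return the log fields as a dict, or None if the line doesn't match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def is_auto_referer(entry: dict, site_host: str) -> bool:
    """True when the referer is the very URL being requested on this site."""
    ref = urlsplit(entry["referer"])
    # An empty referer path ("http://example.com") is treated as "/".
    return ref.hostname == site_host and (ref.path or "/") == entry["path"]
```

Run over a month of logs with a filter on "DuckDuckGo-Favicons-Bot" in the UA field, this would turn up the front-page requests described above.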
But what happened to...?
Whatever happened to Korea's big search engine, Yeti? Did I miss a memo? They disappeared abruptly in August 2013. Other once-familiar faces include:
YioopBot: last seen September 2014
TurnitinBot: sporadic
MJ12 and Gimme60: barely visible
To be continued...