At Home Without the Robots: 2023 edition

         

lucy24

7:22 pm on Jul 16, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As we know, I’ve been tracking robots for years now. Sometimes I look around and realize that some once-familiar visitor hasn’t shown its face in years. Here are some that were once regular visitors--in some cases, familiar enough to go on the Ignore list--and then one day they’re gone.

I’ve tried to shift everything into the past tense, but have probably missed a few.

Search Engines

France: Exabot

IP: 178.255.215
UA: Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
Last seen: April 2021

The search engine is called, inexplicably, Exalead. Or was called: since my site has no French-language content, they never came around much, and I don’t actually know whether they have retired or have just retired from my site. Although they periodically looked at the sitemap, I don’t think they ever did a full spidering; they came in and asked for specific pages. And finally they stopped coming altogether.

Japan: ichiro

IP: 153.254.146.10
UA: ichiro/3.0 (http://search.goo.ne.jp/option/use/sub4/sub4-1/)
Last seen: December 2020

Well, I think it’s a search engine, but I don’t read Japanese so I can’t swear to it. Almost all their requests were for images. And, finally—stop me if you’ve heard this one—the URL in the UA string redirects. Or did, last I checked. It’s been gone for over two years, so it seems safe to say it has retired.

Targeted Robots

On my site, “targeted” means they came in response to some particular stimulus, such as an RSS feed. Some of them visited diligently for years—and then one day they stopped showing up. This list could be three times as long. Here I’ve only shown the ones that were not comprehensively blocked, whether from natural causes or by fiat.

Buzzbot

IP: distributed
UA: Buzzbot/1.0 (Buzzbot; http://www.buzzstream.com; buzzbot@buzzstream.com)
Last seen: March 2019

This robot may hold the record for Longest Absence. It disappeared in July 2017 . . . and then, in March 2019, suddenly showed up again, acting as if it had never been away. And I haven’t seen it since.
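
For anyone who wants to check a Longest Absence claim against their own records, here is a minimal sketch. It assumes you have already pulled one robot's visit dates out of the logs; the dates in the example are invented placeholders that only match the months mentioned above.

```python
from datetime import date

def longest_gap(visit_dates):
    """Return (days, last_seen_before_gap, first_seen_after_gap), or None."""
    ordered = sorted(set(visit_dates))
    if len(ordered) < 2:
        return None  # a gap needs at least two distinct visits
    gaps = [
        ((later - earlier).days, earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
    ]
    # Pick the pair of consecutive visits that are furthest apart.
    return max(gaps, key=lambda gap: gap[0])

# Hypothetical visit dates, not taken from real logs.
visits = [date(2017, 7, 3), date(2017, 7, 10), date(2019, 3, 5)]
days, before, after = longest_gap(visits)
print(f"Absent for {days} days, from {before} to {after}")
```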

DeuSu

IP: 85.93.91.84
UA: Mozilla/5.0 (compatible; DeuSu/5.0.2; +https://deusu.de/robot.html)
Last seen: March 2018

DeuSu only understood Disallow when given a section to themselves in robots.txt, but once this was sorted out they were fully compliant. Another archaism is that their requests came in as HTTP/1.0—a protocol that is becoming increasingly rare. In fact some sites use it as an automatic block criterion, so this may have been a case of Adapt Or Die.
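
To illustrate what "a section to themselves" looks like, here is a small sketch using Python's standard urllib.robotparser. The robots.txt content and the /private/ and /drafts/ paths are made up for the example. It shows how a standard parser reads a dedicated block, not how DeuSu itself parsed the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one block for DeuSu alone, plus a catch-all block.
ROBOTS_TXT = """\
User-agent: DeuSu
Disallow: /private/

User-agent: *
Disallow: /private/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A robot with its own block follows that block and ignores the catch-all.
deusu_private = parser.can_fetch("DeuSu", "/private/page.html")
deusu_drafts = parser.can_fetch("DeuSu", "/drafts/page.html")
other_drafts = parser.can_fetch("SomeOtherBot", "/drafts/page.html")

print(deusu_private, deusu_drafts, other_drafts)
```

The practical consequence is that anything you want a named robot to stay out of has to be repeated inside its own block.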

ExtLinksBot

UA: Mozilla/5.0 (compatible; ExtLinksBot/1.5; +https://extlinks.com/Bot.html)
Last seen: March 2019

FlipboardProxy

IP: distributed
A single visit might cycle among 3, 18, 34, 35, 50, 52, 54, 107 and probably others I’ve overlooked. Their web page says “the Amazon EC2 cluster”, otherwise known as The Usual Suspects.
UA:
  Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
  Mozilla/5.0 (compatible; FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
  Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0 (FlipboardProxy/1.6; +http://flipboard.com/browserproxy)
Last seen: February 2020

The two 1.2 User-Agents generally came in pairs: the long version without referer, the short version with RSS as referer. Generally it stopped by twice in the course of a day or two. The 1.6 UA was used only for images, picking up a single copy of everything associated with the file that its siblings had most recently collected. It never got scripts or stylesheets.

Laserlikebot

IP: variously 35 and 104
UA: Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Laserlikebot/0.1)
Last seen: August 2019

Sometimes it picked up one image associated with the current page, but generally only the page. If there were new pages associated with the latest one—such as an ebook divided into chapters or installments—it would come back for those a few minutes later.

NetVibes

IP: 193.189.143
UA: Netvibes (http://www.netvibes.com)
Last seen: April 2018

Although it never asked for robots.txt, I counted it on the Acceptable side because it appeared to involve human recommendations. It may still exist; it just hasn’t seen a Like for this site lately.

rogerbot

IP: 209.133.111; 207.126.118
UA: rogerbot/1.0 (http://moz.com/help/pro/what-is-rogerbot-, rogerbot-crawler+shiny@moz.com)
Last seen: May 2021

It looks as if the rogerbot couldn’t do HTTPS; from the time my site went secure, it requested nothing but robots.txt. (As noted elsewhere, I exempt robots.txt from canonicalization redirects.) I don’t know why it bothered, since it often followed links that clearly said HTTPS. Whether because of the HTTPS barrier or some other reason, it finally gave up in May 2021.

Uptimebot

UA: Mozilla/5.0 (compatible; Uptimebot/1.0; +http://www.uptime.com/uptimebot)
Referer: http://uptime.com/example.com
Last seen: June 2019

UptimeBot was exceedingly slow when it came to robots.txt. But after a month or two of Disallow they eventually got the message and stopped requesting pages, at which point I authorized them. Most of the time they just did a HEAD / (“Does the root page exist? OK, we’re good”) which is hardly a server-intensive request.

Sometimes the referer varied: “uptime-us.net”, “uptime-as.net” and possibly others I’ve overlooked. And sometimes there was no referer. If there’s a pattern, it’s more than I can figure out.

VenusCrawler

IP: 68.74; 76.14
UA: VenusCrawler/Nutch-1.12 (crawler@mycompany.com)
Last seen: June 2018

This, too, counted as an acceptable robot. I did say that my standards are not very exacting.

This and That

In years past I saw a lot of them—often enough that they got authorized. Some may even have made it to the Ignore list. And then they went away.

BacklinkCrawler

IP: 5.9.65.19
UA: BacklinkCrawler (http://www.backlinktest.com/crawler.html)
Last seen: April 2018

I’ve seen a robot with this name as recently as December 2018, but it may be an impostor.

Blekkobot

IP: 38.99.96-97, 199.87
UA: Mozilla/5.0 (compatible; Blekkobot; ScoutJet; +http://blekko.com/about/blekkobot)
Last seen: May 2015

BUbiNG

UA: BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
Last seen: May 2018

“BUbiNG is a scalable, fully distributed crawler, currently under development and that supersedes UbiCrawler.”

UbiCrawler must have been before my time, since I have no record of ever seeing it.

Although I saw it from two different IP ranges belonging to two different hosts, and the robot is ultimately open source, there were no major differences in their behavior. I put them in the “No skin off my nose” category.

Cliqzbot

UA:
  Mozilla/5.0 (compatible; Cliqzbot/1.0; +http://cliqz.com/company/cliqzbot)
  Mozilla/5.0 (compatible; Cliqzbot/2.0; +http://cliqz.com/company/cliqzbot)
  Mozilla/5.0 (compatible; Cliqzbot/3.0; +http://cliqz.com/company/cliqzbot)
Last seen: June 2021

“Was genau ist Cliqzbot?” Another of those targeted searches, I think.

Cliqzbot moved up to 2.0 in June 2017, continuing through August 2019, and then returned with 3.0 in November 2019 after an unexplained break.

At some time while I wasn’t paying attention, Cliqzbot simply disappeared; I last saw it in June 2020. There are reports that its code was bought by Brave—makers of the adorably titled Brave Browser—with plans of creating an independent search engine. So far, that’s all we know.

crawler4j

UA: crawler4j (https://github.com/yasserg/crawler4j/)
Last seen: January 2021

Like some old-fashioned robots, it only recognized its name if it was given a block to itself in robots.txt. After I figured this out, I tried authorizing it, but that doesn’t seem to have made any difference. From the end of 2018 it showed itself only two or three times, and never asked for anything but robots.txt.

ExactSeek

IP: 74.63.43.213
UA: ExactSeek Crawler (nutch 1.4)/Nutch-1.4 (ExactSeek Crawler; http://www.exactseek.com)
Last seen: May 2014

Ezooms

IP: 208.115.111, ..113
UA: Mozilla/5.0 (compatible; Ezooms/1.0; help@moz.com)
Last seen: April 2014

ForcePoint

IP: 96.67.162.200
UA: ForcePointCrawler/Nutch-1.13-SNAPSHOT
Last seen: August 2018

GarlikCrawler

IP: 185.26.92.74
UA: GarlikCrawler/1.2 (http://garlik.com/, crawler@garlik.com)
Last seen: November 2020

Visits ranged all over the map, from a single page to the entire site. This made it one of those indirectly useful robots that help point out bad links within the site: if there’s a 404 in the middle of its crawl, I know I need to investigate my internal links.
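
A rough sketch of the 404 check, for anyone who wants to automate it. It assumes logs in the common "combined" format; the two sample lines and the /ebooks/missing.html path are invented for illustration.

```python
import re

# Assumed layout: Apache/nginx "combined" log format.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def not_found_paths(lines, agent_token):
    """Return the paths that earned the named robot a 404."""
    paths = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and agent_token in match["agent"] and match["status"] == "404":
            paths.append(match["path"])
    return paths

# Made-up sample lines, not real log entries.
sample = [
    '185.26.92.74 - - [10/Nov/2020:04:12:01 +0000] "GET /ebooks/ HTTP/1.1" '
    '200 5120 "-" "GarlikCrawler/1.2 (http://garlik.com/, crawler@garlik.com)"',
    '185.26.92.74 - - [10/Nov/2020:04:12:03 +0000] "GET /ebooks/missing.html HTTP/1.1" '
    '404 312 "-" "GarlikCrawler/1.2 (http://garlik.com/, crawler@garlik.com)"',
]
broken = not_found_paths(sample, "GarlikCrawler")
print(broken)
```

Each path that comes back is a candidate for a bad internal link, assuming the robot found it by following links on the site rather than from an old list.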

Like other robots listed here, it was active for years and then vanished without a trace.

Gigabot

IP: 67.130.216.27
UA: Gigabot/2.0
Last seen: April 2014

MauiBot

UA: MauiBot (crawler.feedback+wc@gmail.com)
Last seen: July 2018

SafeDNSBot

UA: SafeDNSBot (https://www.safedns.com/searchbot)
Protocol: HTTP/1.0
Last seen: January 2021

Rarely requested more than one page on a visit—but when it did, they tended to be deep interior pages, implying that it was getting a shopping list from somewhere else.

This is another of those Now you see them, Now you don’t robots. Earlier, it was gone from September 2019 to January 2021 before making an unexpected reappearance, still using HTTP/1.0. As of today, those January 2021 visits remain its last sighting. We’ll see how long it lasts this time.

SiteExplorer

UA: Mozilla/5.0 (compatible; SiteExplorer/1.1b; +http://siteexplorer.info/Backlink-Checker-Spider/)
Last seen: August 2018

SiteExplorer was another of those robots that only understood Disallow if they were given a section to themselves in robots.txt, so I initially thought they were non-compliant. Later evidence suggested that they were just very, very slow on the uptake: in the course of April 2018 they picked up robots.txt eleven times, but never screwed up the courage to ask for a page until almost the end of the month.

Sonic the Robot

Just kidding. It’s:
IP: 133.9.84.100
Officially they used the range 133.9.84.100-116 and also 133.9.221.22, but I never saw them from anything but the exact IP cited here.
UA:
  Mozilla/5.0 (compatible; Sonic/1.0; http://www.yama.info.waseda.ac.jp/~crawler/info.html)
  Mozilla/5.0 (compatible; YamanaLab-Robot/1.0 (Sonic/1.0); http://www.yama.info.waseda.ac.jp/~crawler/info.html)
Last seen: January 2018

The element “YamanaLab-Robot/1.0” was added around the beginning of 2018. Yamana Labs is the Computer Science department of Waseda University, so the assorted robots from this neighborhood may have been class projects. Sometimes they made elementary errors like including a fragment # in their request, or leaving off a final directory slash/ although they had no reason to believe this is the preferred format.

spbot

UA: Mozilla/5.0 (compatible; spbot/5.0.3; +http://OpenLinkProfiler.org/bot )
Referer: as if human
Last seen: September 2018

Yes, there’s a space before the closing parenthesis. I’ve known people who use punctuation wonkies in the User-Agent as a block criterion. Each crawl used a new IP, so no help there.
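
As a sketch of what such a punctuation test might look like: the pattern below is an assumption about which oddities are worth flagging (a space just inside a parenthesis, or a run of spaces), not anyone's actual block rule.

```python
import re

# Hypothetical definition of a "punctuation wonky": whitespace directly
# before ")" or after "(", or two or more consecutive whitespace characters.
ODD_PUNCTUATION = re.compile(r"\s\)|\(\s|\s{2,}")

def has_odd_punctuation(user_agent):
    """True if the User-Agent string contains one of the flagged oddities."""
    return bool(ODD_PUNCTUATION.search(user_agent))

spbot = "Mozilla/5.0 (compatible; spbot/5.0.3; +http://OpenLinkProfiler.org/bot )"
exabot = "Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)"

print(has_odd_punctuation(spbot))   # True: space before the closing parenthesis
print(has_odd_punctuation(exabot))  # False
```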

I think their crawling happened on the fly: robots.txt, two forms of root (one of which got a 301), and then all other pages, from top to bottom, with the same referer a human would send. In the rare case that a page was linked from widely separated directories on the same site, the stated referer was whichever one the robot saw first.

Since they didn’t come in with a shopping list, there were never any 301s or 410s. This made it useful for record-keeping purposes: Count the number of spbot requests, subtract two, and that’s how many crawlable URLs you’ve got. (In September 2018, for example, I had 424 crawlable URLs.)
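
The arithmetic is simple enough to sketch. This assumes the lines you feed it cover exactly one spbot crawl, and that the two requests to subtract are robots.txt and the redirected root, as described above. The "spbot/" token is taken from the UA string.

```python
def crawlable_url_count(log_lines, agent_token="spbot/"):
    """Count requests from the robot, minus robots.txt and the redirected root."""
    requests = sum(1 for line in log_lines if agent_token in line)
    return max(requests - 2, 0)  # never report a negative count

# Usage, with a hypothetical file name:
# with open("access.log", encoding="utf-8", errors="replace") as log:
#     print(crawlable_url_count(log))
```

By this reckoning, the 424 crawlable URLs of September 2018 correspond to 426 requests in the log.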

SurveyBot

IP: 64.246.165, 216.145.14
UA: Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.13) Gecko/2009073022 Firefox/3.5.2 (.NET CLR 3.5.30729) SurveyBot/2.3 (DomainTools)
Referer: http://whois.domaintools.com/example.com
Last seen: March 2018

As with the UptimeBot, the referer includes the name of the site they’re visiting (minus “www”).

Consistent pattern: robots.txt and two requests for / root, the first of which receives a 301. (This means they first asked for it with leading “www”. Since robots.txt is not subject to domain-name canonicalization, it only needs to be requested once.)
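
The pattern can be sketched as a toy decision function. The hostnames are placeholders, and the choice of the bare domain as the canonical form is an assumption read off the description above, not a statement about any real server configuration.

```python
CANONICAL_HOST = "example.com"  # assumption: the form without "www" is canonical

def respond(host, path):
    """Return (status, redirect_target) under the rules described above."""
    if path == "/robots.txt":
        return 200, None  # exempt from domain-name canonicalization
    if host != CANONICAL_HOST:
        return 301, f"https://{CANONICAL_HOST}{path}"
    return 200, None

# The three requests in the pattern: robots.txt once, then root twice.
print(respond("www.example.com", "/robots.txt"))  # (200, None)
print(respond("www.example.com", "/"))            # (301, 'https://example.com/')
print(respond("example.com", "/"))                # (200, None)
```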

test Crawl

IP: 54.92.230.108
UA: test Crawl
Last seen: April 2017

I did say I didn’t have very high standards when it comes to authorizing robots. They were very active in the latter months of 2016; after a long gap they made their last visit in April 2017. They were only interested in the /ebooks/ directory: mostly pages, but the occasional stylesheet, and sometimes the first image on a page—regardless of whether that’s a full-color frontispiece or a little icon from the navigation banner.

TurnitinBot

IP: 38.111.147.88, 199.47.87.140
UA: TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
Last seen: August 2019

Another plagiarism checker, though unlike Blackboard Safeassign it never managed to get itself blocked. I don’t think it has actually gone out of business; more likely I just don’t have anything on any of its clients’ reading lists at the moment.

Understandably, their main interest is in the /ebooks/ directory. Editorial Aside: I can only hope the teachers using the service can distinguish between plagiarism of recent scholarship, and legitimate quotations from public-domain texts—which should be present, especially in papers written for literature classes. (I waste no sympathy on the students using the service, presumably to check whether the paper they purchased will pass inspection.)

Interesting quirk: The TurnitinBot alphabetizes its shopping list—at least on my personal site, which is small enough that a single brief crawl will cover the whole thing. And, just like my favorite text editor, their alphabetization places all capital letters before all lower-case letters, with the result that NellysPage.html and TheEgg.html both come before mischief.html, and Xanadu.html comes before online.html.
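
This ordering is what you get from a plain code-point sort, where every capital letter comes before every lower-case letter. Whether TurnitinBot sorted its list this way is only a guess; the sketch below just reproduces the ordering with the filenames mentioned above.

```python
pages = ["mischief.html", "NellysPage.html", "online.html", "TheEgg.html", "Xanadu.html"]

# Default string sort compares code points: "A"-"Z" all precede "a"-"z".
code_point_order = sorted(pages)
print(code_point_order)
# ['NellysPage.html', 'TheEgg.html', 'Xanadu.html', 'mischief.html', 'online.html']

# For comparison, a case-insensitive sort interleaves them.
case_insensitive_order = sorted(pages, key=str.casefold)
print(case_insensitive_order)
# ['mischief.html', 'NellysPage.html', 'online.html', 'TheEgg.html', 'Xanadu.html']
```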

VeriCiteCrawler

IP: 198.30.168
UA: VeriCiteCrawler/Nutch-1.9
Last seen: October 2017

Wotbox

IP: 94.199.151.22
It began using this IP around 2016. For several years before that, it came from 81.144.138.34.
UA: Wotbox/2.01 (+http://www.wotbox.com/bot/)
Last seen: December 2017

YioopBot

IP: 173.13.143
UA: Mozilla/5.0 (compatible; YioopBot; +http://173.13.143.74/bot.php)
Last seen: September 2014

That’s the “real” YioopBot. For a while, there was a faker or wannabe using approximately the same UA string.

Whew.

And now that that’s out of the way, I can archive my 2022 logs. I typically do it at mid-year, so at any given time I have at least 6 and no more than 18 months of logs readily available.

not2easy

8:55 pm on Jul 16, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Whew, indeed! Thank you lucy24. That is useful work, much appreciated. I've Pinned it to make it easier to find. I see a few I'm still blocking via UA so I will need to cross check.

engine

7:45 am on Jul 17, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Whew and wow, thank you.

tangor

10:34 pm on Jul 17, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's all in the details, and nobody does it better!

Thanks!

blend27

11:39 am on Jul 29, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



OK, ok, ok, ok...

I was about to ask where is it....

And there we go, Lucy24 did it again!

Like clockwork, year after year! And Nothing Compares to You (yep).

~~ “the Amazon EC2 cluster”, otherwise known as The Usual Suspects. <<< Priceless!