Is this what a dual-home'd residential hit looks like?

Seeing interleaved web-hit from different IP's

         

SumGuy

12:49 am on Oct 28, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



I noticed a hit to my website yesterday that is unlike any I've ever seen before. A hit to our default page normally results in 32 file-requests (1 html, 1 favicon.ico and 30 gif files). I've previously mentioned here that a few months ago I started to see incomplete or partial page-loads where the default.html and the first 2 gif files are requested - but no others. A hit yesterday started that way - but it came from 2 different IP addresses.

I'll call 74.87.56.x IP address A, and 75.85.156.x IP address B.

A and B are both road runner IP's, located in Hawaii.

The hit starts at time-zero: IP-A requests default.html and the first gif file, and IP-B requests the second gif file. The time-precision of my logs is only in terms of seconds, so this combined hit happened during the same second. Overall, it looks like a request for the html file and the first 2 gif's.

12 seconds later, all during the same second, IP-A and B take turns requesting the next 29 files. They do so mostly in groups of 3 (IP-A requests 3, then B requests the next 3, then A requests the next 3, etc).

Then 17 seconds later, during the same second, 26 other files are requested by A and B, also mostly in groups of 3 each, with the first 6 requests being for html files that are linked from the default.html file. An actual person would only select one of those at a time - each html file is associated with a different set of gif files.

The user-agent strings for A and B are identical:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36

There was a referrer transmitted with the first file-request, it was:

https://www.google.com/

All subsequent referrers were from my site/domain (as expected).

The IP's in question are not even in the same /16 subnet! Would they be part of the same dual-homed network? At someone's home? The time-coordination between the two IP's would indicate they have a close physical association with each other. The time of the hit would have corresponded to mid-afternoon (Hawaii time).

But the alternating file requests, in groups of 3, is what I find interesting, and the 12-second gap between grabbing the first 3 files and the remaining 29 files.
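
The /16 comparison can be checked mechanically. Below is a minimal sketch using Python's standard ipaddress module. The last octet of each address is masked in this post, so the .0 used here is only a placeholder (it does not affect a /16 comparison), and the helper name same_prefix is made up for illustration.

```python
import ipaddress

def same_prefix(addr_a, addr_b, prefix_len=16):
    """Return True if addr_b falls inside the /prefix_len network containing addr_a."""
    # strict=False lets us pass a host address and have the host bits masked off.
    network = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in network

# Placeholder last octets: the real ones are not shown in the post.
ip_a = "74.87.56.0"
ip_b = "75.85.156.0"

print(same_prefix(ip_a, ip_b, 16))  # same /16?
print(same_prefix(ip_a, ip_b, 8))   # same /8?
```

Both checks come out negative for these two ranges, since even the first octets differ (74 versus 75).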

[edited by: phranque at 1:57 am (utc) on Oct 28, 2018]
[edit reason] unlinked url [/edit]

lucy24

2:29 am on Oct 28, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmm.

The multiple-IP thing is more common with mobiles, where they can hop randomly around different IPs. But I've occasionally seen the same thing with non-mobiles. (Plays havoc with my log-wrangling: Oh, I see, this isn't a robot, it's pages and supporting files distributed among two dissimilar IPs belonging to the same entity.) The timing is admittedly odd; one possibility that comes to mind is remote caching on the ISP's part--but why on a first request? People in Hawaii aren't accustomed to waiting 12, 17 or 29 seconds for a page to load up, are they?

Was there anything more to the referer than google dot com? If so, the Forums ate it.

Did the initial page request come with any cookies? Has the visitor ever been at your site before? The easiest way to tell is from analytics. The UA is probably too generic to be useful, unless you've got a very small site.

:: business with logs ::

I remember this one jumping out at me a few days ago because of the total disparity of
200.68.129.abc = 187.237.25.abc
(Both, it turns out, are in Mexico.) Desktop, not mobile. And, now that you mention it, it did take them eleven seconds to get everything. Analytics say they reopened the same page a day or two later. Yes, there are robots that plant fake analytics hits--but they don't normally bother with images and stylesheets on their scouting visit, instead proceeding directly to scripts.

Huh. Never realized that the Edge UA string also contains the element “Chrome” (and hence also the element “Safari”), same as the webkit version of Opera. But I digress.

keyplyr

3:55 am on Oct 28, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Has all the signs of being a bot, but to be sure, you'll need to look at the request headers. Humans will use language fields, bots will not (usually.)

There are a couple discussions linked from this page:
Blocking Methods [webmasterworld.com]

lucy24

4:29 am on Oct 28, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Humans will use language fields, bots will not

In the one I was looking at, the language header--which would have got the request denied if it had been absent--is a wholly plausible-for-the-source en-gb first, es-mx second. Admittedly, if it were a botnet running off infected machines, they could camouflage themselves by sending the infected browser's default headers.

I'm confident mine's a human--weirdly spaced-out requests and all. Cross-checking the piwik id cookie shows that they previously visited in September, viewing a number of pages after initially arriving from {up-and-coming search engine}, which tends to send good visitors. (“Good” in this context means that they stick around, which in turn means that the search-engine offering was appropriate to whatever they were looking for.)

:: wandering off to explore search-engine offerings ::

keyplyr

7:59 am on Oct 28, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So what I'm suggesting is that SumGuy check the request headers.

SumGuy

2:16 pm on Oct 28, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



There are about a dozen pieces of information, separated by commas, on each line of the IIS4 logs. Beyond the file being requested, the user-agent, and the referrer there are always, always a couple of blank fields at the end that I don't know what's supposed to be there. All this is in addition to the date, time, IP address, bytes transferred, length of time to complete the request, HTTP result code, some Win32 status code, etc. That's all I see - that's what I'm working with.

> Has all the signs of being a bot

How does a bot perform this level of http-request coordination across two vastly different IP addresses? Why would it do that? It just makes it look more suspicious than if the entire hit happened from a single IP. In fact, this is exactly why I noticed this hit - because when I bring the log file into Excel and sort based on IP, this initially looked like two hits from different IP's. It was only when I looked at the files being requested by each IP that I saw these two hits did not make any sense on their own. So I went back to the original log and looked at the time-sequence and saw how coordinated they were.
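
Sorting by IP hides this kind of pattern; grouping by timestamp and user-agent exposes it. The sketch below is one possible way to do that, not the method actually used in this thread. It assumes the log has already been reduced to (ip, timestamp, user_agent) tuples; the function name, sample addresses (documentation ranges), timestamps and user-agent labels are all made up.

```python
from collections import defaultdict

def find_interleaved_hits(rows):
    """Return {(timestamp, user_agent): [ips]} for every second in which
    the same user-agent string arrived from more than one IP address."""
    seen = defaultdict(set)
    for ip, timestamp, user_agent in rows:
        seen[(timestamp, user_agent)].add(ip)
    # Keep only the seconds where two or more distinct IPs share a user-agent.
    return {key: sorted(ips) for key, ips in seen.items() if len(ips) > 1}

# Hypothetical sample data, one tuple per logged request.
sample = [
    ("198.51.100.7", "10/27/18 20:15:00", "UA-1"),
    ("203.0.113.9", "10/27/18 20:15:00", "UA-1"),
    ("198.51.100.7", "10/27/18 20:15:12", "UA-1"),
    ("192.0.2.44", "10/27/18 20:15:12", "UA-2"),
]

result = find_interleaved_hits(sample)
for (timestamp, user_agent), ips in result.items():
    print(timestamp, user_agent, ips)
```

With one-second log precision and a common user-agent, unrelated visitors can collide by chance, so a match is only a prompt to go look at the requested files in sequence.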

graeme_p

3:31 pm on Oct 28, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A bot using a service to avoid bot blocking? They will rotate requests between IPs so the right combination of service and bot using it could produce it, perhaps?

If both IPs belong to the same ISP, maybe it's something in their setup. A proxy?

SumGuy

4:28 pm on Oct 28, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



> A bot using a service to avoid bot blocking?

You'd think that a bot could replicate exactly a web-hit from a human using a browser: use a legit user-agent, download all the files that a browser would, in the correct order, don't just scrape html pages, etc. It would be easier to do that from a single trojanized PC sitting behind a single IP than to coordinate essentially the same thing - grab all pages, even gifs - from 2 different IP's. Presumably the bots operating from the 2 IP's would need to be in intimate contact with each other to pull this off in the way I see it play out in the logs.

In fact, if this hit had happened from a single IP, I'd look at all this and say yup - that's a real person. I see a legit user-agent, I see the referral from google, I see all the right files being downloaded in the correct order, etc. I guess this could still be a real person - if they are being served by two road-runner services that are being combined in their router (dual-homed). I understand that it is possible to get two IP addresses from your (residential) ISP on the same account/service.

Alternatively, perhaps they have both an IPv4 and an IPv6 address at home, and their browsing requests go out on both, but since my server is only IPv4, at some point their IPv6 packets have to be gated to an IPv4 address, hence I see their hit as coming from 2 IPv4 IP's?

lucy24

4:35 pm on Oct 28, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Beyond the file being requested, the user-agent, and the referrer there are always, always a couple of blank fields at the end that I don't know what's supposed to be there

Different server types put their logs in different orders, but there will always be a couple of blank fields for things that could be present, but aren't. (For example, the typical Apache log entry contains the element - - representing two login-related fields that aren't needed in most requests.)

Logging headers is a separate process. There has been some discussion of it on the Apache subforum, involving a basic bit of php, but of course you'd need a different formula for IIS.

> If both IPs belong to the same ISP, maybe it's something in their setup. A proxy?

That would have been my first-choice interpretation too: the ISP is doing something--possibly involving a phrase like “load balancing”--that makes sense to them, even if it makes sense to nobody else and creates confusion for the four people on the planet who actually look at their raw access logs ;)

keyplyr

7:36 pm on Oct 28, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sorry, I don't see what the mystery is.

Just a bot coming from cloud hosting where the IP address will vary within the range.

Almost all ISPs also have cloud services now.

SumGuy

11:43 pm on Oct 28, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



> Just a bot coming from cloud hosting where the IP address will vary within the range.
> Almost all ISPs also have cloud services now.

If Time Warner / Spectrum offers cloud / hosted services, I don't see it on any webpage I can find.

The IP's come up as residential according to the various IP-lookup portals I'm using. The first IP looks like it's in California - but tracert indicates that it's in Hawaii. The second one is no doubt in Hawaii.

If both are part of ranges assigned to hosting services, then I shouldn't be the only one seeing hits from them.

keyplyr

11:53 pm on Oct 28, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Cloud = Dynamic. Most ISPs will use dynamic IP assignments.

I say I don't see a mystery not because I want to disavow what you think is happening, but because I see this every single day, and much more. I scan through logs several times a day on my 3 sites as well as at least once or twice a week on the several sites I manage for others. The more you do this, the more you acquire a feel for what is happening.

I suggested you look at the request headers. I gave you a link that leads to a discussion on how that is done. Grepping request headers (not access log entries) will pretty much tell you if the request(s) came from a human or a bot, since bots do not usually send language headers.

But as I've said, I see this as pretty common and nothing to be concerned about.

SumGuy

1:38 am on Oct 29, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



This is all I have:

ClientIP, UserName, RequestDate, RequestTime, Service, ServerName, ServerIP, ProcessingTime, BytesReceived, BytesSent, Status, Win32Status, Operation, TargetURL, UserAgent, UrlReferer, Parameters

The last item - Parameters - is always blank.
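
For anyone who wants to take this format apart outside of Excel, here is a rough Python sketch. It assumes only the field order listed above. One wrinkle: user-agent strings often contain commas themselves ("KHTML, like Gecko"), so a naive comma split shifts the later columns. The sketch reads the fixed fields from the left and the last two from the right, and treats whatever is left in the middle as the user-agent. It assumes the referrer and Parameters contain no commas. The sample line is invented, not copied from a real log.

```python
LEADING = [
    "ClientIP", "UserName", "RequestDate", "RequestTime", "Service",
    "ServerName", "ServerIP", "ProcessingTime", "BytesReceived",
    "BytesSent", "Status", "Win32Status", "Operation", "TargetURL",
]
TRAILING = ["UrlReferer", "Parameters"]

def parse_logfilter_line(line):
    """Split one comma-separated log line into a dict keyed by field name."""
    parts = [part.strip() for part in line.rstrip("\r\n").split(",")]
    n_lead, n_trail = len(LEADING), len(TRAILING)
    if len(parts) < n_lead + 1 + n_trail:
        raise ValueError("too few fields in log line")
    record = dict(zip(LEADING, parts[:n_lead]))
    # Everything between the fixed left and right fields is the user-agent,
    # rejoined with ", " to restore any commas it contained.
    record["UserAgent"] = ", ".join(parts[n_lead:-n_trail])
    record.update(zip(TRAILING, parts[-n_trail:]))
    return record

# Invented example line following the field order above.
sample_line = (
    "192.0.2.10, -, 10/27/18, 20:15:00, W3SVC1, WEB, 192.0.2.1, 15, 300, "
    "5120, 200, 0, GET, /default.html, Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36, "
    "https://www.google.com/, "
)

record = parse_logfilter_line(sample_line)
print(record["ClientIP"], record["TargetURL"], record["UrlReferer"])
print(record["UserAgent"])
```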

What-ever the "http request headers" thing is, I don't see it in my logs. Where is it supposed to show up?

keyplyr

1:59 am on Oct 29, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> What-ever the "http request headers" thing is, I don't see it in my logs

> request headers (not access log entries)


Again, please follow directions in the link I gave you :)

[webmasterworld.com...]

lucy24

9:34 pm on Oct 29, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> please follow directions

Please keep in mind that on these forums, almost all discussions of header logging have involved php on Apache servers. SumGuy is on IIS and will therefore need to follow entirely different steps.

SumGuy, here's an illustration of the difference between the two types of logs. The access log entry will look a little different from what you're used to, since it's Apache, but you should recognize the pieces.

Server access log:
46.229.168.143 - - [28/Oct/2018:00:05:01 -0700] "GET /ebooks/salmonia/ HTTP/1.1" 200 187103 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)" 

Logged headers using code originally developed by incrediBill and mildly tweaked by me:
2018-10-28:00:01:14
URL: /ebooks/salmonia/
IP: 46.229.168.153
User-Agent: Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)
Connection: close
Accept-Encoding: gzip,deflate
Accept: text/html
Host: example.com
This is an authorized robot; humans will typically include a few more header fields, while malign robots typically have fewer header fields.
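
To make "the pieces" of the Apache access-log entry explicit, the sketch below splits that same line with a regular expression. The pattern is a common approximation of the combined log format rather than anything taken from this thread, and the group names are my own labels.

```python
import re

# Approximate pattern for Apache's combined log format.
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = (
    '46.229.168.143 - - [28/Oct/2018:00:05:01 -0700] '
    '"GET /ebooks/salmonia/ HTTP/1.1" 200 187103 "-" '
    '"Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)"'
)

match = COMBINED.match(line)
entry = match.groupdict() if match else {}
for name, value in entry.items():
    print(f"{name}: {value}")
```

The two "-" fields after the IP are the blank login-related fields mentioned earlier; the third "-" is an empty referer.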

keyplyr

10:54 pm on Oct 29, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@lucy24 - thanks for the FYI

SumGuy

12:48 pm on Oct 30, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



I'm going to look into that. Will it work on IIS4? I guess I will have to activate / allow server-side scripting?

Do all browsers follow or obey the server directive to transmit that information? Are there add-ons or settings that disable it?

TorontoBoy

2:24 pm on Oct 30, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



It looks like an http request from a person, but packet switched. What I mean is that once the browser knows what it wants, it asks multiple resources/IPs to get them, and then reassembles them in the proper order once received. A packet, in this case, is a file such as an image? Packet switching has the benefits of possibly being faster, especially if one packet is corrupted and needs to be resent. It is also more secure, due to using multiple resources/IPs.

But I have never heard of packet switching on a human browser, using multiple IPs and then reassembling everything before rendering it for human eyes.

I do agree that the request header's language field would usually show whether it was a human or a bot. There has got to be a way to see request headers on IIS. These are internet standard headers, and not specific to any server type.

lucy24

5:42 pm on Oct 30, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Do all browsers follow or obey the server directive to transmit that information? Are there add-ons or settings that disable it?

Headers are part of a request. If they aren't sent, no valid request is received. In the example I quoted, everything after the first three lines (timestamp, IP, requested resource) is a named header. Certain header fields, like User-Agent and Referer, are logged by the server. I think you could theoretically--if it's your own server--customize logs so all header fields, or at least all standard ones, are shown all the time, but what a mess that would be.

In particular:

All requests include a hostname. (They don't have to, but if you are on shared hosting, the request will never reach your site otherwise.)
All legitimate requests include a User-Agent.
Some requests include a Referer. Humans can choose not to send one, based on browser settings.
Most requests include a Connection: (There was some discussion of this in one of the recent threads about logging headers, leading to the conclusion that my host supplies one when absent, and gives them all the same value.)
All human requests include three headers with names beginning in “Accept”: Accept itself, Accept-Language and Accept-Encoding. Some older browsers may still send an Accept-Charset, but I see it more often in Asian robots, including legitimate ones.
Some search engines still send a Pragma: header, although it was supposed to have been superseded by Cache-Control: years ago.
Humans sometimes send DNT: (Do Not Track) and/or Upgrade-Insecure-Requests:
There are a few other fairly common headers that I can't remember at the moment. While, on average, a human request includes more headers than a robot, a few header fields are found only in robotic requests. Some are obvious typos, while some may be ancient, long-deprecated headers that human browsers stopped sending in 2004.

Apart from Referer and Do not track, header settings are generally out of human control. If you've ever run the w3 link checker, you may remember that it lets you choose whether it sends the Accept-Language header.

For comparison purposes, here is the complete header set of a malign robot (picked at random):
2018-10-28:02:32:36
URL: /wordpress/wp-admin/
IP: 50.62.160.99
Connection: close
Host: example.com
I do not need to cross-check server access logs to know that this request was blocked.
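
The rules of thumb above can be turned into a rough check. This is only a sketch of the heuristic as described in this post, not a reliable bot detector: a botnet sending an infected browser's default headers would pass it. The function name and the browser-like sample are hypothetical; the other two samples reuse the header names from the logged examples in this thread.

```python
ACCEPT_HEADERS = ("accept", "accept-language", "accept-encoding")

def looks_like_browser(headers):
    """Heuristic: Host, User-Agent and all three Accept* headers are present."""
    names = {name.lower() for name in headers}
    if "host" not in names or "user-agent" not in names:
        return False
    return all(header in names for header in ACCEPT_HEADERS)

# Header names from the authorized-robot example earlier in the thread.
semrush = {
    "User-Agent": "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)",
    "Connection": "close",
    "Accept-Encoding": "gzip,deflate",
    "Accept": "text/html",
    "Host": "example.com",
}

# Header names from the malign-robot example above.
malign = {"Connection": "close", "Host": "example.com"}

# Hypothetical browser-like request, for contrast.
browser = {
    "Host": "example.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html",
    "Accept-Language": "en-gb,es-mx",
    "Accept-Encoding": "gzip, deflate",
}

for label, headers in (("semrush", semrush), ("malign", malign), ("browser", browser)):
    print(label, looks_like_browser(headers))
```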

Edit: I got curious and looked it up. In Apache--and presumably also in IIS--you can customize your log settings to include the value of any named header. That is, you can't say “dump all headers into the log”, but you can ask for as many specific headers as you like. Moreover, you can log the content of any named response headers as well.

[edited by: lucy24 at 5:55 pm (utc) on Oct 30, 2018]

lucy24

5:50 pm on Oct 30, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> I have never heard of packet switching on a human browser

“Packet switching” may not be the operative term here. Each file is a separate request, and it's up to the ISP how to handle successive requests that ultimately go to the same human computer. At a minimum, a page request is separate from supporting-file requests, since the browser doesn't know what other files to ask for until it has read the page. You may remember that AOL requests--especially AOL dialup--came from all over the map. Requests from mobiles are also likely to be widely distributed. (I would prefer to think this is not because the visitor is surfing the web while driving, hopping along from one tower to the next. If you happen to live equidistant from two or more cell towers, two successive requests may end up following entirely different routes.)

SumGuy

1:19 pm on Oct 31, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



[microwww.com...]

LogFilter is an ISAPI log filter for the Microsoft IIS 3 webserver that will add the USER_AGENT and URL_REFERER fields to your Microsoft IIS log files.

IIS Standard log Format:
ClientIP, UserName, RequestDate, RequestTime, Service, ServerName, ServerIP, ProcessingTime, BytesReceived, BytesSent, Status, Win32Status, Operation, TargetURL, Parameters

LogFilter log Format:
ClientIP, UserName, RequestDate, RequestTime, Service, ServerName, ServerIP, ProcessingTime, BytesReceived, BytesSent, Status, Win32Status, Operation, TargetURL, UserAgent, UrlReferer, Parameters

==================

I downloaded that filter (logfilter.dll) a long time ago and was able to add/integrate it into the IIS4 control-panel thing. I wanted to get the user-agent and referrer info that the stock IIS4 wasn't giving me. Other than that, I know of no way (from a control-panel / GUI point of view) to specify which items get logged.

lucy24

6:16 pm on Oct 31, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Again I don't know about IIS, but Apache log formats can only be set in the config file, which means by the server administrator. On shared hosting there is no way for individual sites to change what gets logged, or how.

Header logging is a completely separate process because it isn't done by the server. It's simply another php function that executes when a page--potentially including your custom error pages--is prepared. As far as the server is concerned, it's no different from putting together a page with .php extension; the output just happens to get sent to a text file instead of back to the visitor.
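
The php code discussed on these forums is not reproduced here, but the same idea can be sketched in Python as a WSGI wrapper: the application code writes each request's headers to a text file, independently of the server's access log. This is an analogue for illustration only. It will not run on IIS4, and the function name, log path and output layout (modelled loosely on the logged-header examples above) are assumptions.

```python
import time

def header_logger(app, log_path="headers.log"):
    """Wrap a WSGI app so every request's headers are appended to log_path."""
    def wrapped(environ, start_response):
        lines = [
            time.strftime("%Y-%m-%d:%H:%M:%S"),
            "URL: " + environ.get("PATH_INFO", ""),
            "IP: " + environ.get("REMOTE_ADDR", ""),
        ]
        # WSGI exposes request headers as HTTP_* keys, e.g. HTTP_ACCEPT_LANGUAGE.
        for key, value in environ.items():
            if key.startswith("HTTP_"):
                name = key[5:].replace("_", "-").title()
                lines.append(f"{name}: {value}")
        with open(log_path, "a", encoding="utf-8") as handle:
            handle.write("\n".join(lines) + "\n\n")
        # Hand the request on unchanged; logging does not alter the response.
        return app(environ, start_response)
    return wrapped
```

A site would wrap its existing application object once, for example `application = header_logger(application)`, and then read the resulting text file alongside the normal access log.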