Is there legitimate traffic from web hosts?

Peter_S

11:23 am on May 16, 2017 (gmt 0)

Hi,

Since my site is entirely PHP powered, all requests are handled by a PHP script, and on top of these scripts, I filter the access to block "what I think" is unnatural traffic (unwanted bots, as well as suspect requests).

Among all the filters I apply, there is one where I resolve the IP address to a hostname and test it. If the hostname includes a "famous" webhost domain (for example amazonaws, ovh, etc.) or a generic word like "server", "hosted", "vps", etc., I block the access.
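For illustration only, the hostname test described above might look roughly like the following. This is a minimal sketch in Python rather than the PHP the site actually runs. The function names are hypothetical, and the two lists hold only the examples named in the post, not the real filter.

```python
import socket
from typing import Optional

# Assumed lists: only the examples named in the post, not the poster's real filter.
WEBHOST_DOMAINS = ("amazonaws", "ovh")
GENERIC_WORDS = ("server", "hosted", "vps")


def looks_like_webhost(hostname: str) -> bool:
    """Return True if the hostname contains a listed webhost domain or generic word."""
    name = hostname.lower()
    return any(token in name for token in WEBHOST_DOMAINS + GENERIC_WORDS)


def resolve_hostname(ip: str) -> Optional[str]:
    """Reverse-resolve an IP address; return None when the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def should_block(ip: str) -> bool:
    """Block only when a hostname was found and it matches one of the lists."""
    hostname = resolve_hostname(ip)
    return hostname is not None and looks_like_webhost(hostname)
```

Plain substring matching is blunt: a short token such as "ovh" or "vps" can also appear inside an unrelated hostname, so any match list of this kind would need checking against real logs.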

I've been doing this for years, but I wonder if this is a good or a bad idea. Is there legitimate traffic that can come from these webhosts? I always assumed it was not humans, and that these hits were totally useless or, at worst, came from scrapers.

Thanks,

lucy24

5:54 pm on May 16, 2017 (gmt 0)

“It depends.”

Some law-abiding crawlers are distributed so they could come from anywhere, and you don’t necessarily want to block them universally. Everyone started out small; it’s rare for an entity to have its very own allocated range from Day 1.

When I saw the subject line I thought it would be a question about traffic from your own host, which can also occur. For example, I've lately discovered that one category of Piwik requests comes from my server's own IP rather than from the human making the visit.

keyplyr

9:42 pm on May 16, 2017 (gmt 0)

@Peter_S - many server farms (Amazon, OVH, etc.) host bots that may be a huge benefit to your interests. You need to closely watch your raw server logs, and identify & research these agents to decide if they should get access.

Also, many apps make their home in the ranges that you're currently blocking. This is another category to be considered. Social Media is a huge source of traffic. If you block the apps from access to your server, they will not present your site or its assets to their users.

Also, schools will often lease ranges from larger hosts. This can be a big source of traffic.

Blocking server ranges may or may not be an effective defense against unwanted activity at your web site. Hosting companies lease ranges to a wide variety of clients, not all of them necessarily negative to your site's interests. Some may be extremely helpful.

Blocking server ranges is not something you do then forget about. You must be prepared to watch your server logs each day with diligent focus to see just who exactly is being blocked. This takes consistent maintenance.

tangor

12:38 am on May 17, 2017 (gmt 0)

If not consistent, as keyplyr correctly advises, at least make a thorough examination of your raw logs for 30 days or more to identify what kind of traffic is hitting your site. Some server farms just aren't worth the effort to poke holes in; others are. You won't know which until you see what your traffic is like over an extended period of time.

You can't "set it and forget it".

What works in January might be completely awful by July.

phranque

1:05 am on May 17, 2017 (gmt 0)

practically speaking, most web hosts configure for account holders a hostname in a form similar to username.webhost.com which could be a legitimate site.

keyplyr

1:22 am on May 17, 2017 (gmt 0)

practically speaking, most web hosts configure for account holders a hostname in a form similar to username.webhost.com which could be a legitimate site.

Sadly, a very small number of my allow/disallow UAs do that; only the well established companies that do long term leases.

Since cloud computing became popular, most just do short-term, ad hoc leases. They get free panels & set-ups, paying only for storage & data transfer, which is scaled per node.

From what I see, it's not uncommon for an actor to come from a half-dozen different ranges owned by the same host, or a couple different hosts, all in a few seconds... not a botnet, just cloud accounts.

Peter_S

11:16 am on May 17, 2017 (gmt 0)

Thank you all for your replies.

As for crawlers, I process them separately. Besides the famous crawlers (Google, Bing, ...), to qualify as a potentially "good" crawler, a bot has to have a URL as part of its user agent, which has to point to an existing page; it has to have requested the robots.txt file less than 2 weeks ago (my robots.txt file is in fact also a PHP script, so I record hits); and it has to come from the same /24 IP range as the robots.txt request. Then I watch them to see whether their behavior is acceptable. Also, I immediately ban all IPs with a Googlebot or Bingbot UA that do not come from known Google / Microsoft IP ranges (and this happens often).
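As a rough sketch of the crawler criteria above, written in Python for illustration rather than the site's PHP, the user-agent URL test, the two-week robots.txt window and the /24 comparison could be combined as below. The function names and the `robots_hits` list are assumptions. The sketch assumes IPv4 addresses and leaves out the check that the URL points to an existing page, which would need an HTTP request.

```python
import ipaddress
import re
from datetime import datetime, timedelta

ROBOTS_MAX_AGE = timedelta(days=14)           # "less than 2 weeks ago"
URL_IN_UA = re.compile(r"https?://[^\s;)]+")  # any http(s) URL inside the UA string


def same_slash24(ip_a: str, ip_b: str) -> bool:
    """True when both IPv4 addresses fall in the same /24 network."""
    network = ipaddress.ip_network(f"{ip_a}/24", strict=False)
    return ipaddress.ip_address(ip_b) in network


def is_candidate_crawler(user_agent, ip, robots_hits, now):
    """robots_hits: list of (ip, datetime) pairs for recorded robots.txt requests."""
    if not URL_IN_UA.search(user_agent):
        return False
    # Require a recent robots.txt request from the same /24 as this request.
    return any(
        now - seen < ROBOTS_MAX_AGE and same_slash24(ip, hit_ip)
        for hit_ip, seen in robots_hits
    )
```

A bot that passes this test is only a candidate; the behavior watching described in the post still has to happen afterwards.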

Thank you for pointing out apps; I'll work on this and refine my blocking criteria.

csdude55

12:18 am on May 20, 2017 (gmt 0)

On my end, I set up a list of "approved" GET variables, and if a request is made with any other GET variable then I add their IP to a MySQL table called "blacklist". Then I block any access from that IP for 2 weeks.
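A minimal sketch of that approach, in Python for illustration: the in-memory dictionary stands in for the MySQL "blacklist" table, and the approved parameter names are made up, since the post does not list them.

```python
from datetime import datetime, timedelta

APPROVED_PARAMS = {"page", "id", "q"}  # hypothetical names, not from the post
BAN_DURATION = timedelta(weeks=2)

blacklist = {}  # ip -> ban expiry time; stands in for the MySQL table


def is_banned(ip, now):
    """True while the IP has an unexpired ban."""
    expiry = blacklist.get(ip)
    return expiry is not None and now < expiry


def check_request(ip, query_params, now):
    """Return True if the request may proceed; ban the IP on an unapproved GET variable."""
    if is_banned(ip, now):
        return False
    if set(query_params) - APPROVED_PARAMS:
        blacklist[ip] = now + BAN_DURATION
        return False
    return True
```

Storing an expiry time rather than a flag means the two-week block lapses on its own, without a separate clean-up job.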

I tried blocking unwanted bots before, but then saw a decrease in real traffic, too. I could only guess that I was inadvertently blocking people who were using a proxy without realizing it, or maybe blocking a search engine crawler.

keyplyr

12:33 am on May 20, 2017 (gmt 0)

For those who choose, here is a list of Blocking Methods [webmasterworld.com]

lucy24

1:05 am on May 20, 2017 (gmt 0)

But that's not really methods is it, more a range of possible reasons. How-to and why-to are almost entirely different questions. Rarely, as in csdude's example, you can combine them: the action of asking for something you're not supposed to triggers the reaction of an immediate block. But that's probably more effort than most people are prepared to make, unless you either have a very large site or you really enjoy writing code.

keyplyr

2:48 am on May 20, 2017 (gmt 0)

But that's not really methods is it, more a range of possible reasons.

They are methods, just without the application (code), which should be discussed in the various code forums depending on your favorite programming language.