Hughes Network Systems cause the most 404 errors

They can put satellites in space but can't parse JavaScripts

         

KenB

2:37 am on Jan 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've been going through my 404 error log looking for bad bots and noticed something interesting. The satellite-based ISP Hughes Network Systems is the source of more 404 errors on my site than almost all other sources combined. Yes, even more than MSNBot, which is 404 slap happy.

Even though the user agent strings look like normal web browsers of all flavors, it appears that they have an abnormal propensity to screw up web addresses assembled by JavaScript. Quite literally almost all of the 404 errors caused by their users are the result of JavaScript URLs not being put together and requested properly.

It absolutely amazes me that the same guys who put satellites into space can't get their systems to parse JavaScripts properly. Maybe the problem is that JavaScript isn't rocket science.

jdMorgan

3:44 am on Jan 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



They also do weird stuff like not following redirects -- or they'll follow them, but only after loading the whole page.

The problems all have a common cause: The fact that satellite providers proxy not only the client connection, but the client itself. This is in an effort to compensate for the 'long fat pipe' -- or more accurately, the very very long, and only-somewhat fat pipe.

Since geosynchronous satellites orbit high (22,300 miles) above the earth, signals to/from a satellite have a significant travel time. Let's say you're a satellite user, and you just ping an IP address that is actually located at the ISP's satellite-receiving facility. Your ping request has to travel up to the satellite, then down to the earth station, and then the ping response has to travel back up to the satellite, and then back down to your satellite antenna. The total "air-time" of this signal is approximately 480 milliseconds, or almost one-half second (derived from the total 89,200 miles divided by the 186,000 miles-per-second speed of light). For comparison, I can ping almost any domestic server in less than one-twelfth of that time...
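
The arithmetic above can be checked with a few lines of JavaScript. This is only a rough sketch: it assumes the satellite is directly overhead (so each leg is exactly 22,300 miles) and ignores processing and queuing delays. The function and constant names are made up for illustration.

```javascript
// Rough round-trip "air time" for a ping over a geosynchronous satellite link.
// Assumptions: satellite directly overhead, no processing or queuing delay.
const ALTITUDE_MILES = 22300;          // approximate geosynchronous altitude
const LIGHT_MILES_PER_SECOND = 186000; // approximate speed of light

// hops: number of ground-to-satellite or satellite-to-ground legs
function airTimeMs(hops) {
  const totalMiles = hops * ALTITUDE_MILES;
  return (totalMiles / LIGHT_MILES_PER_SECOND) * 1000;
}

// A ping to the ISP's earth station: up, down, up, down = 4 legs
console.log(Math.round(airTimeMs(4)) + ' ms'); // prints "480 ms"
```

Every extra request that has to cross the link adds at least another two legs each way, which is why the providers work so hard to avoid them.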

To compensate for this, and to limit the number of client requests that travel through the satellite itself, satellite ISPs actually proxy the client's functions. When the client requests a Web page, the ISP captures and stores it in a server, analyzes it, and looks for all of the objects that it includes, scanning the HTML for <img> and <link rel="xyz"> tags, etc. Then they issue requests 'to the Web' for all of those objects, collect the responses, bundle up the page and all of its included objects, and send the whole mess all at once back to the client on the other end of the satellite link.

On the client end, they have a little proxy host running. This host accepts the 'bundle' of page-plus-objects from the ISP's satellite proxy-client, and passes the originally-requested HTML page on to the 'real' client -- the user's browser. Then as soon as that browser parses the HTML and starts requesting images, stylesheets, and all the other included objects, the local proxy host simply hands back all the objects that the proxy client has already prefetched.
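
In spirit, the ISP-side scan described above might look something like the following. This is a guess at the general idea, not Hughes' actual code: the function name, the regular expression, and the sample markup are all made up for illustration, and a real accelerator would presumably use a proper HTML parser.

```javascript
// Naive prefetch scan: collect URLs of objects referenced by an HTML page.
// Hypothetical sketch only; regexes are not a robust way to parse HTML.
function findPrefetchUrls(html, baseUrl) {
  const urls = [];
  // Match src on <img>/<script> and href on <link>
  const pattern = /<(?:img|script)\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']|<link\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    const ref = match[1] || match[2];
    // Resolve relative references against the page URL
    urls.push(new URL(ref, baseUrl).href);
  }
  return urls;
}

const page = '<link rel="stylesheet" href="/style.css">' +
             '<img src="logo.gif"><script src="app.js"></script>';
console.log(findPrefetchUrls(page, 'http://example.com/foobar/index.html'));
```

A scan like this only sees URLs that are spelled out in the markup. Anything that is built or changed by script after the page loads is invisible to it.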

A bit of thinking about this will reveal some of the reasons that this kind of system has trouble with immediately following redirects and with properly-handling client-side scripting -- I'd imagine that sites using JS-heavy AJAX with lots of 'events' are probably really difficult to handle in such a system...

Anyway, I had satellite internet service for several years before any alternatives became available in my area, and in debugging various problems (both for myself and for other Webmasters) I learned a bit about how these systems work (and why), so I hope that's useful.

I think it probably *is* easier to design and build the satellite and to launch it and control it, than to try to reliably emulate every possible kind of HTTP and client-side scripting function using a reasonably-small client-side and ISP-side software package...

Jim

KenB

1:18 pm on Jan 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I saw a similar explanation about the latency (unrelated to my 404 issue) from a Hughes guy in another forum. What I find peculiar about this is that it seems like everyone from spammers to Google has figured out how to rebuild URLs in JavaScript. We're not talking complex mathematical calculations here; we're talking simple stuff like:

// Build an <img> tag whose URL is assembled from parts at run time
function Murl(c) {
    var a = 'http://example.com';
    var b = '/foobar/';
    return '<img src="' + a + b + c + '">';
}
document.write(Murl('some.gif'));

Obviously the code does more than just this, but this boils it down to its essence. Instead of executing the function to build the URL, Hughes just requests "some.gif".

Like I said, even spammers have figured out how to do rudimentary parsing of JavaScript to rebuild proper URLs. It would certainly improve their user experience.

Heck I bet they could license something that would do this off of Google or someone.

phranque

1:52 pm on Jan 12, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



the problem is probably that the javascript is executed in the user agent (browser) while the proxy server sees urls before they have been massaged by javascript.

KenB

2:51 pm on Jan 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The URLs are IN the JavaScript files exclusively; they are not in the HTML. The Hughes proxies are attempting to parse the JavaScript files just enough to request any string with a typical filename extension (e.g. "some.gif" from above).

If they didn't try to parse the JavaScript files at all, or if they parsed them correctly, the 404 issue would go away. If they are going to try to read URLs from JS files, then they need to pseudo-execute the JS file enough to actually know which URLs need to be pulled and which shouldn't be pulled unless executed by the end user. As it is, they are wasting a lot of their server resources and ours on stupid stuff.
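
Here is a sketch of the kind of extension-based scan described above. It is a guess at the behavior based on what shows up in the logs, not Hughes' actual code. It pulls any quoted string ending in a common file extension out of a script without running it. The function name and the extension list are made up for illustration.

```javascript
// Hypothetical "scan, don't execute" approach: grab any quoted string that
// looks like a filename, without running the script that uses it.
function naiveScan(jsSource) {
  const pattern = /["']([^"']+\.(?:gif|jpe?g|png|css|js))["']/gi;
  const found = [];
  let match;
  while ((match = pattern.exec(jsSource)) !== null) {
    found.push(match[1]);
  }
  return found;
}

// The script from earlier in the thread, as a proxy would see it (plain text)
const source = [
  "function Murl(c){",
  "  var a='http://example.com';",
  "  var b='/foobar/';",
  "  return '<img src=\"'+a+b+c+'\">';",
  "}",
  "document.write(Murl('some.gif'));"
].join('\n');

console.log(naiveScan(source)); // the bare filename, not the full URL
```

The scan finds "some.gif" and nothing else. Resolved against the page's own location, that becomes a request for a file that does not exist, whereas a browser that actually runs the script would request http://example.com/foobar/some.gif.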

thecoalman

3:19 pm on Jan 22, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's not just JS. On my forum all images are served through file.php?id=#*$!X . file.php is sent with an image header (or whatever the content type is) and the image is sent raw. You can, for example, right-click the image after it has downloaded and save it with the original filename.

Since nearly all images are served through file.php, anyone using Hughes will cause numerous requests for files that do not exist. Instead of requesting file.php?id=#*$!X I get requests for image.jpg.

KenB

4:11 pm on Jan 22, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I did end up making direct contact with some folks at Hughes. They don't like the 404 errors any more than we do. They said they are in the final stages of rolling out a new accelerator, which should reduce 404 errors by as much as 90%. They said the new accelerator should be out within the next month or two (but we all know how software updates always get pushed back).

What needs to be understood is that the latency on a satellite connection is already tremendous (e.g. >400ms) due to the distances signals get sent. What Hughes is attempting to do is improve response times by prefetching objects like images, javascripts, stylesheets, etc. while the parent HTML file is being transmitted to the user. By doing this they can eliminate the additional latency of requesting and transmitting the supporting objects between the actual web server and Hughes' network. The end result is faster loading pages and a better user experience.

One way we can really help our users on ISPs like Hughes or in remote corners of the world is to make sure we are making proper use of the Expires and Cache-Control headers. This will allow users' browsers and their ISPs' caching servers to properly cache and prefetch supporting objects.
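
As a rough sketch of what that means in practice, the helper below builds caching headers for a static object such as an image or stylesheet. The function name and the 30-day lifetime are arbitrary choices for illustration; how the headers actually get attached depends on your server.

```javascript
// Build Expires and Cache-Control headers for a static object.
// Hypothetical helper; the lifetime is an assumption, tune it per object type.
function cachingHeaders(maxAgeSeconds, now = new Date()) {
  const expires = new Date(now.getTime() + maxAgeSeconds * 1000);
  return {
    // "public" lets shared caches (such as an ISP proxy) store the response
    'Cache-Control': 'public, max-age=' + maxAgeSeconds,
    // HTTP dates are always expressed in GMT
    'Expires': expires.toUTCString()
  };
}

const THIRTY_DAYS = 30 * 24 * 60 * 60;
console.log(cachingHeaders(THIRTY_DAYS));
```

Sending both headers covers older caches that only understand Expires as well as newer ones that prefer Cache-Control.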

The faster our pages render in the end user's browser, the better their experience will be on our websites so it is really in our best interest to take advantage of the efforts ISPs like Hughes put into prefetching and caching objects.

gouri

3:19 pm on Feb 2, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I read on the internet that Hughes is a provider of satellite internet services.

I was also trying to find out if they have a presence in China but was not able to. I read that they have a presence in North America.

Would anyone happen to know if Hughes provides satellite internet services in China?

The reason I ask is that a lot of my 404s every day are coming from there.

This might help to tie things together.

KenB

4:14 pm on Feb 2, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Given the way satellite service works, you can't actually be sure what country a satellite user comes from. I don't know that Hughes breaks their IP address ranges down by country. If they don't, geolocation APIs might declare all users either as U.S. traffic or as unknown. For instance, MaxMind GeoIP uses the entry A2 instead of a country code to denote satellite customers, and another GeoIP database I know about uses 'ZZ' for unknown.
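
If you are tallying 404s by country, it may help to keep those placeholder entries out of the real country counts. A minimal sketch, assuming your lookup hands back a two-character code; the function name and the returned object shape are made up for illustration, and other databases may use different placeholder codes:

```javascript
// Treat satellite/unknown placeholder codes separately from real countries.
// 'A2' and 'ZZ' are the codes mentioned above; other databases may differ.
const PLACEHOLDER_CODES = { A2: 'satellite provider', ZZ: 'unknown' };

function classifyCountryCode(code) {
  const key = String(code || '').toUpperCase();
  if (Object.prototype.hasOwnProperty.call(PLACEHOLDER_CODES, key)) {
    // Not a real country: report why instead of guessing a location
    return { country: null, note: PLACEHOLDER_CODES[key] };
  }
  return { country: key || null, note: null };
}

console.log(classifyCountryCode('A2'));
console.log(classifyCountryCode('CN'));
```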