91.205.124.0 - 91.205.127.255
netname: GIGABASE-NET
descr: Gigabase Ltd
country: RU
Rather odd that it claims in the UA to be a UK domain until one reads the blurb at gigabase.com, which claims it's multi-national...
"The company was registered on August 2008 and is financed by it's foundators solely." (what's a foundator?)
Sounds like it's masquerading as a UK company without actually being one. GIGABASE Ltd isn't registered with Companies House (or CH isn't admitting it is!) and nor is yanga under that name.
yanga.co.uk is hosted at 92.241.182.* which is a Russian colo (Wahome colocation). The yanga.co.uk site seems to consist of a single home page with search box - no links to any other page saying who/what/why.
Google returns four hits on "GIGABASE Ltd", one of which is webalta in Russian.
I say stupidly because they tried to grab my media files directly without going thru my webpages.
And of course, with leech protection for .jpg's enabled they all failed! Didn't even have to 403 them, tho now I have.
Strangely I had something that claimed to be yahoo vertical mail crawler try something similar. regular Slurp doesn't even touch my media files nowadays tho they did for a while.
Yahoo or not (is it really Yahoo, seems to be one of their ranges?) I had to ban them for being stupid and ignoring 302 errors.
Being Russian - I don't suppose it's part of the botnet game? Nah. Too public, surely. And if it's going straight for media (can't corroborate, haven't checked the site logs) then there wouldn't be much point. Still, again being Russian, it's banned.
ignoring 302 errors
302 is not an error - it's effectively a temporary redirect.
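For anyone who wants to see the distinction spelled out, here is a minimal sketch using Python's standard `http.HTTPStatus`. The `classify` helper is a made-up name for illustration, not something from this thread:

```python
from http import HTTPStatus


def classify(status: int) -> str:
    """Return the broad class of an HTTP status code (hypothetical helper)."""
    if 300 <= status < 400:
        return "redirect"
    if 400 <= status < 500:
        return "client error"
    if 500 <= status < 600:
        return "server error"
    return "other"


# 302 falls in the redirect class; 403 is a genuine client error.
for code in (302, 403):
    print(code, HTTPStatus(code).phrase, "->", classify(code))
```

A well-behaved client follows the Location header on a 302 rather than treating the response as a failure.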
--
This is not to defend their practices or intentions, just telling you what I know about this user-agent.
yanga search engine - mozilla.feedback.firefox | Google Groups [groups.google.com]
My name is Alexey and I am owner and CEO Yanga project. Now we have only one search cluster and if this cluster down we use Yahoo API as next cluster. Sorry, but we don't have money for two clusters now :( Now we use 100% our results.
Also we have a backlinks for SEO [yanga.co.uk...] with text links.
I don't have any botnets, we planned to start partnership programm with toolbar traffic (As Ask,Google,Miva,Yahoo ... etc).
If you have any question - write me :)
ps. wahome.ru - is biggest russian datacenter with 6000 servers.
It seems that the logic in your robots.txt parser needs some improvement. Although our robots.txt file is whitelist-based and Yanga does not appear on the whitelist, it still attempts to fetch pages:
91.205.***.8 - - [13/Oct/2008:07:06:08 -0700] "GET /robots.txt HTTP/1.1" 200 3157 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"
91.205.***.8 - - [13/Oct/2008:07:06:09 -0700] "GET / HTTP/1.1" 403 666 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"
As you might understand, webmasters become very suspicious when a robot violates robots.txt. As an example, since I use a whitelist, I observe new robots that appear in our access log file, and only take action to "Allow" those that seem to offer some advantage (that is, search-driven traffic) and obey the initially-denied state expressed in our robots.txt file. Unfortunately, Yanga failed this test.
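The sort of log check described above can be roughed out in a few lines of Python. This is only a sketch: the regex assumes the combined log format seen in the two sample lines, and `fetched_after_robots` is a hypothetical helper name. It only flags agents that requested something else after reading robots.txt; whether that is a violation still depends on what your robots.txt says.

```python
import re

# Combined-log-style pattern. The IP field is matched loosely because the
# sample lines have one octet masked with "***".
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# The two log lines quoted in the post above.
SAMPLE = [
    '91.205.***.8 - - [13/Oct/2008:07:06:08 -0700] "GET /robots.txt HTTP/1.1" '
    '200 3157 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"',
    '91.205.***.8 - - [13/Oct/2008:07:06:09 -0700] "GET / HTTP/1.1" '
    '403 666 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"',
]


def fetched_after_robots(lines):
    """Return user-agents that requested another path after fetching robots.txt."""
    seen_robots = set()
    flagged = set()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        agent, path = m["agent"], m["path"]
        if path == "/robots.txt":
            seen_robots.add(agent)
        elif agent in seen_robots:
            flagged.add(agent)
    return flagged


print(fetched_after_robots(SAMPLE))
```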
To be clear, our robots.txt is constructed like this (simplified example):
# Whitelisted user-agents are allowed
User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
Disallow: /admin
Disallow: /cgi-bin

# Disallow all others
User-agent: *
Disallow: /
Yanga does not parse this file correctly, and attempts to fetch resources from the site. All of the "allowed" user-agents parse this robots.txt file correctly, as do many other "disallowed" user-agents.
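As a cross-check on what a conforming parser should make of that file, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt text is the simplified example from this post, written with the two records separated by a blank line; the example.com URLs are placeholders. This shows how one parser reads the file, not how Yanga's parser works.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Whitelisted user-agents are allowed
User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
Disallow: /admin
Disallow: /cgi-bin

# Disallow all others
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

YANGA_UA = "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"

# Whitelisted agents may fetch everything except the two listed paths.
print(parser.can_fetch("Googlebot", "http://example.com/"))
print(parser.can_fetch("Googlebot", "http://example.com/admin/"))
# Any agent not on the whitelist falls through to the catch-all record.
print(parser.can_fetch(YANGA_UA, "http://example.com/"))
```

With this parser the several User-agent lines stacked over one rule set are treated as a single record, which is the reading the whitelist relies on.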
I strongly suggest fixing this problem before your robot's reputation is destroyed by threads like this one, many of which will be less-informed and more suspicious.
Jim
Are you aware that the Live robots.txt validator doesn't like that format?
Error: MSNBOT isn't allowed to crawl the site.
**************************************************
Line #3: User-agent: slurp
Error: 'user-agent' tag should be followed by a different tag.
**************************************************
Line #4: User-agent: msnbot
Error: 'user-agent' tag should be followed by a different tag.
**************************************************
Line #5: User-agent: teoma
Error: 'user-agent' tag should be followed by a different tag.
**************************************************
Google likes it, it should be valid, but Live's validator doesn't.
Don't know if msnbot reads that right or wrong.
Anyway, not trying to hijack the thread away from Yanga, just pointing out the SEs have disagreements on that particular file implementation.
Another reason I went to dynamic robots.txt and serve it up on demand so there's no room for interpretation of my exact intent.
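The post doesn't say how the dynamic version is built, so the following is only a guess at the general idea: pick the robots.txt body from the requesting user-agent, so each bot sees a single unambiguous record. `WHITELIST` and `robots_txt_for` are made-up names, and the tokens and paths are borrowed from Jim's simplified example.

```python
# Hypothetical whitelist, taken from the simplified robots.txt example above.
WHITELIST = ("googlebot", "slurp", "msnbot", "teoma")


def robots_txt_for(user_agent: str) -> str:
    """Return the robots.txt body to serve to the given user-agent string."""
    ua = user_agent.lower()
    if any(token in ua for token in WHITELIST):
        # Whitelisted bots: everything allowed except the private paths.
        return "User-agent: *\nDisallow: /admin\nDisallow: /cgi-bin\n"
    # Everyone else: a single record that disallows the whole site.
    return "User-agent: *\nDisallow: /\n"


print(robots_txt_for("Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"))
```

User-agent strings are trivially forged, so a real setup would presumably pair this with an IP or reverse-DNS check before trusting a claimed Googlebot.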
On this particular site, I don't have the option of doing dynamic robots.txt -- The host Aliases robots.txt to their script that apparently checks to be sure that the shopping cart scripts they provide are Disallowed (or some such check), and if so, pipes the customer's robots.txt through their script. As a result, we're into the content-handling phase of the API -- No SSI, no scripts, and no mod_rewrite available any more. It's the only thing I really dislike about this particular host. If I move, that'll be why.
Nevertheless, this construct was included in the original Standard, and I continue to take search engines to task if they mishandle it. I've already gone one round with Live, and the result was that they fixed that aspect of their parser, and also the previous (very annoying) problem of not differentiating their various user-agent strings when parsing robots.txt. So they do listen and act. Their support group is aware that the validator parser needs to be updated/sync'ed with the real one -- Hopefully that will be acted upon, too.
I tend to make a lot of noise at the search engines themselves, and only gripe here if they do nothing... Next up is Cuil; I've tried just about everything to make Twiceler aware that it can fetch some pages, but it's just not very smart.
Jim
[edited by: incrediBILL at 1:12 pm (utc) on Jan. 23, 2009]
[edit reason] removed comment, see TOS #26 [/edit]
The only recourse I see to protect the availability of my site to human users is to ban some of these "me too" SE bots, especially when they don't result in enough traffic to justify their existence and/or they fail to obey the "crawl-delay" directive.
Yanga bot has been put on my ban list for the above reasons and because it is on a Russian IP address while claiming to be a UK company. So many spam bots are coming out of Russia now that banning Russian-based bots is a necessity.
One solitary post by the "supposed" owner without substantiation is not enough to convince me that their game is legit.