Yanga WorldSearch Bot



keyplyr

7:03 pm on Sep 11, 2008 (gmt 0)

Read robots.txt, then disobeyed it and requested disallowed files. Russian owned, no contact info. Banned until they conform.

77.91.224.6 - - [11/Sep/2008:09:17:39 -0400] "GET /robots.txt HTTP/1.0" 200 4651 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"

incrediBILL

1:35 am on Sep 12, 2008 (gmt 0)

It appears to be associated with Webalta:

inetnum: 77.91.224.0 - 77.91.224.255
netname: WEBALTA-NET
descr: WEBALTA / Internet Search Company
Search Engine Servers
country: RU

keyplyr

6:50 am on Sep 12, 2008 (gmt 0)

Well then banned by range :)

dstiles

9:44 pm on Sep 19, 2008 (gmt 0)

I've been getting this to the home page of most domains on my server over the past couple of hours but from the range 91.205.124.* which resolves as:

91.205.124.0 - 91.205.127.255
netname: GIGABASE-NET
descr: Gigabase Ltd
country: RU

Rather odd that it claims in the UA to be a UK domain until one reads the blurb at gigabase.com, which claims it's multi-national...

"The company was registered on August 2008 and is financed by it's foundators solely." (what's a foundator?)

Sounds like it's masquerading as a UK company without actually being one. GIGABASE Ltd isn't registered with Companies House (or CH isn't admitting it is!) and nor is yanga under that name.

yanga.co.uk is hosted at 92.241.182.* which is a Russian colo (Wahome colocation). The yanga.co.uk site seems to consist of a single home page with search box - no links to any other page saying who/what/why.

Google returns four hits on "GIGABASE Ltd", one of which is Webalta in Russian.
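For anyone who wants to ban by range, as suggested above, here is a minimal Python sketch that converts the two whois ranges quoted in this thread into CIDR networks and tests an address against them. The names `WHOIS_RANGES`, `BLOCKED` and `is_blocked` are made up for illustration; how the resulting networks get fed into a firewall or server config is left open.

```python
import ipaddress

# Ranges copied from the whois output quoted in this thread.
WHOIS_RANGES = [
    ("77.91.224.0", "77.91.224.255"),    # WEBALTA-NET
    ("91.205.124.0", "91.205.127.255"),  # GIGABASE-NET
]


def to_networks(ranges):
    """Collapse first/last address pairs into CIDR networks."""
    networks = []
    for first, last in ranges:
        networks.extend(
            ipaddress.summarize_address_range(
                ipaddress.ip_address(first), ipaddress.ip_address(last)
            )
        )
    return networks


BLOCKED = to_networks(WHOIS_RANGES)


def is_blocked(ip):
    """Return True if ip falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED)


print([str(net) for net in BLOCKED])  # ['77.91.224.0/24', '91.205.124.0/22']
print(is_blocked("91.205.124.6"))     # True
```

A whois range can change hands, so a list like this probably needs rechecking from time to time.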

keyplyr

8:38 am on Sep 20, 2008 (gmt 0)

The old shell game. Webalta/Yanga has been at it for over a year now. Anyone know what they're doing with the data?

91.205.124.6 - - [20/Sep/2008:01:18:56 -0400] "GET /index.html HTTP/1.0" 403 461 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"

Megaclinium

11:53 pm on Sep 20, 2008 (gmt 0)

I had them kind of stupidly hit me from same address .14

I say stupidly because they tried to grab my media files directly without going thru my webpages.

And of course, with leech protection for .jpg's enabled, they all failed! Didn't even have to 403 them, tho now I have.

Strangely, I had something that claimed to be the Yahoo vertical mail crawler try something similar. Regular Slurp doesn't even touch my media files nowadays, tho they did for a while.

Yahoo or not (is it really Yahoo, seems to be one of their ranges?) I had to ban them for being stupid and ignoring 302 errors.

dstiles

1:21 am on Sep 21, 2008 (gmt 0)

... doing with the data... No idea. I only looked at the google cache version of the site so I made no search.

Being Russian - I don't suppose it's part of the botnet game? Nah. Too public, surely. And if it's going straight for media (can't corroborate, haven't checked the site logs) then there wouldn't be much point. Still, again being Russian, it's banned.

Lord Majestic

11:23 pm on Sep 22, 2008 (gmt 0)

This is the Webalta search project, purchased by another Russian entity (the mobile content portal artfon.com; note the Webalta logo on their press release) that appears to have worldwide search ambitions. I do not think they are botnet related, even though they are certainly of Russian origin.

"ignoring 302 errors"

302 is not an error - it's effectively a temporary redirect.

This is not to defend their practices or intentions, just telling you what I know about this user-agent.

nativenewyorker

2:31 am on Oct 12, 2008 (gmt 0)

Suspicions about Yanga being malevolent may be justified. There are reports of a Yanga search engine hijacking IE and FF browsers and replacing Google with itself.

yanga search engine - mozilla.feedback.firefox | Google Groups [groups.google.com]

unimaximus

6:08 pm on Oct 13, 2008 (gmt 0)

Hello guys

My name is Alexey and I am the owner and CEO of the Yanga project. Right now we have only one search cluster, and if this cluster goes down we use the Yahoo API as the next cluster. Sorry, but we don't have money for two clusters now :( At the moment 100% of the results are our own.

We also have backlinks for SEO [yanga.co.uk...] with text links.

I don't have any botnets; we are planning to start a partnership program with toolbar traffic (like Ask, Google, Miva, Yahoo, etc.).

If you have any questions, write me :)

P.S. wahome.ru is the biggest Russian datacenter, with 6000 servers.

jdMorgan

7:45 pm on Oct 13, 2008 (gmt 0)

Hello Alexey, and welcome to WebmasterWorld!

It seems that the logic in your robots.txt parser needs some improvement. Although our robots.txt file is whitelist-based and Yanga does not appear on the whitelist, it still attempts to fetch pages:

91.205.***.8 - - [13/Oct/2008:07:06:08 -0700] "GET /robots.txt HTTP/1.1" 200 3157 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"
91.205.***.8 - - [13/Oct/2008:07:06:09 -0700] "GET / HTTP/1.1" 403 666 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"

As you might understand, webmasters become very suspicious when a robot violates robots.txt. As an example, since I use a whitelist, I observe new robots that appear in our access log file, and only take action to "Allow" those that seem to offer some advantage (that is, search-driven traffic) and obey the initially-denied state expressed in our robots.txt file. Unfortunately, Yanga failed this test.
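The pattern in the two log lines above (fetch robots.txt, then request something else anyway) can be spotted mechanically. Below is a rough Python sketch, assuming combined-format log lines and assuming the input has already been filtered down to agents that robots.txt disallows entirely. The regex and the function name `agents_ignoring_robots` are illustrative only, not part of any existing tool.

```python
import re

# Pulls the request path and user-agent out of a combined-format log line.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def agents_ignoring_robots(lines):
    """Return agents that fetched /robots.txt and later requested another path.

    Only meaningful when every agent in `lines` is fully disallowed.
    """
    read_robots = set()
    offenders = set()
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent, path = match.group("agent"), match.group("path")
        if path == "/robots.txt":
            read_robots.add(agent)
        elif agent in read_robots:
            offenders.add(agent)
    return offenders


# The two log lines quoted in this post, used as sample input.
SAMPLE = [
    '91.205.***.8 - - [13/Oct/2008:07:06:08 -0700] "GET /robots.txt HTTP/1.1" '
    '200 3157 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"',
    '91.205.***.8 - - [13/Oct/2008:07:06:09 -0700] "GET / HTTP/1.1" '
    '403 666 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"',
]

print(agents_ignoring_robots(SAMPLE))
```

User-agent strings are trivially spoofed, so a hit here is a reason to look closer rather than proof of who is crawling.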

To be clear, our robots.txt is constructed like this (simplified example):

# Whitelisted user-agents are allowed
User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
Disallow: /admin
Disallow: /cgi-bin
# Disallow all others
User-agent: *
Disallow: /

As you can see, the four named user-agents are allowed to fetch everything except two URL-paths, while all other user-agents are disallowed. This construct is in full compliance with the Standard for Robot Exclusion (Martijn Koster).

Yanga does not parse this file correctly, and attempts to fetch resources from the site. All of the "allowed" user-agents parse this robots.txt file correctly, as do many other "disallowed" user-agents.
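As a sanity check on the file format itself, the simplified example above can be run through Python's standard-library `urllib.robotparser`. This only shows how that one parser reads a multi-agent record; it says nothing about how any search engine's own parser behaves. The short agent tokens passed to `can_fetch` are chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The simplified robots.txt from this post, copied verbatim.
ROBOTS_TXT = """\
# Whitelisted user-agents are allowed
User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
Disallow: /admin
Disallow: /cgi-bin
# Disallow all others
User-agent: *
Disallow: /
"""

YANGA = "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    ("googlebot", "/"),       # whitelisted, ordinary page
    ("googlebot", "/admin"),  # whitelisted, but path is disallowed
    ("msnbot", "/"),          # whitelisted via the shared record
    (YANGA, "/"),             # not whitelisted, falls through to "*"
]
for agent, path in checks:
    print(agent, path, parser.can_fetch(agent, path))
```

With this parser the four named agents may fetch `/` but not `/admin`, and the Yanga user-agent string falls through to the catch-all record and is denied.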

I strongly suggest fixing this problem before your robot's reputation is destroyed by threads like this one, many of which will be less-informed and more suspicious.

Jim

incrediBILL

2:24 am on Oct 14, 2008 (gmt 0)

Jim,

Are you aware that the Live robots.txt validator doesn't like that format?

Error: MSNBOT isn't allowed to crawl the site.
**************************************************
Line #3: User-agent: slurp
Error: 'user-agent' tag should be followed by a different tag.
**************************************************
Line #4: User-agent: msnbot
Error: 'user-agent' tag should be followed by a different tag.
**************************************************
Line #5: User-agent: teoma
Error: 'user-agent' tag should be followed by a different tag.
**************************************************

Google likes it, it should be valid, but Live's validator doesn't.

Don't know if msnbot reads that right or wrong.

Anyway, not trying to hijack the thread away from Yanga, just pointing out the SEs have disagreements on that particular file implementation.

Another reason I went to dynamic robots.txt and serve it up on demand so there's no room for interpretation of my exact intent.
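The post does not say how the dynamic robots.txt is built, so the following is only a guess at the general idea: pick the response body from the requesting user-agent, so that each bot sees a single unambiguous record. The whitelist tokens are borrowed from the example file earlier in the thread; `robots_txt_for` and the two body constants are hypothetical names.

```python
# Hypothetical whitelist; tokens mirror the example robots.txt in this thread.
ALLOWED_TOKENS = ("googlebot", "slurp", "msnbot", "teoma")

# Body served to whitelisted bots: everything except two paths.
ALLOWED_BODY = "User-agent: *\nDisallow: /admin\nDisallow: /cgi-bin\n"

# Body served to everyone else: nothing may be fetched.
DENIED_BODY = "User-agent: *\nDisallow: /\n"


def robots_txt_for(user_agent):
    """Return the robots.txt body to serve for a given User-Agent header."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in ALLOWED_TOKENS):
        return ALLOWED_BODY
    return DENIED_BODY


print(robots_txt_for("Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"))
```

Matching on the user-agent alone is easy to spoof, so a real setup would presumably also verify the requesting IP before handing out the permissive version.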

jdMorgan

2:41 am on Oct 14, 2008 (gmt 0)

Live's 'bot handles it fine, but their validator doesn't use the same parser, apparently. The validator is just wrong in this case -- It's perfectly valid to construct policy records that apply to more than one user-agent, and this has been in the Standard from Day One.

On this particular site, I don't have the option of doing dynamic robots.txt -- The host Aliases robots.txt to their script that apparently checks to be sure that the shopping cart scripts they provide are Disallowed (or some such check), and if so, pipes the customer's robots.txt through their script. As a result, we're into the content-handling phase of the API -- No SSI, no scripts, and no mod_rewrite available any more. It's the only thing I really dislike about this particular host. If I move, that'll be why.

Nevertheless, this construct was included in the original Standard, and I continue to take search engines to task if they mishandle it. I've already gone one round with Live, and the result was that they fixed that aspect of their parser, and also the previous (very annoying) problem of not differentiating their various user-agent strings when parsing robots.txt. So they do listen and act. Their support group is aware that the validator parser needs to be updated/sync'ed with the real one -- Hopefully that will be acted upon, too.

I tend to make a lot of noise at the search engines themselves, and only gripe here if they do nothing... Next up is Cuil; I've tried just about everything to make Twiceler aware that it can fetch some pages, but it's just not very smart.

Jim

koan

8:03 am on Nov 11, 2008 (gmt 0)

On the homepage, their example for search is: "Example: escorts in virginia". I really don't think they're legit.

keyplyr

10:27 am on Nov 12, 2008 (gmt 0)

6 months ago I decided to allow Yanga to crawl. While I am not seeing measurable traffic from them yet, they do return my pages in their SERP for the appropriate search terms. I'm also not seeing a spammy SERP for my topics. Time will tell.

enigma1

12:52 pm on Nov 15, 2008 (gmt 0)

When I check my logs, they come from an IP that does not resolve properly. I check them by User Agent and send them unknown binary content if detected. As far as I could see, they had a single page from my site indexed and several on "how to crack" my applications.

soothsayer

7:07 am on Jan 19, 2009 (gmt 0)

found out about this bot when they were slurping up my websites wholesale. i tried blocking them, first llnw.net, then the bots switched to 77.91.224*, next to a telenet.be* address, there's also a 91.205.124* addy that they use. all these addresses pass through the llnw.net network.

[edited by: incrediBILL at 1:12 pm (utc) on Jan. 23, 2009]
[edit reason] removed comment, see TOS #26 [/edit]

KenB

2:28 pm on Feb 16, 2009 (gmt 0)

What I'm finding about bots like this is that they are consuming way too much of my server resources compared to the amount of traffic they are generating. All of these "me too" bots are spending so much time indexing my site that bots now account for over half of my traffic. At times hits from the "SE" bots are coming in so fast from different organizations that they end up effectively creating a DOS that prevents real users from accessing my content for short periods of time.

The only recourse I see to protect the availability of my site to human users is to ban some of these "me too" SE bots, especially when they don't result in enough traffic to justify their existence and/or they fail to obey the "crawl-delay" directive.
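One way to decide which bots fail the "crawl-delay" test is to measure the smallest gap between consecutive requests per user-agent. The Python sketch below assumes the timestamps and agents have already been extracted from the access log. The two timestamps are borrowed from log lines quoted earlier in the thread purely as sample input, and the 10-second delay is a made-up value.

```python
from datetime import datetime

# Timestamp layout used in the access-log lines quoted in this thread.
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def min_gap_seconds(hits):
    """Map each agent to the smallest gap (seconds) between its requests.

    hits: iterable of (timestamp_string, agent) pairs.
    Agents with a single request are omitted.
    """
    times = {}
    for stamp, agent in hits:
        times.setdefault(agent, []).append(datetime.strptime(stamp, TIME_FORMAT))
    gaps = {}
    for agent, stamps in times.items():
        stamps.sort()
        if len(stamps) > 1:
            gaps[agent] = min(
                (later - earlier).total_seconds()
                for earlier, later in zip(stamps, stamps[1:])
            )
    return gaps


def crawl_delay_violators(hits, crawl_delay):
    """Return agents whose smallest gap is shorter than crawl_delay seconds."""
    return sorted(
        agent for agent, gap in min_gap_seconds(hits).items() if gap < crawl_delay
    )


AGENT = "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"
HITS = [
    ("13/Oct/2008:07:06:08 -0700", AGENT),
    ("13/Oct/2008:07:06:09 -0700", AGENT),
]

print(crawl_delay_violators(HITS, crawl_delay=10))  # hypothetical 10 s delay
```

A single short gap can be a fluke (for example robots.txt followed by the first page), so counting how often the delay is broken is probably a fairer test than the minimum alone.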

Yanga bot has been put on my ban list for the above reasons and because it is on a Russian IP address while claiming to be a UK company. So many spam bots are coming out of Russia now that banning Russian based bots is a necessity.

One solitary post by the "supposed" owner without substantiation is not enough to convince me that their game is legit.