yoofind

yoono webcrawler new look

Hobbs

11:30 am on Apr 29, 2008 (gmt 0)

User Agent:
"yoofind/yoofind-0.1-dev (yoono webcrawler; [yoono.com...] ; crawl@yoono.com)"

Coming from: 194.0.179.zzz

Previously spotted as:
"Mozilla/5.0 (compatible; yoono; [yoono.com...]
[webmasterworld.com...]

And:
"yoono/1.0 web-crawler/1.0 [yoono.com...]
[webmasterworld.com...]

incrediBILL

2:12 am on May 11, 2008 (gmt 0)

This is another one of those "people rated" toolbar things.

So basically, as people crawl your site, so does the toolbar, and your bandwidth gets double-dipped.

The most disturbing thing IMO is from their home page:

COLLECT: Grab what you like in one click (photos, videos, text...)

Excuse me?

Are they hot linking the images and text or scraping it?

Blocked until I learn more and probably blocked afterwards as well.

incrediBILL

2:15 am on May 11, 2008 (gmt 0)

Oh look, more fun from their FAQ:

The bot refreshes the web page titles once a week, and RSS / Atom feeds once an hour. We implemented HTTP conditional GET (with Last-Modified and ETag), so if your server supports it, the impact on your bandwidth and CPU will be minimal.

Ping ping ping...
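For readers who have not met conditional GET, here is a minimal sketch of what the FAQ describes, not yoono's actual code. The function names `conditional_headers` and `fetch_if_changed` are made up for illustration; the sketch assumes the bot stored the `ETag` and `Last-Modified` values from its previous visit.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the validator headers for a conditional GET (hypothetical helper)."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None when the server says 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: no body was sent, keep the stored validators.
            return None, etag, last_modified
        raise
```

A 304 reply still costs the server one request and a header exchange, so an hourly poll remains an hourly hit even when nothing has changed.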

Receptional Andy

11:04 pm on May 18, 2008 (gmt 0)

A pretty random bot, too:

[18/May/2008:00:47:35 +0100] GET /robots.txt HTTP/1.0 200
[18/May/2008:00:47:39 +0100] GET /robots.txt HTTP/1.1 200
[18/May/2008:00:47:39 +0100] GET /very-old-url HTTP/1.1 403
[18/May/2008:00:47:42 +0100] GET /robots.txt HTTP/1.1 200
[18/May/2008:00:47:42 +0100] GET /very-old-url HTTP/1.1 403

... repeat until yoono webcrawler no longer has any friends. /very-old-url hasn't existed in any form for at least 5 years. And what's with switching between HTTP 1.0 and 1.1, anyway?

[edited by: Receptional_Andy at 11:06 pm (utc) on May 18, 2008]

nicinch

9:42 am on May 30, 2008 (gmt 0)

hey

All right, I guess I'll be killed for this, but I'm the guy behind the yoono bot. I'll just clear up and explain some of the details and questions you guys had.

Yep, we are a people-related sidebar thingy, but the sidebar and the crawl are two different processes, so no, there is no double dip on the bandwidth. A user visits a site, but there is no crawl at that point. However, yes, once a month every bookmarked site in our index will be crawled once, to make sure it is still up and that we should still propose it. During the crawl, yes, I will take info such as the title etc. However, I fully respect any rules given in the robots.txt file. RSS is off right now, but when it is back up I'm leaning towards a once-a-day rule. The blog info is outdated; I really have to change that, sorry for the misleading information.

OK, that's the crawl. As for the "grab what you like" issue, rest assured we are talking only about links: you can link an image, a site, a video, etc. For one, we really don't have the disk space to allow our users to scrape the whole web, leaving aside the fundamental lack of morals in helping people steal.

If you have any comments, questions or even improvements to propose, I'll gladly listen. I just started here 6 months ago on the crawl stuff and am still adjusting to the company's particular needs while taking care not to crash anybody else's house.

nick

Samizdata

10:37 am on May 30, 2008 (gmt 0)

Welcome to WebmasterWorld Nick

My usual question when deciding whether to allow a bot on my sites is "what's in it for me?"

Got an answer?

nicinch

4:10 pm on May 30, 2008 (gmt 0)

hey

An answer, sure, a couple even.
The idea behind yoono is quite simple: basically I'm tired of getting sites forwarded to me through Google or others just because they have been around long or have enough money to get good ranking through links and whatnot. OK, I'm a little judgemental here, but there is a certain opacity as to how and why.
In yoono the users decide to share all or part of their bookmarks with others, and are recognized as sharers, basically saying "I value this site and its info as being accurate or useful or good". I leave out all the extra math to block out any wrongdoing. On the basis of this sharing action I'll crawl for existence and add a couple of extra algorithms to cluster, rank and aggregate the themes of the sites. I'd call it a user sieve approach, slightly different from the standard way and providing for maybe a more transparent mechanism.

So what is in it for you? Aside from the usual extra interest from people who are surfing and come to your site through us, there is also a general idea of trying to give users and sites a fairer chance based on the votes of their peers. Granted, all this might be a little naive and too nice to be true, but as you'll see yoono is a free project. Feel free to join as either a user or a resource provider or both.
Other than that it doesn't cost you a dime, which is a bare minimum.

Anywho, as a site owner I understand being master of one's domain, so by adding a simple line in your robots.txt file you won't be bothered by me anymore.

Samizdata

5:22 pm on May 30, 2008 (gmt 0)

by adding a simple line in your robots.txt file you won't be bothered by me anymore

I'm happy to say I was never bothered (automatic 403 on my sites for most bots, including yours) but I can't help thinking that my robots.txt file would be longer than War and Peace if I added a line for every crawler that came along and suggested it.

I might, however, add a line letting you in if I ever hear of people using your service.

Either way, I appreciate you engaging with the webmaster community.

incrediBILL

9:09 pm on May 30, 2008 (gmt 0)

Other than that it doesn't cost you a dime, which is a bare minimum.

It's nice to see a bot owner come to discuss things with webmasters and I'm not trying to be argumentative here, but free is not always free, it comes at a big cost to some webmasters.

Some of us run very big (100K+ pages) and busy sites with all of the hundreds (thousands?) of bots, both legit and malicious, hitting the sites monthly. The sheer volume of bot traffic causes the server to overload every now and then, which costs us downtime, meaning lost traffic, sales and advertising dollars, not to mention lost goodwill from people that tried to access a dead site and now think you're a joke.

To make sure we don't have outages, we often buy or build automatic bot blocking tools to curtail that traffic, and/or pay consultants or admins to continually review access to the site and update robots.txt and .htaccess to curtail it as well.

Therefore, it's far from free for many of us, a downright nuisance at a minimum.

It would be nice if all the bot owners would build a shared cache server so you can download a page once in 24 hours and all the bots then share the data from that cache server instead of 100 or more bots a day asking for the same pages over and over.

Otherwise, more and more bot operators will find themselves locked out from all the big sites that can't handle the excess load caused by the current flood of bot traffic.

Just my $0.02 worth.

[edited by: incrediBILL at 9:10 pm (utc) on May 30, 2008]

jdMorgan

9:59 pm on May 30, 2008 (gmt 0)

For clarification, let me ask: Is this a crawler then, or just a link checker?

That is, do you actually spider the entire target site, or just check the specific links that your users have put into their "shared" bookmark collections? If this is the case, then conceptually, what you've got is more like the DMOZ link-checker, and not a search engine spider. And in that case, a Yoono "crawler" request for a page indicates that at least one Yoono user has bookmarked the page, and we (as Webmasters) can decide if staying listed in the "shared bookmark" collection is worth the monthly bandwidth/cost or not.

One note on robots.txt handling: It is important to check for the "User-agent: Yoono ... Disallow: /" record, but equally-important to note that if there is no Yoono-specific record, then the robot should accept the "User-agent: *" record (if present) as a match in that case. A considerable number of smaller robots miss this important detail (and therefore get banned for robots.txt non-compliance).

Also be aware that

User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
Disallow: /cgi-bin

is perfectly-valid -- A single robots.txt record can apply to more than one user-agent. Many smaller 'bots miss this detail as well...
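Both points can be checked with Python's standard `urllib.robotparser`, shown here as a rough sketch only. The `/private` rule, the `load_rules` helper and the agent token `yoono` are assumptions added for illustration; the thread does not say which token the bot actually matches on.

```python
from urllib.robotparser import RobotFileParser

# One record shared by three user-agents, plus a catch-all record.
# The "/private" rule is invented for this example.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
Disallow: /cgi-bin

User-agent: *
Disallow: /private
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Slurp is named in the grouped record, so that record alone applies to it.
print("Slurp /cgi-bin/search:", rules.can_fetch("Slurp", "/cgi-bin/search"))
print("Slurp /private/page:  ", rules.can_fetch("Slurp", "/private/page"))

# "yoono" is not named anywhere, so it falls back to the "*" record.
print("yoono /private/page:  ", rules.can_fetch("yoono", "/private/page"))
print("yoono /cgi-bin/search:", rules.can_fetch("yoono", "/cgi-bin/search"))
```

A robot with its own record follows only that record, which is why the fallback to `User-agent: *` matters for every bot that is not named.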

Thanks for engaging with the Webmaster community. We appreciate the information and the fact that you're seeking feedback!

Jim

nicinch

4:27 pm on Jun 3, 2008 (gmt 0)

hi everyone

OK, I was a little quick on the "it doesn't cost you a dime".
I can understand the bandwidth consumption bots can generate. I like the idea of a sort of bot-dedicated index instead of accessing the site itself. Granted, having a zillion spiders out there doing the same old thing and reinventing the wheel isn't doing much for anyone. On a side note, I'll see if Google will give access to their index; trying to keep a stable crawler is a pain in my bosom.

This crawler is a link checker in a way, yes; I will only crawl links given by users (once they have gone through a cleaning process), so if I only have one page of site A I'll only get that page. However, I do group all pages belonging to a particular host in order to query while respecting delays and a maximum number of concurrent threads. In this way even a robots.txt denial, an access denied or anything else is valuable information: the site probably exists and seems up, so I can still propose it (I have an optimistic approach here).
I will, if allowed access, fetch more information such as title text, keywords, outlinks etc., essentially in a very common cluster and PageRank approach. Due to disk space I'll keep a bare minimum though. Most of the data is destroyed after I calculate certain attributes. If possible I will respect the last modified date.

For the robots.txt part, I will respect the general rules and the specific ones for the yoono bot. If I am not specifically denied, I'll respect the rules for general bots. This seems the more respectful way; if you think I should be even more aggressive, please let me know how and why.
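The per-host grouping described above could look roughly like the sketch below. It is an illustration under assumptions, not the yoono implementation: `group_by_host`, `schedule` and the 5-second delay are invented here, and the cap on concurrent threads is left out to keep it short.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Bucket bookmarked URLs by hostname, keeping their original order."""
    groups = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if host:  # skip anything without a usable host
            groups[host].append(url)
    return dict(groups)


def schedule(urls, delay_seconds=5.0):
    """Return (offset_in_seconds, url) pairs, spacing requests to one host.

    Different hosts may share an offset; the same host never does.
    The default delay is an arbitrary value chosen for the example.
    """
    plan = []
    for pages in group_by_host(urls).values():
        for position, page in enumerate(pages):
            plan.append((position * delay_seconds, page))
    plan.sort(key=lambda item: item[0])
    return plan


if __name__ == "__main__":
    bookmarks = [
        "http://a.example/1",
        "http://b.example/x",
        "http://a.example/2",
    ]
    for offset, url in schedule(bookmarks):
        print(f"t+{offset:>4.1f}s  {url}")
```

Grouping first means the delay is enforced per host, so a site with many bookmarked pages is never hit in a burst while unrelated hosts can still be checked in parallel.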
