Crawler/1.0 http://elibron.com

No robots.txt

GaryK

12:11 am on Jul 24, 2006 (gmt 0)

Crawler/1.0 [elibron.com...]
83.149.215.35
No PTR

While the above user agent was crawling my sites, Python-urllib/2.4 was also crawling them from the same IP address.

thetrasher

10:13 am on Jul 24, 2006 (gmt 0)

I saw it coming from 83.149.215.25 = gw25.jscc.ru
"Python-urllib/2.4" read robots.txt, "Crawler/1.0+http://elibron.com" read the index page.

- Joint SuperComputer Center of Russian Academy of Sciences -
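For anyone who wants to repeat the PTR check, here is a minimal Python sketch. It relies on the standard library's socket.gethostbyaddr; the helper name ptr_lookup and its resolver parameter are illustrative assumptions, not something used by the posters above. Results depend on live DNS, so an address that had a PTR record in 2006 may return nothing today.

```python
import socket

def ptr_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the reverse-DNS (PTR) hostname for ip, or None if lookup fails.

    `resolver` is a hypothetical hook that defaults to socket.gethostbyaddr;
    passing a stand-in lets the function be exercised without network access.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        # herror: the address has no PTR record.
        # gaierror: the address could not be resolved at all.
        return None
```

Calling ptr_lookup("83.149.215.25") performs a real lookup; a None result corresponds to the "No PTR" note in the first post.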

GaryK

3:17 pm on Jul 24, 2006 (gmt 0)

That seems like a silly way to handle robots.txt. It makes it look like the main crawler is ignoring the file, which makes it more likely to get banned.
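To illustrate the alternative, here is a rough Python 3 sketch of a crawler that requests robots.txt under the same user agent it uses for pages, so server logs show one consistent identity. This is not how the Elibron crawler is known to work; the USER_AGENT string and the function names are assumptions made for the example. Parsing is done with the standard library's urllib.robotparser.

```python
import urllib.request
import urllib.robotparser

# Assumed UA string, modeled on the one reported in this thread.
USER_AGENT = "Crawler/1.0 http://elibron.com"

def fetch_robots_lines(robots_url, user_agent=USER_AGENT, timeout=10):
    """Download robots.txt while identifying as the crawler itself."""
    request = urllib.request.Request(
        robots_url, headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        text = response.read().decode("utf-8", errors="replace")
    return text.splitlines()

def build_parser(robots_lines):
    """Turn already-downloaded robots.txt lines into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser

def allowed(parser, url, user_agent=USER_AGENT):
    """Check a URL against the parsed rules for this crawler's UA."""
    return parser.can_fetch(user_agent, url)
```

A crawler built this way would call fetch_robots_lines once per host, then allowed() before each page request, all under a single user agent.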

Thanks for the rDNS/PTR.

You've been a wealth of information lately. Thank you.

Pfui

6:36 pm on Jul 24, 2006 (gmt 0)

Two days, three hits, three servers, and zero robots.txt requests under the "Crawler" user agent itself. Also, judging by the files it tried to hit, it was following DMOZ links.

gw26.jscc.ru - - [20/Jul/2006:03:43:19 -0700] "GET /dir/file.html HTTP/1.1" 403 815 "-" "Crawler/1.0+http://elibron.com"

gw25.jscc.ru - - [20/Jul/2006:22:27:45 -0700] "GET / HTTP/1.1" 403 815 "-" "Crawler/1.0+http://elibron.com"

83.149.215.35 - - [21/Jul/2006:05:07:10 -0700] "GET /dir2/ HTTP/1.1" 403 - "-" "Crawler/1.0+http://elibron.com"

Excellent cross-spotting re the Python connection, Gary!

gw26.jscc.ru - - [20/Jul/2006:03:43:18 -0700] "GET /robots.txt HTTP/1.1" 403 815 "-" "Python-urllib/2.4"
gw25.jscc.ru - - [20/Jul/2006:22:27:44 -0700] "GET /robots.txt HTTP/1.1" 403 815 "-" "Python-urllib/2.4"
83.149.215.35 - - [21/Jul/2006:05:07:10 -0700] "GET /robots.txt HTTP/1.1" 403 815 "-" "Python-urllib/2.4"

What a wonky way to do things. Or dumb like a fox -- if Python gets 403'd, Crawler sails on in...
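Spotting this kind of pairing by eye gets tedious, so here is a small Python sketch that flags it in combined-format access logs: a robots.txt request from one user agent followed within a few seconds by a page request from the same host under a different user agent. The regular expression, the function names, and the 5-second window are assumptions chosen for illustration. The SAMPLE entries are the log lines quoted above, with each entry written on one line.

```python
import re
from datetime import datetime, timedelta

# Assumed pattern for Apache-style combined log lines.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_RE.match(line)
    if not match:
        return None
    record = match.groupdict()
    # %b expects English month abbreviations (the default C locale).
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record

def paired_agents(lines, window=timedelta(seconds=5)):
    """List (host, robots_agent, page_agent) where a robots.txt fetch is
    followed within `window` by a page fetch from the same host under a
    different user agent."""
    records = [r for r in map(parse_line, lines) if r]
    robots = [r for r in records if r["path"] == "/robots.txt"]
    pages = [r for r in records if r["path"] != "/robots.txt"]
    pairs = []
    for rb in robots:
        for pg in pages:
            delay = pg["time"] - rb["time"]
            if (rb["host"] == pg["host"]
                    and rb["agent"] != pg["agent"]
                    and timedelta(0) <= delay <= window):
                pairs.append((rb["host"], rb["agent"], pg["agent"]))
    return pairs

SAMPLE = [
    'gw26.jscc.ru - - [20/Jul/2006:03:43:18 -0700] "GET /robots.txt HTTP/1.1" 403 815 "-" "Python-urllib/2.4"',
    'gw26.jscc.ru - - [20/Jul/2006:03:43:19 -0700] "GET /dir/file.html HTTP/1.1" 403 815 "-" "Crawler/1.0+http://elibron.com"',
    'gw25.jscc.ru - - [20/Jul/2006:22:27:44 -0700] "GET /robots.txt HTTP/1.1" 403 815 "-" "Python-urllib/2.4"',
    'gw25.jscc.ru - - [20/Jul/2006:22:27:45 -0700] "GET / HTTP/1.1" 403 815 "-" "Crawler/1.0+http://elibron.com"',
    '83.149.215.35 - - [21/Jul/2006:05:07:10 -0700] "GET /robots.txt HTTP/1.1" 403 815 "-" "Python-urllib/2.4"',
    '83.149.215.35 - - [21/Jul/2006:05:07:10 -0700] "GET /dir2/ HTTP/1.1" 403 - "-" "Crawler/1.0+http://elibron.com"',
]
```

Running paired_agents(SAMPLE) should pair Python-urllib/2.4 with the Crawler user agent on each of the three hosts. On a busy log the nested loop is quadratic, so grouping records by host first would be the obvious refinement.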

-----

Btw, "Elibron.com" didn't answer the door, but Domain-t-o-o-l-s gives its description as:

"Offers books in different languages, as well as music scores, visual art works, and historic photographs."

And "Brad DeLong's Semi-Daily Journal" (a blog, so no link) says:

"Elibron.com publishes cheap reprints and e-books in very small press runs, though mostly public domain stuff."

So why is a Russian super-computing entity crawling sites using TWO bots, one of them from a package-and-sell-as-PDFs site? Is it borrowing the bots, or crawling for Elibron? (Can't say I like the sound of either of those scenarios, actually.)

GaryK

7:24 pm on Jul 24, 2006 (gmt 0)

"So why is a Russian super-computing entity crawling sites using TWO bots"

Maybe you know something I don't. But it seems to me if I wanted to impress someone I could form a company named:

American SuperComputer Center for Browser Studies. ;)

Sounds impressive until you discover the super-computer consists of the desktop, laptop, and four servers on my home LAN, and the browser studies means looking for new user agents. :)