bnf.fr bot

Dimitri

9:06 pm on Jun 24, 2018 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



I am not an expert, so this is only partial information:

UA: Mozilla/5.0 (compatible; bnf.fr_bot; +http://www.bnf.fr/fr/outils/a.dl_web_capture_robot.html)
Robots.txt: No

IP: 194.199.7.** [ robot4-depot-legal-web.bnf.fr ]

It uses Heritrix, the crawler from Archive.org: webarchive.jira.com/wiki/spaces/Heritrix/overview

It is supposed to archive French sites, without respecting the robots.txt (because the law gives them this right!), but my sites are neither in French nor hosted in France...
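
For anyone who wants to spot this bot in their logs, here is a minimal Python sketch, not anything published by the BnF. It pulls the bot name and info URL out of the UA string reported above and checks whether a reverse-DNS hostname falls under a given domain. The function names `parse_bot_ua` and `hostname_in_domain` are made up for illustration, and the regex assumes the common "(compatible; name; +url)" layout. A UA string can be faked and a reverse-DNS name on its own can be spoofed, so treat a match as a hint rather than proof.

```python
import re

# UA string exactly as reported in this post.
REPORTED_UA = (
    "Mozilla/5.0 (compatible; bnf.fr_bot; "
    "+http://www.bnf.fr/fr/outils/a.dl_web_capture_robot.html)"
)

# Assumed layout: "(compatible; <name>; +<url>)".
UA_PATTERN = re.compile(r"\(compatible;\s*(?P<name>[^;]+);\s*\+(?P<url>[^)\s]+)\)")


def parse_bot_ua(user_agent):
    """Return (bot_name, info_url), or None if the UA does not fit the layout."""
    match = UA_PATTERN.search(user_agent)
    if match is None:
        return None
    return match.group("name").strip(), match.group("url")


def hostname_in_domain(hostname, domain):
    """True if hostname is the domain itself or one of its subdomains."""
    hostname = hostname.lower().rstrip(".")
    domain = domain.lower().rstrip(".")
    return hostname == domain or hostname.endswith("." + domain)


if __name__ == "__main__":
    print(parse_bot_ua(REPORTED_UA))
    print(hostname_in_domain("robot4-depot-legal-web.bnf.fr", "bnf.fr"))
```

The subdomain check compares against "." plus the domain, so a lookalike such as evilbnf.fr does not pass.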


- - -

[edited by: keyplyr at 1:01 am (utc) on Jun 25, 2018]
[edit reason] obscured private IP address & delinked URL [/edit]

lucy24

9:40 pm on Jun 24, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It is supposed to archive French sites, without respecting the robots.txt (because the law gives them this right!)
Say what now? Does the law also give websites the obligation to let them crawl, or do we still retain the right to physically block them? What if we don't live in France? (As it happens, I personally approve of archiving such as the Wayback Machine, but wtf.)

Dimitri

10:14 pm on Jun 24, 2018 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



I am not yet done reading and assimilating the whole GDPR thing, but it seems that, at some point in the near future, it will be forbidden to block access to a site if a visitor refuses cookies? So, if a government is archiving the web, I guess they can pass laws of this kind about crawling too ...

But about this archiving, I wonder what it will look like under the EU copyright directive (which I haven't finished reading about either).

lucy24

11:49 pm on Jun 24, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



at some point in the near future, it will be forbidden to block access to a site if a visitor refuses cookies?
I don't think that will have any effect on robots, since all they have to do is “accept” the cookie and then quietly throw it away. A cookie is only meaningful on a subsequent request, when you send the cookie back to the originating site. And if at this point the site says “Oi! I know you, and you’re supposed to have a cookie at this point!” ... well, that’s a whole nother field of interesting discussion.
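
To make the point about discarded cookies concrete, here is a toy simulation in Python. Nothing touches the network: `serve` is a made-up stand-in for a site, `Client` is a made-up stand-in for a visitor, and the cookie name `consent` is invented. It is only meant to show that the server can recognise a visitor solely from the cookie that comes back on a later request.

```python
from http.cookies import SimpleCookie


def serve(request_headers):
    """Toy server: return (body, response_headers) for one request."""
    jar = SimpleCookie(request_headers.get("Cookie", ""))
    if "consent" in jar:
        return "returning visitor", {}
    # No cookie came back, so this looks like a first visit: offer one.
    return "new visitor", {"Set-Cookie": "consent=yes"}


class Client:
    """Toy visitor that either stores cookies or quietly throws them away."""

    def __init__(self, keep_cookies):
        self.keep_cookies = keep_cookies
        self.stored = ""

    def get(self):
        headers = {"Cookie": self.stored} if self.stored else {}
        body, response_headers = serve(headers)
        if self.keep_cookies and "Set-Cookie" in response_headers:
            self.stored = response_headers["Set-Cookie"]
        return body


if __name__ == "__main__":
    browser = Client(keep_cookies=True)
    robot = Client(keep_cookies=False)
    print([browser.get(), browser.get()])
    print([robot.get(), robot.get()])
```

The robot receives the same Set-Cookie header as the browser on every request; it just never sends anything back, so the toy server keeps treating it as a new visitor.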

Besides, most of these cookie dialogs involve javascript, which by its nature only kicks in after the page has been sent out. (Once in a blue moon, I’ll do a search and find results whose snippet says in its entirety “This page requires Javascript” or similar. Someone, somewhere, didn’t think this through.)

keyplyr

1:14 am on Jun 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Best to block these thieves by both IP range & UA if they're anything like the Internet Archive [webmasterworld.com]
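
A rough Python sketch of the "both IP range & UA" idea, for anyone filtering at the application level rather than in server config. The 194.199.7.0/24 network is only a guess that covers the obscured address posted above; the range the bot really uses may be wider or narrower, so check your own logs first. `should_block` and the two lists are hypothetical names.

```python
import ipaddress

# Assumption: a /24 covering the obscured 194.199.7.** address from the first post.
BLOCKED_NETWORKS = [ipaddress.ip_network("194.199.7.0/24")]

# UA token taken from the reported user-agent string.
BLOCKED_UA_TOKENS = ["bnf.fr_bot"]


def should_block(remote_ip, user_agent):
    """Return True if the request matches a blocked network or a blocked UA token."""
    address = ipaddress.ip_address(remote_ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return True
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)


if __name__ == "__main__":
    print(should_block("194.199.7.1", "Mozilla/5.0"))
    print(should_block("203.0.113.5", "Mozilla/5.0 (compatible; bnf.fr_bot)"))
    print(should_block("203.0.113.5", "Mozilla/5.0"))
```

Checking both conditions means a request is refused if either one matches, so the block still holds if the bot changes its UA but keeps its addresses, or the other way round.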

They can reference any BS statutes they like, but using my intellectual property without my explicit permission is copyright infringement, period.

Thanks for the heads-up.

Anyone searched their "archive" to see if they've grabbed your site?

Dimitri

11:16 am on Jun 25, 2018 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



I don't think that will have any effect on robot

I meant that, as I understand it, it may become illegal in the near future for websites to block users who refuse cookies (maybe I misunderstood). If so, then one day governments could pass a law making it illegal to block their archiving robots. You never know.

keyplyr

11:29 am on Jun 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is the Bibliothèque nationale de France part of the French government?

It looks like a privately owned company with part of their services acting as some type of library.

Leosghost

11:36 am on Jun 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just based on a quick read of their site and their "missions" (I'm busy today so will confirm later): their crawler is only supposed to crawl French sites (those hosted in France; edited) that have registered with them via what we call the "dépôt légal" (legal deposit).
They refer to the DADVSI law of 1 August 2006, which created legal deposit for websites and which only applies to sites hosted in France. It basically treats all sites that are hosted here as the equivalent of online books, newspapers or pamphlets.
But I'll read the complete law later this evening, my time, and get back to the thread ASAIC.

Meanwhile..IIWY..I'd block with extreme prejudice.

Leosghost

12:02 pm on Jun 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Run under the auspices of the French government's Ministry of Culture; charges for some of its "services". Reading while I'm eating lunch.
Its crawler should only be crawling sites that are hosted in France, run by French citizens, or on .fr, and that have registered with the BNF (supposedly such registration is mandatory, to stop us from anonymously saying scurrilous things about our "betters"). If you are not French, not hosting in France, and not using a .fr, slam the door in its face with .htaccess.

lucy24

6:15 pm on Jun 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: idly thinking that if I were the bnf, my first priority would be upgrading their godawful old Gallica monochrome low-resolution scans so you could actually use their scanned books, and then worry about websites ten years down the line when that’s all taken care of ::

keyplyr

7:17 pm on Jun 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yeah, the higher-def screens get, the worse older low-def imagery looks.

lucy24

8:37 pm on Jun 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Gallica scans were always crap. They’re worse than google’s--about on par with the DLI scans that have shown up at TIA in the last few years. Even the best TIA scans (the ones with directory names in -rich or -iala) can’t compare to the scans of manuscripts at places like the British Library, where you can zoom in to almost any magnification. (Is it an expunctuation or is it a flyspeck? Only 3000 DPI can tell.) But for most purposes there’s a tradeoff between quality and bandwidth.