Forum Moderators: goodroi
Any URI based transfer protocol can use robots.txt. For example, it's not limited to HTTP anymore and can be used for FTP or CoAP as well. Developers must parse at least the first 500 kibibytes of a robots.txt. Defining a maximum file size ensures that connections are not open for too long, alleviating unnecessary strain on servers. A new maximum caching time of 24 hours or cache directive value if available, gives website owners the flexibility to update their robots.txt whenever they want, and crawlers aren't overloading websites with robots.txt requests. For example, in the case of HTTP, Cache-Control headers could be used for determining caching time. The specification now provisions that when a previously accessible robots.txt file becomes inaccessible due to server failures, known disallowed pages are not crawled for a reasonably long period of time.
...we open sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files. This library has been around for 20 years and it contains pieces of code that were written in the 90's. Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.
We also included a testing tool in the open source package to help you test a few rules. Once built, the usage is very straightforward...