<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>893</bug_id>
          
          <creation_ts>2004-09-27 17:05:01 +0000</creation_ts>
          <short_desc>cache (non) existence</short_desc>
          <delta_ts>2004-11-07 09:48:05 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>LinkChecker</product>
          <component>checklink</component>
          <version>4.0</version>
          <rep_platform>All</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P1</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>4.1</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Mark Ludwig">uldmjl</reporter>
          <assigned_to name="Ville Skyttä">ville.skytta</assigned_to>
          
          
          <qa_contact name="qa-dev tracking">www-validator-cvs</qa_contact>

          <comment_sort_order>oldest_to_newest</comment_sort_order>
          <long_desc isprivate="0" >
    <commentid>2373</commentid>
    <comment_count>0</comment_count>
    <who name="Mark Ludwig">uldmjl</who>
    <bug_when>2004-09-27 17:05:01 +0000</bug_when>
    <thetext>128.30.52.13 - - [27/Sep/2004:11:45:30 -0400] &quot;GET /robots.txt HTTP/1.1&quot; 200 26

Why did your site probe the robots.txt file on my server
ublib.buffalo.edu 120 times this morning between 8:55 and 11:45?

It has been doing this since last week, and it keeps probing regardless of
whether the robots.txt file exists.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>2470</commentid>
    <comment_count>1</comment_count>
    <who name="Olivier Thereaux">ot</who>
    <bug_when>2004-10-06 02:00:00 +0000</bug_when>
    <thetext>Most likely this is the Link Checker doing this, not the CSS validator.
Our site is not &quot;probing&quot; yours. 

Someone (possibly someone local to you, or someone with a site linking to yours) is certainly checking 
links to your site, and the link checker is following the robots exclusion protocol and doing your server 
a favor in doing so.

That said, a possible enhancement would be for the link checker to cache the existence
(or absence) of robots.txt for a given site instead of querying for it again and again.

Reassigning to proper product and owner.</thetext>
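    <!--
      A minimal, hypothetical sketch (not taken from the checklink source; the
      agent string and contact address are made up) of how an LWP::RobotUA-based
      client such as the link checker honours the robots exclusion protocol:
      /robots.txt is fetched once per host and consulted before real requests.

      use strict;
      use warnings;
      use LWP::RobotUA;

      # Constructor arguments: agent string and contact address, both made up here.
      my $ua = LWP::RobotUA->new('W3C-checklink-example/0.1', 'webmaster@example.org');
      $ua->delay(1/60);    # wait at least 1 second between requests to a host (unit is minutes)

      # Before the first real request to a host, LWP::RobotUA fetches and parses
      # that host's /robots.txt, caches the parsed rules in its internal
      # WWW::RobotRules object, and answers disallowed URLs with a 403 response
      # without contacting the server.
      my $response = $ua->get('http://ublib.buffalo.edu/');
      print $response->status_line, "\n";
    -->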
  </long_desc><long_desc isprivate="0" >
    <commentid>2499</commentid>
    <comment_count>2</comment_count>
    <who name="Ville Skyttä">ville.skytta</who>
    <bug_when>2004-10-11 22:13:49 +0000</bug_when>
    <thetext>Right, the /robots.txt fetches should be cached, and as far as the low-level
implementation (LWP::RobotUA) is concerned, they _are_ cached.

But in the current link checker codebase, we&apos;re instantiating several
W3C::UserAgent (a subclass of LWP::RobotUA) objects per link checker run, and
the /robots.txt information cache is not shared between these instances by
default; instead, every one of them maintains its own small cache, which in
practice results in very little caching, if any :(

The real fix would be to instantiate exactly one W3C::UserAgent per link checker
run and use that for fetching all links (unless we want to do parallel fetching
sometime), but that is a very intrusive change and will most likely have to wait
until the next major link checker version.

However, I believe it is possible to come up with an interim solution by
managing a &quot;global&quot; WWW::RobotRules object ourselves and passing that to all
instantiated UserAgents.  I&apos;ll look into it.</thetext>
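    <!--
      A minimal sketch of the interim approach described above: one shared
      WWW::RobotRules object passed to every LWP::RobotUA instance, so the
      parsed /robots.txt rules survive across user agent instantiations.
      The names below are illustrative, not the actual checklink code.

      use strict;
      use warnings;
      use LWP::RobotUA;
      use WWW::RobotRules;

      my $agent_name   = 'W3C-checklink-example/0.1';    # illustrative agent string
      my $shared_rules = WWW::RobotRules->new($agent_name);

      sub new_user_agent {
          # Every user agent gets the same rules object (third constructor
          # argument) instead of keeping its own private, short-lived cache.
          return LWP::RobotUA->new($agent_name, 'webmaster@example.org', $shared_rules);
      }

      my $ua1 = new_user_agent();
      my $ua2 = new_user_agent();
      # Both instances now consult and populate $shared_rules, so /robots.txt
      # is fetched at most once per host per link checker run.
    -->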
  </long_desc><long_desc isprivate="0" >
    <commentid>2507</commentid>
    <comment_count>3</comment_count>
    <who name="Ville Skyttä">ville.skytta</who>
    <bug_when>2004-10-12 10:36:04 +0000</bug_when>
    <thetext>It turns out that the most trivial of the workarounds is not possible due to
a bug in upstream WWW::RobotRules.  A fix for that has already been sent to the
libwww-perl mailing list, with no comments yet; I&apos;ll think about other
workaround alternatives in the meantime.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>2594</commentid>
    <comment_count>4</comment_count>
    <who name="Ville Skyttä">ville.skytta</who>
    <bug_when>2004-11-07 09:48:05 +0000</bug_when>
    <thetext>Fixed in CVS by using the same W3C::UserAgent instance for all retrievals.
It ain&apos;t pretty, but it works...</thetext>
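    <!--
      A minimal sketch of the idea behind this fix: one robot user agent per run,
      with every retrieval routed through it, so its robots.txt cache is shared
      by all requests. The real fix uses checklink's W3C::UserAgent; plain
      LWP::RobotUA and made-up names stand in for it here.

      use strict;
      use warnings;
      use LWP::RobotUA;

      my $ua;    # the single user agent instance for the whole run

      sub user_agent {
          $ua ||= LWP::RobotUA->new('W3C-checklink-example/0.1', 'webmaster@example.org');
          return $ua;
      }

      for my $url (@ARGV) {
          my $response = user_agent()->get($url);
          printf "%s: %s\n", $url, $response->status_line;
      }
    -->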
  </long_desc>

    </bug>

</bugzilla>