This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 893 - cache (non) existence
Summary: cache (non) existence
Status: RESOLVED FIXED
Alias: None
Product: LinkChecker
Classification: Unclassified
Component: checklink
Version: 4.0
Hardware: All
OS: All
Importance: P1 normal
Target Milestone: 4.1
Assignee: Ville Skyttä
QA Contact: qa-dev tracking
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-09-27 17:05 UTC by Mark Ludwig
Modified: 2004-11-07 09:48 UTC
CC List: 0 users

See Also:


Description Mark Ludwig 2004-09-27 17:05:01 UTC
128.30.52.13 - - [27/Sep/2004:11:45:30 -0400] "GET /robots.txt HTTP/1.1" 200 26

Why did your site probe the robots.txt file on my server
ublib.buffalo.edu 120 times this morning between 8:55 and 11:45?

It has been doing this since last week, and it keeps probing regardless of
whether the robots.txt file exists.
Comment 1 Olivier Thereaux 2004-10-06 02:00:00 UTC
Most likely this is the Link Checker doing this, not the CSS validator.
Our site is not "probing" yours. 

Someone (possibly someone local to you, or someone with a site linking to yours) is certainly checking 
links to your site, and the link checker is following the robots exclusion protocol and doing your server 
a favor in doing so.

That said, a possible enhancement would be for the link checker to cache the
existence (or absence) of robots.txt for a given site instead of querying for
it again and again.

Reassigning to proper product and owner.
Comment 2 Ville Skyttä 2004-10-11 22:13:49 UTC
Right, the /robots.txt fetches should be cached, and as far as the low-level
implementation (LWP::RobotUA) is concerned, they _are_ cached.

But in the current link checker codebase we're instantiating several
W3C::UserAgent (a subclass of LWP::RobotUA) objects per link checker run, and
the /robots.txt information cache is not shared between these instances by
default; instead, every one of them maintains its own small cache, which in
practice results in very little caching, if any :(
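To illustrate the effect (a minimal sketch, not checklink code; the agent name
and contact address are made up):

#!/usr/bin/perl
# Each LWP::RobotUA instance keeps its own WWW::RobotRules cache, so every
# fresh instance re-fetches /robots.txt before its first request to a host.
use strict;
use warnings;
use LWP::RobotUA;

my $url = 'http://ublib.buffalo.edu/';    # host from the original report

for my $i (1 .. 3) {
    # One user agent per checked link means one empty robots.txt cache each.
    my $ua = LWP::RobotUA->new('checklink-sketch/0.1', 'webmaster@example.org');
    $ua->delay(0);                        # skip the politeness delay for the demo
    my $res = $ua->head($url);            # triggers a GET /robots.txt first
    printf "request %d: %s\n", $i, $res->status_line;
}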

The real fix would be to instantiate exactly one W3C::UserAgent per link checker
run and use that for fetching all links (unless we want to do parallel fetching
sometime), but that is a very intrusive change and will most likely have to wait
until the next major link checker version.

However, I believe it is possible to come up with an interim solution by
managing a "global" WWW::RobotRules object ourselves and passing that to all
instantiated UserAgents.  I'll look into it.
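A minimal sketch of that interim idea, assuming LWP::RobotUA's documented
support for passing a rules object to its constructor (agent name and contact
address are made up; as comment 3 below notes, the most trivial variant of
this was blocked by an upstream WWW::RobotRules bug at the time):

#!/usr/bin/perl
# One shared WWW::RobotRules object handed to every user agent, so each host's
# /robots.txt is fetched and parsed only once per run and the verdict reused.
use strict;
use warnings;
use LWP::RobotUA;
use WWW::RobotRules;

# A single rules cache for the whole link checker run.
my $shared_rules = WWW::RobotRules->new('checklink-sketch/0.1');

sub new_agent {
    # LWP::RobotUA takes the rules object as its third constructor argument.
    my $ua = LWP::RobotUA->new('checklink-sketch/0.1',
                               'webmaster@example.org',
                               $shared_rules);
    $ua->delay(0);
    return $ua;
}

# Two separate agents, but only the first request has to fetch /robots.txt.
for my $i (1 .. 2) {
    my $res = new_agent()->head('http://ublib.buffalo.edu/');
    printf "agent %d: %s\n", $i, $res->status_line;
}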
Comment 3 Ville Skyttä 2004-10-12 10:36:04 UTC
Turns out that the most trivial of the workarounds is not possible due to a
bug in upstream WWW::RobotRules.  A fix for that has already been sent to the
libwww-perl mailing list, no comments yet; will think about other workaround
alternatives in the meantime.
Comment 4 Ville Skyttä 2004-11-07 09:48:05 UTC
Fixed in CVS by using the same W3C::UserAgent instance for all retrievals.
It ain't pretty, but it works...
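Roughly the shape of that fix, as a sketch only (not the actual CVS change;
W3C::UserAgent is approximated here by its LWP::RobotUA parent class, and the
example URLs and contact address are made up):

#!/usr/bin/perl
# A single user agent for the whole run, so its robots.txt cache is reused
# for every link checked against the same host.
use strict;
use warnings;
use LWP::RobotUA;

my $ua = LWP::RobotUA->new('checklink-sketch/0.1', 'webmaster@example.org');
$ua->delay(0);

for my $link ('http://ublib.buffalo.edu/', 'http://ublib.buffalo.edu/example') {
    printf "%s: %s\n", $link, $ua->head($link)->status_line;
}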