This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 893 - cache (non) existence
Summary: cache (non) existence
Status: RESOLVED FIXED
Alias: None
Product: LinkChecker
Classification: Unclassified
Component: checklink
Version: 4.0
Hardware: All
OS: All
Importance: P1 normal
Target Milestone: 4.1
Assignee: Ville Skyttä
QA Contact: qa-dev tracking
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-09-27 17:05 UTC by Mark Ludwig
Modified: 2004-11-07 09:48 UTC
CC List: 0 users

See Also:


Description Mark Ludwig 2004-09-27 17:05:01 UTC
128.30.52.13 - - [27/Sep/2004:11:45:30 -0400] "GET /robots.txt HTTP/1.1" 200 26

Why did your site probe the robots.txt file on my server
ublib.buffalo.edu 120 times this morning between 8:55 and 11:45?

It has been doing this since last week, and it keeps probing regardless of
whether the robots.txt file exists.
Comment 1 Olivier Thereaux 2004-10-06 02:00:00 UTC
Most likely this is the Link Checker doing this, not the CSS validator.
Our site is not "probing" yours. 

Someone (possibly someone local to you, or someone with a site linking to yours) is certainly checking 
links to your site, and the link checker is following the robots exclusion protocol and doing your server 
a favor in doing so.

That said, a possible enhancement would be for the link checker to cache the
existence (or absence) of robots.txt for a given site instead of querying for
it again and again.

Reassigning to proper product and owner.
Comment 2 Ville Skyttä 2004-10-11 22:13:49 UTC
Right, the /robots.txt fetches should be cached, and as far as the low-level
implementation (LWP::RobotUA) is concerned, they _are_ cached.

But in the current link checker codebase we're instantiating several
W3C::UserAgent (a subclass of LWP::RobotUA) objects per link checker run, and
the /robots.txt information cache is not shared between these instances by
default; instead, every one of them maintains its own small cache, which in
practice results in very little caching, if any :(
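To illustrate the effect (a minimal sketch, not checklink code; the agent name
and contact address are made up):

#!/usr/bin/perl
# Each LWP::RobotUA instance keeps its own WWW::RobotRules cache, so every
# fresh instance re-fetches /robots.txt before its first request to a host.
use strict;
use warnings;
use LWP::RobotUA;

my $url = 'http://ublib.buffalo.edu/';    # host from the original report

for my $i (1 .. 3) {
    # One user agent per checked link means one empty robots.txt cache each.
    my $ua = LWP::RobotUA->new('checklink-sketch/0.1', 'webmaster@example.org');
    $ua->delay(0);                        # skip the politeness delay for the demo
    my $res = $ua->head($url);            # triggers a GET /robots.txt first
    printf "request %d: %s\n", $i, $res->status_line;
}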

The real fix would be to instantiate exactly one W3C::UserAgent per link checker
run and use that for fetching all links (unless we want to do parallel fetching
sometime), but that is a very intrusive change and will most likely have to wait
until the next major link checker version.

However, I believe it is possible to come up with an interim solution by
managing a "global" WWW::RobotRules object ourselves and passing that to all
instantiated UserAgents.  I'll look into it.
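A minimal sketch of that interim idea, assuming LWP::RobotUA's documented
support for passing a rules object to its constructor (agent name and contact
address are made up; as comment 3 below notes, the most trivial variant of
this was blocked by an upstream WWW::RobotRules bug at the time):

#!/usr/bin/perl
# One shared WWW::RobotRules object handed to every user agent, so each host's
# /robots.txt is fetched and parsed only once per run and the verdict reused.
use strict;
use warnings;
use LWP::RobotUA;
use WWW::RobotRules;

# A single rules cache for the whole link checker run.
my $shared_rules = WWW::RobotRules->new('checklink-sketch/0.1');

sub new_agent {
    # LWP::RobotUA takes the rules object as its third constructor argument.
    my $ua = LWP::RobotUA->new('checklink-sketch/0.1',
                               'webmaster@example.org',
                               $shared_rules);
    $ua->delay(0);
    return $ua;
}

# Two separate agents, but only the first request has to fetch /robots.txt.
for my $i (1 .. 2) {
    my $res = new_agent()->head('http://ublib.buffalo.edu/');
    printf "agent %d: %s\n", $i, $res->status_line;
}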
Comment 3 Ville Skyttä 2004-10-12 10:36:04 UTC
Turns out that the most trivial of the workarounds is not possible due to a
bug in upstream WWW::RobotRules.  A fix for that has already been sent to the
libwww-perl mailing list, no comments yet; will think about other workaround
alternatives in the meantime.
Comment 4 Ville Skyttä 2004-11-07 09:48:05 UTC
Fixed in CVS by using the same W3C::UserAgent instance for all retrievals.
It ain't pretty, but it works...
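Roughly the shape of that fix, as a sketch only (not the actual CVS change;
W3C::UserAgent is approximated here by its LWP::RobotUA parent class, and the
example URLs and contact address are made up):

#!/usr/bin/perl
# A single user agent for the whole run, so its robots.txt cache is reused
# for every link checked against the same host.
use strict;
use warnings;
use LWP::RobotUA;

my $ua = LWP::RobotUA->new('checklink-sketch/0.1', 'webmaster@example.org');
$ua->delay(0);

for my $link ('http://ublib.buffalo.edu/', 'http://ublib.buffalo.edu/example') {
    printf "%s: %s\n", $link, $ua->head($link)->status_line;
}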