
Bug 2346 - Allow link validator to check links to the Link and Markup Validators
Summary: Allow link validator to check links to the Link and Markup Validators
Status: NEW
Alias: None
Product: LinkChecker
Classification: Unclassified
Component: checklink
Version: unspecified
Hardware: All
OS: All
Importance: P2 normal
Target Milestone: ---
Assignee: Olivier Thereaux
QA Contact: qa-dev tracking
URL: http://validator.w3.org/robots.txt
Whiteboard:
Keywords: Usability
Depends on:
Blocks:
 
Reported: 2005-10-17 13:20 UTC by Otto Stolz
Modified: 2013-11-03 07:34 UTC
CC List: 4 users

See Also:


Attachments

Description Otto Stolz 2005-10-17 13:20:57 UTC
The W3C link validator balks on any link to http://validator.w3.org/checklink,
or to http://validator.w3.org/check; quote:
  http://validator.w3.org/checklink?uri=...&hide_type=all&depth=&check=Check
    What to do: The link was not checked due to robots exclusion rules.
    Check the link manually.
    Response status code: (N/A)
    Response message: Forbidden by robots.txt
    Line: 425
http://validator.w3.org/check?url=...&outline=
    What to do: The link was not checked due to robots exclusion rules.
    Check the link manually.
    Response status code: (N/A)
    Response message: Forbidden by robots.txt
    Line: 413

Note that the HTML validator even recommends including, in the
pages to be checked, a link to itself; yet, the link checker on
the very same domain does not check those recommended links.

Please include in http://validator.w3.org/robots.txt the following code:
  User-Agent: W3C-checklink
  Disallow:
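
For reference, a minimal sketch (using Python's standard urllib.robotparser)
of how such an entry would behave: an empty Disallow allows everything for
that agent. The "User-Agent: *" group below is only a hypothetical stand-in
for whatever rules the real http://validator.w3.org/robots.txt contains.
---
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: W3C-checklink
Disallow:

User-Agent: *
Disallow: /check
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

url = "http://validator.w3.org/check?uri=referer"
print(rp.can_fetch("W3C-checklink", url))  # True  - the specific group allows everything
print(rp.can_fetch("SomeOtherBot", url))   # False - still governed by the catch-all group
---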
Comment 1 Ville Skyttä 2005-11-17 17:25:45 UTC
This isn't an issue with the link checker itself, but rather with the
validator.w3.org configuration.  Maybe Olivier has an opinion on this?
Comment 2 Otto Stolz 2006-07-07 09:54:24 UTC
After more than 8 months, this simple, yet important, entry
in http://validator.w3.org/robots.txt is still missing!

(Though, meanwhile, the error message points to
<http://validator.w3.org/docs/checklink.html#bot>,
where you document what you should have done before.)
Comment 3 Olivier Thereaux 2006-07-10 06:49:38 UTC
  User-Agent: W3C-checklink
  Disallow:
... would not be a very good idea if it opens the door to denial-of-service attacks on the link checker through recursive requests.
Comment 4 Otto Stolz 2006-07-18 16:32:19 UTC
How could a robots.txt entry influence the link checker's handling of recursive
requests, at all? Under normal circumstances, a link checker will find many
identical links in its input; so it certainly will keep a list of links already
checked, and any sort of recursive link structure will not be able to get the
link checker into an infinite recursion, or loop.
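
A minimal sketch of that deduplication, assuming a breadth-first checker
and a hypothetical fetch_links() helper that retrieves a page and extracts
its links (neither is W3C-checklink's actual code):
---
from collections import deque

def check_links(start_url, fetch_links, max_depth=2):
    visited = set()                     # every URL is checked at most once
    results = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        links = fetch_links(url)        # hypothetical: fetch the page, extract links
        results[url] = links
        for link in links:
            queue.append((link, depth + 1))
    return results

# Toy usage: page "c" links back to "a", yet the walk still terminates.
toy_site = {"a": ["b", "c"], "b": [], "c": ["a"]}
print(check_links("a", lambda url: toy_site.get(url, [])))
---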

What I am asking for has nothing to do with the size of the link checker's task;
it simply tells the link checker not to balk on links (from client pages) to the
link checker. Note that your own documentation recommends placing such links
in the client pages -- yet, your link checker balks on them.

If you are concerned about links pointing into your pages, beyond your
link checker, you certainly can disallow link-checking into your private
directories.
Comment 5 Etienne Miret 2008-09-01 09:02:24 UTC
> How could a robots.txt entry influence the link checker's handling of recursive
> requests, at all? Under normal circumstances, a link checker will find many
> identical links in its input; so it certainly will keep a list of links already
> checked, and any sort of recursive link structure will not be able to get the
> link checker into an infinite recursion, or loop.
The *same* instance of the link checker will remember previously visited links. But if the link checker sends a HEAD request to itself, it will start a new instance of itself, with no links cached.

The following page, if located at http://example.com/recursive, will trigger an infinite loop:
---
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head><title>Infinite Link Checker Loop</title></head>
<body><p>
<a href="http://validator.w3.org/checklink?uri=http://example.com/recursive&amp;hide_type=all&amp;depth=&amp;check=Check"
>Check this page's links</a>
</p></body></html>
---
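
A toy model of that cross-instance loop, with spawn_checker() standing in
for one fresh invocation of the service (hypothetical, and given a budget
so the example terminates rather than recursing forever):
---
def spawn_checker(page_url, budget):
    visited = set()        # fresh, empty state on every invocation
    visited.add(page_url)
    if budget == 0:
        return "gave up: in reality this would keep spawning new instances"
    # The page links back to checklink with itself as the target, so checking
    # it (over HTTP) starts yet another instance that repeats the same work:
    return spawn_checker(page_url, budget - 1)

print(spawn_checker("http://example.com/recursive", budget=5))
---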

The case of links to the Markup Validator is different. As long as the Markup Validator isn't able to start instances of itself or of the Link Checker, there is no risk of recursion, and thus no risk of infinite loops. However, if each page of a website contains a validation link <http://validator.w3.org/check?uri=referer>, a recursive link check of the site will trigger markup validation of *all* its pages, which does not seem desirable. While markup validation of a full website is surely worthwhile, the Link Checker is not the appropriate tool for it; the Log Validator is.

Thus, I suggest marking this bug as WONTFIX.