The W3C link validator balks at any link to http://validator.w3.org/checklink or to http://validator.w3.org/check; quote:

  http://validator.w3.org/checklink?uri=...&hide_type=all&depth=&check=Check
    What to do: The link was not checked due to robots exclusion rules. Check the link manually.
    Response status code: (N/A)
    Response message: Forbidden by robots.txt
    Line: 425

  http://validator.w3.org/check?url=...&outline=
    What to do: The link was not checked due to robots exclusion rules. Check the link manually.
    Response status code: (N/A)
    Response message: Forbidden by robots.txt
    Line: 413

Note that the HTML validator even recommends including, in the pages to be checked, a link to itself; yet the link checker on the very same domain does not check those recommended links. Please include the following in http://validator.w3.org/robots.txt:

  User-Agent: W3C-checklink
  Disallow:
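For illustration only, here is a minimal sketch (not the actual W3C-checklink code) of how a robots.txt-aware client would treat the entry requested above; the URLs and the W3C-checklink user-agent string come from the report, everything else is hypothetical:

---
# Minimal sketch of robots.txt handling, not the actual W3C-checklink code.
# It shows that the requested entry (empty Disallow for W3C-checklink)
# would let the link checker fetch URLs under validator.w3.org.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents containing the requested entry.
ROBOTS_TXT = """\
User-Agent: W3C-checklink
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

for url in ("http://validator.w3.org/checklink",
            "http://validator.w3.org/check"):
    # With an empty Disallow, can_fetch() returns True for W3C-checklink,
    # so the checker would no longer report "Forbidden by robots.txt".
    print(url, rp.can_fetch("W3C-checklink", url))
---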
This isn't an issue with the link checker itself, but rather with the validator.w3.org configuration. Maybe Olivier has an opinion on this?
After more than 8 months, this simple yet important entry in http://validator.w3.org/robots.txt is still missing! (Though, meanwhile, the error message now points to <http://validator.w3.org/docs/checklink.html#bot>, where you document exactly what you should have done in the first place.)
  User-Agent: W3C-checklink
  Disallow:

... would not be a very good idea if it opens the door to denial-of-service attacks on the link checker through recursive requests.
How could a robots.txt entry influence the link checker's handling of recursive requests at all? Under normal circumstances, a link checker will find many identical links in its input, so it certainly keeps a list of links already checked, and no recursive link structure can drive it into an infinite recursion or loop.

What I am asking for has nothing to do with the size of the link checker's task; it simply tells the link checker not to balk at links (from client pages) to the link checker itself. Note that your own documentation recommends placing such links in the client pages -- yet your link checker balks at them. If you are concerned about links pointing into your pages beyond the link checker, you can certainly disallow link checking into your private directories.
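As a hypothetical sketch of the "list of links already checked" argument above (not the actual W3C link checker), deduplication within a single run terminates even on cyclic link structures:

---
# Minimal, hypothetical sketch of a single link-checker run; it is not the
# actual W3C link checker.
def check_links(start_page, links_on):
    """links_on maps a page URL to the URLs it links to."""
    checked = set()          # links already checked in this run
    queue = [start_page]
    while queue:
        url = queue.pop()
        if url in checked:   # a link seen before is never checked again,
            continue         # so cyclic link structures cannot cause a loop
        checked.add(url)
        queue.extend(links_on.get(url, ()))
    return checked

# Two pages linking to each other: the run still terminates.
site = {"http://example.com/a": ["http://example.com/b"],
        "http://example.com/b": ["http://example.com/a"]}
print(check_links("http://example.com/a", site))
---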
> How could a robots.txt entry influence the link checker's handling of
> recursive requests at all? Under normal circumstances, a link checker will
> find many identical links in its input, so it certainly keeps a list of
> links already checked, and no recursive link structure can drive it into an
> infinite recursion or loop.

The *same* instance of the link checker will remember previously visited links. But if the link checker sends a HEAD request to itself, it starts a new instance of itself, with no links cached. The following page, if located at http://example.com/recursive, will trigger an infinite loop:

---
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head><title>Infinite Link Checker Loop</title></head>
<body><p>
<a href="http://validator.w3.org/checklink?uri=http://example.com/recursive&hide_type=all&depth=&check=Check"
>Check this page's links</a>
</p></body>
</html>
---

The case of links to the MarkUp Validator is different. As long as the MarkUp Validator cannot start instances of itself or of the Link Checker, there is no risk of recursion, and thus no risk of infinite loops. However, if each page of a website contains a validation link <http://validator.w3.org/check?uri=referer>, recursive link checking of the site will trigger markup validation of *all* its pages, which does not seem desirable. While performing markup validation of a full website is surely desirable, the Link Checker is not the appropriate tool for that; the Log Validator is. Thus, I suggest marking this bug as WONTFIX.
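The loop described above can be modelled in a few lines. The following is a hypothetical simulation (not the W3C link checker's code) in which each checker instance keeps its own visited set, yet a page linking to checklink?uri=<itself> still recurses without bound; a depth cut-off is added only so the demo terminates:

---
# Hypothetical model of the cross-instance recursion described above; it is
# not the actual W3C link checker. Each "instance" has its own visited set,
# so that set cannot stop a checklink URL from spawning a fresh instance.
CHECKLINK = "http://validator.w3.org/checklink?uri="
RECURSIVE_PAGE = "http://example.com/recursive"

# The recursive page links back to the checker, pointing at itself.
LINKS = {RECURSIVE_PAGE: [CHECKLINK + RECURSIVE_PAGE]}

def run_instance(page, depth=0, max_depth=5):
    """One checker instance with its own, initially empty, visited set."""
    visited = set()
    for link in LINKS.get(page, ()):
        if link in visited:
            continue
        visited.add(link)
        if link.startswith(CHECKLINK):
            # A HEAD request to the checker starts a brand-new instance
            # whose visited set is empty again -- hence the infinite loop.
            if depth >= max_depth:          # cut-off only for this demo
                print("still recursing at depth", depth)
                return
            run_instance(link[len(CHECKLINK):], depth + 1, max_depth)

run_instance(RECURSIVE_PAGE)
---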