<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>2346</bug_id>
          
          <creation_ts>2005-10-17 13:20:52 +0000</creation_ts>
          <short_desc>Allow link checker to check links to the Link and Markup Validators</short_desc>
          <delta_ts>2013-11-03 07:34:40 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>LinkChecker</product>
          <component>checklink</component>
          <version>unspecified</version>
          <rep_platform>All</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>NEW</bug_status>
          <resolution></resolution>
          
          
          <bug_file_loc>http://validator.w3.org/robots.txt</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords>Usability</keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Otto Stolz">Otto.Stolz</reporter>
          <assigned_to name="Olivier Thereaux">ot</assigned_to>
          <cc>astuart</cc>
    
          <cc>elimerl</cc>
          <cc>gonzo1lee</cc>
          <cc>sporosbe</cc>
          
          <qa_contact name="qa-dev tracking">www-validator-cvs</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>6709</commentid>
    <comment_count>0</comment_count>
    <who name="Otto Stolz">Otto.Stolz</who>
    <bug_when>2005-10-17 13:20:57 +0000</bug_when>
    <thetext>The W3C link validator balks on any link to http://validator.w3.org/checklink,
or to http://validator.w3.org/check; quote:
  http://validator.w3.org/checklink?uri=...&amp;hide_type=all&amp;depth=&amp;check=Check
    What to do: The link was not checked due to robots exclusion rules.
    Check the link manually.
    Response status code: (N/A)
    Response message: Forbidden by robots.txt
    Line: 425
http://validator.w3.org/check?url=...&amp;outline=
    What to do: The link was not checked due to robots exclusion rules.
    Check the link manually.
    Response status code: (N/A)
    Response message: Forbidden by robots.txt
    Line: 413

Note that the HTML validator even recommends including, in the
pages to be checked, a link to itself; yet the link checker on
the very same domain does not check those recommended links.

Please include in http://validator.w3.org/robots.txt the following code:
  User-Agent: W3C-checklink
  Disallow:
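
For illustration, a checker that honours robots.txt would then behave roughly
like this (a minimal sketch using Python&apos;s urllib.robotparser; the actual
checklink is Perl-based, so this only approximates its behaviour):
  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url(&quot;http://validator.w3.org/robots.txt&quot;)
  rp.read()

  # With the proposed &quot;User-Agent: W3C-checklink / Disallow:&quot; entry in place,
  # this returns True and the link gets checked instead of being skipped.
  print(rp.can_fetch(&quot;W3C-checklink&quot;, &quot;http://validator.w3.org/checklink?uri=http://example.com/&quot;))</thetext>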
  </long_desc><long_desc isprivate="0" >
    <commentid>7215</commentid>
    <comment_count>1</comment_count>
    <who name="Ville Skyttä">ville.skytta</who>
    <bug_when>2005-11-17 17:25:45 +0000</bug_when>
    <thetext>This isn&apos;t an issue with the link checker itself, but rather validator.w3.org  
configuration.  Maybe Olivier has an opinion on this? </thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>10421</commentid>
    <comment_count>2</comment_count>
    <who name="Otto Stolz">Otto.Stolz</who>
    <bug_when>2006-07-07 09:54:24 +0000</bug_when>
    <thetext>After more than 8 months, this simple, yet important, entry
in http://validator.w3.org/robots.txt is still missing!

(Though, meanwhile, the error message points to
&lt;http://validator.w3.org/docs/checklink.html#bot&gt;,
where you document what you shhould have done, before.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>10436</commentid>
    <comment_count>3</comment_count>
    <who name="Olivier Thereaux">ot</who>
    <bug_when>2006-07-10 06:49:38 +0000</bug_when>
    <thetext>  User-Agent: W3C-checklink
  Disallow:
... would not be a very good idea if it opened the door to DoSing the link checker through recursive requests.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>10565</commentid>
    <comment_count>4</comment_count>
    <who name="Otto Stolz">Otto.Stolz</who>
    <bug_when>2006-07-18 16:32:19 +0000</bug_when>
    <thetext>How could a robots.txt entry influence the link checker&apos;s handling of recursive
requests, at all? Under normal circumstances, a link checker will find many
identical links in its input; so it certainly will keep a list of links already
checked, and any sort of recursive link structure will not be able to get the
link checker into an infinite recursion, or loop.
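
For illustration, the bookkeeping I mean is roughly the following (a minimal
sketch in plain Python, not the actual checklink code; extract_links stands in
for whatever routine fetches a page and collects its links):
  from collections import deque

  def check_links(start_url, extract_links):
      visited = set()
      queue = deque([start_url])
      while queue:
          url = queue.popleft()
          if url in visited:
              continue          # already checked: any cycle ends here
          visited.add(url)
          for link in extract_links(url):
              if link not in visited:
                  queue.append(link)
      return visited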

What I am asking for has nothing to do with the size of the link checker&apos;s task;
it simply tells the link checker not to balk on links (from client pages) to the
link checker. Note that your own documentation recommends placing such links
in the client pages -- yet your link checker balks on them.

If you are concerned about links pointing into your pages beyond your
link checker, you can certainly disallow link checking into your private
directories.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>21714</commentid>
    <comment_count>5</comment_count>
    <who name="Etienne Miret">elimerl</who>
    <bug_when>2008-09-01 09:02:24 +0000</bug_when>
    <thetext>&gt; How could a robots.txt entry influence the link checker&apos;s handling of recursive
&gt; requests, at all? Under normal circumstances, a link checker will find many
&gt; identical links in its input; so it certainly will keep a list of links already
&gt; checked, and any sort of recursive link structure will not be able to get the
&gt; link checker into an infinite recursion, or loop.
The *same* instance of the link checker will remember previously visited links. But when the link checker sends a HEAD request to itself, it starts a new instance of itself, with no links cached.

The following page, if located at http://example.com/recursive, will trigger an infinite loop:
---
&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD HTML 4.01//EN&quot; &quot;http://www.w3.org/TR/html4/strict.dtd&quot;&gt;
&lt;html&gt;
&lt;head&gt;&lt;title&gt;Infinite Link Checker Loop&lt;/title&gt;&lt;/head&gt;
&lt;body&gt;&lt;p&gt;
&lt;a href=&quot;http://validator.w3.org/checklink?uri=http://example.com/recursive&amp;amp;hide_type=all&amp;amp;depth=&amp;amp;check=Check&quot;
&gt;Check this page&apos;s links&lt;/a&gt;
&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;
---

The case of links to the MarkUp Validator is different. As long as the MarkUp Validator isn&apos;t able to start instances of itself or of the Link Checker, there is no risk of recursion, and thus no risk of infinite loops. However, if each page of a website contains a validation link &lt;http://validator.w3.org/check?uri=referer&gt;, a recursive link check of the site will trigger markup validation of *all* its pages, which doesn&apos;t seem desirable. While markup validation of a full website is surely desirable in itself, the Link Checker is not the appropriate tool for it; the Log Validator is.

Thus, I suggest marking this bug as WONTFIX.
</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>