<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>3289</bug_id>
          
          <creation_ts>2006-05-17 00:53:10 +0000</creation_ts>
          <short_desc>utf8 web site causes tool to break</short_desc>
          <delta_ts>2006-05-25 22:27:52 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>LinkChecker</product>
          <component>checklink</component>
          <version>4.2.1</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Linux</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>major</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Bruce Altmann">bruce</reporter>
          <assigned_to name="Olivier Thereaux">ot</assigned_to>
          
          
          <qa_contact name="qa-dev tracking">www-validator-cvs</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>9813</commentid>
    <comment_count>0</comment_count>
    <who name="Bruce Altmann">bruce</who>
    <bug_when>2006-05-17 00:53:10 +0000</bug_when>
    <thetext>When checking a UTF-8 web site, the tool cannot handle the encoding and breaks. (Even worse, on some sites it appears to work, but it cannot read the content and thus completely misses things.)


First, the GET complains:

Parsing undecoded UTF-8 will create garbage in ...5.8.8/Protocols.pm line 114
(then it tries to read what it can)


Then (and this can be the first warning, if the header was raw enough to get by), at
&quot;Parsing...&quot;
(so clearly in &amp;parse_document)
it complains again:
&quot;Parsing undecoded UTF-8 will create garbage in checklink line #&quot;



search.cpan.org mentions this error in HTML::Parser
and says to Encode::encode_utf8 the input before calling parse
(but the example is a little sparse).
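
For illustration, here is a minimal sketch of that idea (this is hypothetical example code, not the actual checklink source; note it uses Encode::decode_utf8 rather than encode_utf8, on the assumption that the goal is to hand the parser decoded Perl characters instead of raw bytes):

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use HTML::Parser;

# Raw UTF-8 bytes as they might come back from the GET
# ("\xC3\xA9" is the two-byte UTF-8 encoding of e-acute).
my $raw_bytes = "\xC3\xA9 caf\xC3\xA9";

# Decode the bytes into a Perl character string before parsing,
# so HTML::Parser never sees undecoded UTF-8.
my $chars = decode_utf8($raw_bytes);

my $p = HTML::Parser->new(api_version => 3);
$p->parse($chars);
$p->eof;
```

With the input decoded up front like this, the &quot;undecoded UTF-8&quot; warning should not fire.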


Request:

Can you explain two things to me, for a possible code tweak on my part?

The code seems to know the encoding (the web version reports utf8).
Is this correct?  What part of the code identifies this? (Or does it just read it from the header?)

What two points in the code need to be told &quot;hey, this is utf8&quot;
(either from the code already knowing it, or from passing it in as a --encoding XXX command line arg)?
I assume something before the GET,
and something before the parse in &amp;parse_document.

-Bruce</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>9872</commentid>
    <comment_count>1</comment_count>
    <who name="Bruce Altmann">bruce</who>
    <bug_when>2006-05-19 22:11:56 +0000</bug_when>
    <thetext>Example URL where this breaks
http://www.amd.com/us-en/

How does it break:
It does not read this entire URI page, only parts of it.
Recursive checking (-r) then stops at this one page.

Another example URL
http://www.amd.com/gb-uk/

It does a bit better on this tree, but with similar issues:
it has problems reading the Server header, and it does not read the entire URI page, so it does not check all links or all linked pages in recursive mode.

(See the note in the original comment about HTML::Parser discussing this issue on search.cpan.org.)
</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>9892</commentid>
    <comment_count>2</comment_count>
    <who name="Ville Skyttä">ville.skytta</who>
    <bug_when>2006-05-25 08:54:20 +0000</bug_when>
    <thetext>The checking of http://www.amd.com/us-en/ does indeed seem to end halfway through on validator.w3.org.  However, that doesn&apos;t happen on qa-dev or my local box, even though they display the UTF-8 garbage warning too.  Enabling UTF-8 mode in HTML::Parser avoids the warning, but doesn&apos;t appear to fix the actual problem.

Based on that, I&apos;m inclined towards blaming Perl or HTML::Parser and reassigning to Olivier for comments about upgrade possibilities.  Related versions:
- validator.w3.org: perl 5.8.4, HTML::Parser 3.45
- qa-dev.w3.org: perl 5.8.8, HTML::Parser 3.54
- my local box: perl 5.8.8, HTML::Parser 3.51

Anyway, I have enabled UTF-8 mode in the CVS version of the link checker.  If I understand the docs correctly, it should be a good thing to do in any case.
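
For reference, enabling that mode looks roughly like the following (an illustrative sketch, not the actual CVS change; the text handler here is made up for the example). With utf8_mode enabled, HTML::Parser accepts raw UTF-8 bytes directly, which avoids the &quot;undecoded UTF-8&quot; warning:

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect the text chunks the parser hands back, so we can see
# that it processed the raw bytes without complaint.
my @text_chunks;
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @text_chunks, shift }, 'dtext' ],
);

# Tell the parser the input is raw UTF-8 bytes, not decoded characters.
$p->utf8_mode(1);
$p->parse("\xC3\xA9 is a raw UTF-8 byte sequence");
$p->eof;
```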

Reproducers:
http://validator.w3.org/checklink?uri=http%3A%2F%2Fwww.amd.com%2Fus-en%2F&amp;hide_type=all&amp;depth=&amp;check=Check
http://qa-dev.w3.org/wlc/checklink?uri=http%3A%2F%2Fwww.amd.com%2Fus-en%2F&amp;hide_type=all&amp;depth=&amp;check=Check</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>9898</commentid>
    <comment_count>3</comment_count>
    <who name="Olivier Thereaux">ot</who>
    <bug_when>2006-05-25 22:27:52 +0000</bug_when>
    <thetext>That was a correct diagnosis, Ville. Upgrading HTML::Parser on both validator.w3.org servers seems to have done the job.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>