<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>3289</bug_id>
          
          <creation_ts>2006-05-17 00:53:10 +0000</creation_ts>
          <short_desc>utf8 web site causes tool to break</short_desc>
          <delta_ts>2006-05-25 22:27:52 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>LinkChecker</product>
          <component>checklink</component>
          <version>4.2.1</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Linux</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>major</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Bruce Altmann">bruce</reporter>
          <assigned_to name="Olivier Thereaux">ot</assigned_to>
          
          
          <qa_contact name="qa-dev tracking">www-validator-cvs</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>9813</commentid>
    <comment_count>0</comment_count>
    <who name="Bruce Altmann">bruce</who>
    <bug_when>2006-05-17 00:53:10 +0000</bug_when>
    <thetext>When checking a UTF-8 web site, the tool cannot handle the encoding and breaks. (Even worse, on some sites it appears to work, but it cannot read the content and thus completely misses things.)


First, the GET complains:

Parsing undecoded UTF-8 will create garbage in ...5.8.8/Protocols.pm line 114
(then it tries to read what it can)


Then (and this can be the first warning, if the header was raw enough to get by), at
&quot;Parsing...&quot;
(so clearly in &amp;parse_document)
it complains again:
&quot;Parsing undecoded UTF-8 will create garbage in checklink line #&quot;



search.cpan.org mentions this error in HTML::Parser
and says to Encode::encode_utf8 the input before calling parse
(but the example is a little sparse).
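
For illustration, here is a minimal sketch of that idea (this is hypothetical example code, not the actual checklink source; note it uses Encode::decode_utf8 rather than encode_utf8, on the assumption that the goal is to hand the parser decoded Perl characters instead of raw bytes):

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use HTML::Parser;

# Raw UTF-8 bytes as they might come back from the GET
# ("\xC3\xA9" is the two-byte UTF-8 encoding of e-acute).
my $raw_bytes = "\xC3\xA9 caf\xC3\xA9";

# Decode the bytes into a Perl character string before parsing,
# so HTML::Parser never sees undecoded UTF-8.
my $chars = decode_utf8($raw_bytes);

my $p = HTML::Parser->new(api_version => 3);
$p->parse($chars);
$p->eof;
```

With the input decoded up front like this, the &quot;undecoded UTF-8&quot; warning should not fire.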


Request:

Can you explain two things to me, for a possible code tweak on my part?

The code seems to know the encoding (the web version reports utf8).
Is this correct?  What part of the code identifies this? (Or does it just read it from the header?)

What two points in the code need to be told &quot;hey, this is utf8&quot;
(either from the code already knowing it, or from passing it in as a --encoding XXX command line arg)?
I assume something before the GET,
and something before the parse in &amp;parse_document.

-Bruce</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>9872</commentid>
    <comment_count>1</comment_count>
    <who name="Bruce Altmann">bruce</who>
    <bug_when>2006-05-19 22:11:56 +0000</bug_when>
    <thetext>Example URL where this breaks
http://www.amd.com/us-en/

How does it break:
It does not read this entire URI page, only parts of it.
Recursive checking (-r) then stops at this one page.

Another example URL
http://www.amd.com/gb-uk/

It does a bit better on this tree, but with similar issues:
it has problems reading the Server header, and it does not read the entire URI page, so it does not check all links or all linked pages in recursive mode.

(See the note in the original comment about HTML::Parser discussing this issue on search.cpan.org.)
</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>9892</commentid>
    <comment_count>2</comment_count>
    <who name="Ville Skyttä">ville.skytta</who>
    <bug_when>2006-05-25 08:54:20 +0000</bug_when>
    <thetext>The checking of http://www.amd.com/us-en/ does indeed seem to end halfway through on validator.w3.org.  However, that doesn&apos;t happen on qa-dev or my local box, even though they display the UTF-8 garbage warning too.  Enabling UTF-8 mode in HTML::Parser avoids the warning, but doesn&apos;t appear to fix the actual problem.

Based on that, I&apos;m inclined towards blaming Perl or HTML::Parser and reassigning to Olivier for comments about upgrade possibilities.  Related versions:
- validator.w3.org: perl 5.8.4, HTML::Parser 3.45
- qa-dev.w3.org: perl 5.8.8, HTML::Parser 3.54
- my local box: perl 5.8.8, HTML::Parser 3.51

Anyway, I have enabled UTF-8 mode in the CVS version of the link checker.  If I understand the docs correctly, it should be a good thing to do in any case.
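
For reference, enabling that mode looks roughly like the following (an illustrative sketch, not the actual CVS change; the text handler here is made up for the example). With utf8_mode enabled, HTML::Parser accepts raw UTF-8 bytes directly, which avoids the &quot;undecoded UTF-8&quot; warning:

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect the text chunks the parser hands back, so we can see
# that it processed the raw bytes without complaint.
my @text_chunks;
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @text_chunks, shift }, 'dtext' ],
);

# Tell the parser the input is raw UTF-8 bytes, not decoded characters.
$p->utf8_mode(1);
$p->parse("\xC3\xA9 is a raw UTF-8 byte sequence");
$p->eof;
```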

Reproducers:
http://validator.w3.org/checklink?uri=http%3A%2F%2Fwww.amd.com%2Fus-en%2F&amp;hide_type=all&amp;depth=&amp;check=Check
http://qa-dev.w3.org/wlc/checklink?uri=http%3A%2F%2Fwww.amd.com%2Fus-en%2F&amp;hide_type=all&amp;depth=&amp;check=Check</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>9898</commentid>
    <comment_count>3</comment_count>
    <who name="Olivier Thereaux">ot</who>
    <bug_when>2006-05-25 22:27:52 +0000</bug_when>
    <thetext>That was a correct diagnosis, Ville. Upgrading HTML::Parser on both validator.w3.org servers seems to have done the job.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>