This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 3289 - utf8 web site causes tool to break
Summary: utf8 web site causes tool to break
Status: RESOLVED FIXED
Alias: None
Product: LinkChecker
Classification: Unclassified
Component: checklink
Version: 4.2.1
Hardware: PC Linux
Importance: P2 major
Target Milestone: ---
Assignee: Olivier Thereaux
QA Contact: qa-dev tracking
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-05-17 00:53 UTC by Bruce Altmann
Modified: 2006-05-25 22:27 UTC
0 users

See Also:


Attachments

Description Bruce Altmann 2006-05-17 00:53:10 UTC
When checking a utf8 web site, the tool cannot handle the encoding and breaks. (Even worse: on some sites it appears to work, but cannot read the content, and thus completely misses things.)


First the GET complains:

Parsing undecoded UTF-8 will create garbage in ...5.8.8/Protocols.pm line 114
(then it tries to read what it can)


Then (and this can be the first warning, if the header was plain enough to get by), right after the "Parsing..." progress message (so clearly in &parse_document), it complains again:
"Parsing undecoded UTF-8 will create garbage in checklink line #."



search.cpan.org mentions this error in the HTML::Parser documentation
and says to Encode::encode_utf8 the content before calling parse
(but the example there is a little sparse).
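Not part of the original report, but a minimal sketch of the two workarounds the HTML::Parser docs describe (the HTML string and the text handler here are made up for illustration): either decode the bytes into Perl characters before parsing, or leave the bytes raw and enable the parser's utf8_mode.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser;

# Made-up sample: raw UTF-8 bytes, as fetched off the wire.
my $raw = "<p>na\xc3\xafve</p>";

# Option 1: decode the bytes into Perl characters before parsing,
# so HTML::Parser never sees undecoded UTF-8.
my $decoded_text = '';
my $p1 = HTML::Parser->new(
    text_h => [ sub { $decoded_text .= shift }, 'dtext' ],
);
$p1->parse(decode('UTF-8', $raw));
$p1->eof;

# Option 2 (HTML::Parser >= 3.40): keep the raw bytes and turn on
# utf8_mode so the parser interprets the UTF-8 sequences itself;
# handler arguments are then reported as UTF-8 bytes.
my $raw_text = '';
my $p2 = HTML::Parser->new(
    text_h => [ sub { $raw_text .= shift }, 'dtext' ],
);
$p2->utf8_mode(1);
$p2->parse($raw);
$p2->eof;
```

Either way the "Parsing undecoded UTF-8 will create garbage" warning goes away; the difference is whether downstream code receives characters (option 1) or UTF-8 bytes (option 2).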


Request:

Can you explain two things to me, for a possible code tweak on my part?

The code seems to know the encoding (the web version reports utf8).
Is this correct? What part of the code identifies this (or does it just read it from the header)?

What two points in the code need to be told "hey, this is utf8"
(either from the code already knowing, or by passing it in as a --encoding XXX command-line arg)?
I assume something before the GET,
and something before the parse in &parse_document.

-Bruce
Comment 1 Bruce Altmann 2006-05-19 22:11:56 UTC
Example URL where this breaks
http://www.amd.com/us-en/

How it breaks:
It does not read this entire URI page, only parts.
Recursive mode (-r) then stops at this one page.

Another example URL
http://www.amd.com/gb-uk/

It does a bit better on this tree, but with similar issues:
it has problems reading the Server header, and it does not read the entire URI page, thus failing to check all links (or, in recursive mode, all linked pages).





(see the note in the original comment about the HTML::Parser documentation on search.cpan.org discussing this issue)



Comment 2 Ville Skyttä 2006-05-25 08:54:20 UTC
The checking of http://www.amd.com/us-en/ does indeed seem to end halfway through on validator.w3.org. However, that doesn't happen on qa-dev or my local box, even though they display the UTF-8 garbage warning too. Enabling UTF-8 mode in HTML::Parser avoids the warning, but doesn't appear to fix the actual problem.

Based on that, I'm inclined to blame Perl or HTML::Parser, and am reassigning to Olivier for comments about upgrade possibilities. Related versions:
- validator.w3.org: perl 5.8.4, HTML::Parser 3.45
- qa-dev.w3.org: perl 5.8.8, HTML::Parser 3.54
- my local box: perl 5.8.8, HTML::Parser 3.51

Anyway, I have enabled UTF-8 mode in the CVS version of the link checker. If I understand the docs correctly, it should be a good thing to do in any case.
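For the record, and not taken from the checklink sources, here is a hedged sketch of what the two-sided fix looks like with current libwww-perl: utf8_mode covers the parse side, and HTTP::Message's decoded_content() covers the GET side by decoding the body per the Content-Type charset. The canned response below stands in for what LWP would actually return.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
use HTTP::Response;

# GET side: a made-up response standing in for a real LWP fetch.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "<p>caf\xc3\xa9</p>",
);
# decoded_content() yields Perl characters, decoded per the charset
# declared in the Content-Type header.
my $chars = $res->decoded_content;

# Parse side: with utf8_mode enabled, feeding raw bytes no longer
# triggers the "Parsing undecoded UTF-8 will create garbage" warning.
my $text = '';
my $p = HTML::Parser->new(
    text_h => [ sub { $text .= shift }, 'dtext' ],
);
$p->utf8_mode(1);
$p->parse($res->content);   # content() returns the raw, undecoded bytes
$p->eof;
```

Which side to fix in checklink itself is Ville's call; the point is that the warning can be silenced at either of the two places the original report asked about.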

Reproducers:
http://validator.w3.org/checklink?uri=http%3A%2F%2Fwww.amd.com%2Fus-en%2F&hide_type=all&depth=&check=Check
http://qa-dev.w3.org/wlc/checklink?uri=http%3A%2F%2Fwww.amd.com%2Fus-en%2F&hide_type=all&depth=&check=Check
Comment 3 Olivier Thereaux 2006-05-25 22:27:52 UTC
That was a correct diagnosis, Ville. Upgrade of HTML::Parser on both servers of validator.w3.org seems to have done the job.