This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 40 - Charset defaulting behaviour.
Summary: Charset defaulting behaviour.
Status: RESOLVED INVALID
Alias: None
Product: Validator
Classification: Unclassified
Component: check
Version: 0.6.0b1
Hardware: Other other
Importance: P2 normal
Target Milestone: ---
Assignee: Terje Bless
QA Contact:
URL: http://crism.maden.org/
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2002-10-26 19:22 UTC by Terje Bless
Modified: 2002-10-26 23:44 UTC

Description Terje Bless 2002-10-26 19:22:05 UTC
Reported by Christopher R. Maden:

When unable to detect an encoding, the new validator should use the
prescribed defaults, which I believe still means ISO-8859-1 for text/html
over HTTP, and UTF-8 or UTF-16 for XHTML documents uploaded directly.

With the simple interface, validating <URL: http://crism.maden.org/ > 
reports that it is unable to detect the encoding, including using Appendix 
F of XML 1.0.  Using Appendix F is inappropriate for a document delivered
over HTTP, since the HTTP headers take precedence (and thus it should be
interpreted as ISO-8859-1), but even so, using the Appendix F algorithm
should result in a determination of UTF-8.  Either way, since this page is
7-bit ASCII, the validation ought to work.
Comment 1 Terje Bless 2002-10-26 19:44:53 UTC
The HTTP specification does indeed specify ISO-8859-1 as the default value in
the absence of a "charset" parameter in the Content-Type header. However, HTTP
and HTML 4.01 are in direct conflict here, as the latter proscribes any
assumption about a default character encoding. And since a file upload is still
an HTTP transaction, although we do not normally think of it that way, the same
applies to any file upload with a text/html media type.
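
As a rough illustration of the conflict described above (not the validator's actual code), the following Python sketch separates "what the Content-Type header explicitly declares" from "what a default would supply". The function names and the assume_http_default flag are hypothetical, introduced only for this example.

```python
from email.message import Message

# Default named by the HTTP specification when no charset parameter is given.
HTTP_DEFAULT = "iso-8859-1"


def declared_charset(content_type):
    """Return the explicit charset parameter (lower-cased), or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    value = msg.get_param("charset")
    return value.lower() if isinstance(value, str) else None


def effective_charset(content_type, assume_http_default=False):
    """Fall back to the HTTP default only when the caller opts in.

    With assume_http_default=False this follows the HTML 4.01 position
    of not assuming any default encoding.
    """
    explicit = declared_charset(content_type)
    if explicit is None and assume_http_default:
        return HTTP_DEFAULT
    return explicit
```

Under these assumptions, a bare "text/html" header yields None unless the caller opts in to the HTTP default, which is the situation in which a fallback such as the Appendix F algorithm would be tried.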

The algorithm in Appendix F of the XML Recommendation describes ways to attempt
to automatically detect the character encoding in use in the absence of
information from a higher level protocol. Since the HTTP transaction contained
no encoding information, we attempted the Appendix F algorithm. That algorithm,
however, is intended for XML, and as such it requires the presence of either a
Unicode Byte Order Mark or an XML declaration. In particular, if there is no
BOM, we look for the byte patterns that represent the characters "<?xml" in
various encodings.
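
A minimal Python sketch of the detection described above, assuming the byte patterns listed in Appendix F of XML 1.0; it is an illustration only, not the validator's implementation. The function name sniff_xml_encoding and the returned labels are hypothetical.

```python
import codecs

# Byte Order Marks. UTF-32 is checked before UTF-16 because the
# UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

# First four bytes of "<?xml" in various encoding families (no BOM).
XML_DECL_PREFIXES = [
    (b"\x00\x00\x00\x3c", "UTF-32BE"),
    (b"\x3c\x00\x00\x00", "UTF-32LE"),
    (b"\x00\x3c\x00\x3f", "UTF-16BE"),
    (b"\x3c\x00\x3f\x00", "UTF-16LE"),
    # ASCII-compatible family: the encoding declaration must still be read.
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]


def sniff_xml_encoding(data):
    """Return an encoding family guessed from the first bytes, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    for prefix, name in XML_DECL_PREFIXES:
        if data.startswith(prefix):
            return name
    # Neither a BOM nor an XML declaration was found.
    return None
```

In this sketch, a plain ASCII HTML page that starts with a DOCTYPE or an <html> tag matches none of the patterns, so the function returns None, which corresponds to the "unable to detect the encoding" outcome in the report.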