40 – Charset defaulting behaviour.

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 40 - Charset defaulting behaviour.

Status:	RESOLVED INVALID

Alias:	None

Product:	Validator
Classification:	Unclassified
Component:	check
Version:	0.6.0b1
Hardware:	Other other

Importance:	P2 normal
Target Milestone:	---
Assignee:	Terje Bless

URL:	http://crism.maden.org/

Reported:	2002-10-26 19:22 UTC by Terje Bless
Modified:	2002-10-26 23:44 UTC
CC List:	0 users

Description Terje Bless 2002-10-26 19:22:05 UTC

Reported by Christopher R. Maden:

When unable to detect an encoding, the new validator should use the 
prescribed defaults, which I believe still means ISO8859-1 for text/html 
over HTTP, and UTF-8 or UTF-16 for XHTML documents uploaded directly.

With the simple interface, validating <URL: http://crism.maden.org/ > 
reports that it is unable to detect the encoding, including using Appendix 
F of XML 1.0.  Using Appendix F is inappropriate for a document delivered 
over HTTP, since the HTTP headers take precedence (and thus it should be 
interpreted as ISO8859-1), but even so, using the Appendix F algorithm 
should result in a determination of UTF-8.  Either way, since this page is 
7-bit ASCII, the validation ought to work.

Comment 1 Terje Bless 2002-10-26 19:44:53 UTC

The HTTP specification does indeed specify ISO-8859-1 as the default value in
the absence of a "charset" parameter in the Content-Type header. However, HTTP
and HTML 4.01 are in direct conflict here, as the latter proscribes any
assumption about a default character encoding. And since a file upload is still
an HTTP transaction, although we do not normally think of it that way, the same
applies to any file upload with a text/html media type.
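
To make the conflict concrete, here is a minimal Python sketch of the two
readings. It is not the validator's code; the function names and the
assume_http_default flag are hypothetical, chosen only for illustration. With
the flag off (the HTML 4.01 reading) a missing "charset" parameter yields no
encoding at all; with it on (the HTTP reading) it yields ISO-8859-1.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None when the
    # header carries no charset parameter.
    return msg.get_content_charset()


def effective_charset(content_type, assume_http_default=False):
    """Hypothetical helper: apply the HTTP default only when asked to."""
    charset = declared_charset(content_type)
    if charset is None and assume_http_default:
        return "iso-8859-1"  # default named by the HTTP specification
    return charset  # may be None: no assumption is made (HTML 4.01)


print(effective_charset("text/html; charset=UTF-8"))
print(effective_charset("text/html"))
print(effective_charset("text/html", assume_http_default=True))
```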

The algorithm in Appendix F of the XML Recommendation describes ways to attempt
to automatically detect the character encoding in use in the absence of
information from a higher-level protocol. Since the HTTP transaction contained
no encoding information, we attempted the Appendix F algorithm. That algorithm,
however, is intended for XML, and as such it requires the presence of either a
Unicode Byte Order Mark or an XML Declaration. In particular, if there is no
BOM, we look for the byte patterns that represent the characters "<?xml" in
various encodings.
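
The detection described above can be sketched roughly as follows. This is an
illustration in Python, not the validator's implementation: the function name,
the table names, and the returned labels are assumptions, and only the first
four bytes are examined. When neither a BOM nor a "<?xm" prefix is found, the
sketch returns None instead of choosing a default, which matches the behaviour
described in this comment.

```python
# Byte Order Marks. The four-byte UCS-4 marks come first because the UCS-4
# little-endian mark begins with the UTF-16 little-endian mark.
BOMS = [
    (b"\x00\x00\xfe\xff", "UCS-4BE"),
    (b"\xff\xfe\x00\x00", "UCS-4LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

# How the start of "<?xml" appears in the first four bytes when no BOM is
# present, per encoding family.
XML_DECL_PREFIXES = [
    (b"\x00\x00\x00\x3c", "UCS-4BE"),
    (b"\x3c\x00\x00\x00", "UCS-4LE"),
    (b"\x00\x3c\x00\x3f", "UTF-16BE"),
    (b"\x3c\x00\x3f\x00", "UTF-16LE"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),  # read the encoding declaration
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]


def sniff_xml_encoding(data):
    """Guess an encoding family from the leading bytes, or return None."""
    for signature, label in BOMS + XML_DECL_PREFIXES:
        if data.startswith(signature):
            return label
    return None  # no BOM and no XML declaration: nothing to go on


print(sniff_xml_encoding(b"<?xml version='1.0'?><html/>"))
print(sniff_xml_encoding(b"<!DOCTYPE html><html></html>"))
```

A plain ASCII HTML page with no BOM and no XML declaration falls through every
entry in both tables, which is why detection fails for such a page.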