<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>40</bug_id>
          
          <creation_ts>2002-10-26 19:22:05 +0000</creation_ts>
          <short_desc>Charset defaulting behaviour.</short_desc>
          <delta_ts>2002-10-26 23:44:53 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>Validator</product>
          <component>check</component>
          <version>0.6.0b1</version>
          <rep_platform>Other</rep_platform>
          <op_sys>other</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>INVALID</resolution>
          
          
          <bug_file_loc>http://crism.maden.org/</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Terje Bless">link</reporter>
          <assigned_to name="Terje Bless">link</assigned_to>
          
          
          

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>65</commentid>
    <comment_count>0</comment_count>
    <who name="Terje Bless">link</who>
    <bug_when>2002-10-26 19:22:05 +0000</bug_when>
    <thetext>Reported by Christopher R. Maden:

When unable to detect an encoding, the new validator should use the
prescribed defaults, which I believe still means ISO-8859-1 for text/html
over HTTP, and UTF-8 or UTF-16 for XHTML documents uploaded directly.

With the simple interface, validating &lt;URL: http://crism.maden.org/ &gt;
reports that it is unable to detect the encoding, even after trying Appendix
F of XML 1.0.  Using Appendix F is inappropriate for a document delivered
over HTTP, since the HTTP headers take precedence (and thus it should be
interpreted as ISO-8859-1), but even so, using the Appendix F algorithm
should result in a determination of UTF-8.  Either way, since this page is
7-bit ASCII, the validation ought to work.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66</commentid>
    <comment_count>1</comment_count>
    <who name="Terje Bless">link</who>
    <bug_when>2002-10-26 19:44:53 +0000</bug_when>
    <thetext>The HTTP specification does indeed specify ISO-8859-1 as the default value in
the absence of a &quot;charset&quot; parameter in the Content-Type header. However, HTTP
and HTML 4.01 are in direct conflict here, as the latter proscribes any
assumption about a default character encoding. And since a file upload is still
an HTTP transaction, although we do not normally think of it that way, the same
applies to any file upload with a text/html media type.
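
As a rough illustration only (this is not the validator's actual code, and
the helper name declared_charset is hypothetical), checking whether a
Content-Type value carries an explicit "charset" parameter, without assuming
any default when it is missing, could be sketched in Python like this:

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration. No default is assumed when the
    parameter is absent; the caller decides what to do in that case.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset, or None if absent.
    return msg.get_content_charset()


print(declared_charset("text/html; charset=ISO-8859-1"))
print(declared_charset("text/html"))
```

A None result corresponds to the case discussed here: the transaction itself
supplied no encoding information.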

The algorithm in Appendix F of the XML Recommendation describes ways to attempt
to automatically detect the character encoding in use in the absence of
information from a higher-level protocol. Since the HTTP transaction contained
no encoding information, we attempted the Appendix F algorithm. That algorithm,
however, is intended for XML, and as such it requires the presence of either a
Unicode Byte Order Mark or an XML declaration. In particular, if there is no
BOM, we look for the byte patterns that represent the characters &quot;&lt;?xml&quot; in
various encodings.
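
The detection described above can be sketched roughly as follows. This is a
simplified Python illustration under stated assumptions, not the validator's
own code: the function name sniff_encoding and the returned labels are
hypothetical, and the byte tables cover only the common cases from Appendix F.

```python
# Byte Order Marks, four-byte forms first so that a UCS-4 BOM is not
# mistaken for a UTF-16 one.
BOMS = [
    (b"\x00\x00\xfe\xff", "UCS-4 (big-endian)"),
    (b"\xff\xfe\x00\x00", "UCS-4 (little-endian)"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16 (big-endian)"),
    (b"\xff\xfe", "UTF-16 (little-endian)"),
]

# First four bytes of an XML declaration in various encodings.
# 0x3C 0x3F 0x78 0x6D are the opening angle bracket, "?", "x" and "m".
XML_DECL = [
    (b"\x00\x00\x00\x3c", "UCS-4 (big-endian)"),
    (b"\x3c\x00\x00\x00", "UCS-4 (little-endian)"),
    (b"\x00\x3c\x00\x3f", "UTF-16 (big-endian)"),
    (b"\x3c\x00\x3f\x00", "UTF-16 (little-endian)"),
    # A real implementation would go on to read the encoding declaration.
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]


def sniff_encoding(data):
    """Return a label for the detected encoding family, or None."""
    for table in (BOMS, XML_DECL):
        for prefix, label in table:
            if data.startswith(prefix):
                return label
    # Neither a BOM nor an XML declaration: nothing can be concluded.
    return None


print(sniff_encoding(b"\xef\xbb\xbf" + b"\x3c?xml version='1.0'?"))
print(sniff_encoding(b"\x3c!DOCTYPE HTML PUBLIC"))
```

In this sketch, a document that starts with a DOCTYPE and has neither a BOM
nor an XML declaration yields None, which is the "unable to detect" outcome.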
</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>