[Bug 9628] New: "willful violation" for detecting the charset

http://www.w3.org/Bugs/Public/show_bug.cgi?id=9628

           Summary: "willful violation" for detecting the charset
           Product: HTML WG
           Version: unspecified
          Platform: PC
        OS/Version: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec bugs
        AssignedTo: dave.null@w3.org
        ReportedBy: julian.reschke@gmx.de
         QAContact: public-html-bugzilla@w3.org
                CC: ian@hixie.ch, mike@w3.org, public-html@w3.org


From http://dev.w3.org/html5/spec/infrastructure.html#content-type-sniffing:

"The algorithm for extracting an encoding from a Content-Type, given a string
s, is as follows. It either returns an encoding or nothing.

   1. Find the first seven characters in s that are an ASCII case-insensitive
      match for the word "charset". If no such match is found, return nothing.

   2. Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that
      immediately follow the word "charset" (there might not be any).

   3. If the next character is not a U+003D EQUALS SIGN ('='), return nothing
      and abort these steps.

   4. Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that
      immediately follow the equals sign (there might not be any).

   5. Process the next character as follows:

      If it is a U+0022 QUOTATION MARK ('"') and there is a later U+0022
      QUOTATION MARK ('"') in s
      If it is a U+0027 APOSTROPHE ("'") and there is a later U+0027
      APOSTROPHE ("'") in s
          Return the encoding corresponding to the string between this
          character and the next earliest occurrence of this character.

      If it is an unmatched U+0022 QUOTATION MARK ('"')
      If it is an unmatched U+0027 APOSTROPHE ("'")
      If there is no next character
          Return nothing.

      Otherwise
          Return the encoding corresponding to the string from this
          character to the first U+0009, U+000A, U+000C, U+000D, U+0020, or
          U+003B character or the end of s, whichever comes first.

Note: This requirement is a willful violation of the HTTP specification,
motivated by the need for backwards compatibility with legacy content. [HTTP]"
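
For reference, the quoted steps can be sketched in Python roughly as follows.
This is only an illustration of my reading of the text, not anything taken
from the spec or from a browser: the function name extract_charset_label is
made up, and the final step of mapping the extracted label to an actual
encoding is left out, so the sketch returns the raw label (or None for
"nothing").

```python
from typing import Optional

# U+0009, U+000A, U+000C, U+000D, U+0020: the characters steps 2 and 4 skip.
SKIPPED = "\t\n\x0c\r "

# ASCII-only lowercasing, so the search in step 1 ignores case for A-Z only
# and never changes the length of the string.
_ASCII_LOWER = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"
)


def extract_charset_label(s: str) -> Optional[str]:
    """Return the raw charset label found in s, or None for "nothing"."""
    # Step 1: first ASCII case-insensitive match for "charset".
    start = s.translate(_ASCII_LOWER).find("charset")
    if start == -1:
        return None
    i = start + len("charset")

    # Step 2: skip whitespace after the word "charset".
    while i < len(s) and s[i] in SKIPPED:
        i += 1

    # Step 3: the next character must be an equals sign.
    if i >= len(s) or s[i] != "=":
        return None
    i += 1

    # Step 4: skip whitespace after the equals sign.
    while i < len(s) and s[i] in SKIPPED:
        i += 1

    # Step 5: no next character means "nothing".
    if i >= len(s):
        return None

    # Step 5: double- or single-quoted value; an unmatched quote is "nothing".
    if s[i] in "\"'":
        end = s.find(s[i], i + 1)
        return None if end == -1 else s[i + 1:end]

    # Step 5, otherwise: bare value up to whitespace, ";" or the end of s.
    end = i
    while end < len(s) and s[end] not in SKIPPED + ";":
        end += 1
    return s[i:end]
```

Note that the sketch accepts charset='utf-8' (single quotes) exactly like
charset="utf-8", which is the behaviour questioned in (3) below.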

General problems:

(1) This algorithm doesn't seem to be used.

(2) It's VERY unfriendly to the reader to claim that there's a violation of the
HTTP spec without saying what it is.

Specific problems:

(3) The algorithm requires allowing single quotes; this is indeed a violation
of the HTTP syntax. I just checked with IE8; it doesn't allow single quotes.
Thus, the claim "needed for backwards compatibility" appears to be incorrect.

(4) The spec also violates HTTP in that the backslash character inside quoted
values isn't treated as an escape character (HTTP's quoted-pair). If this is
needed "for compatibility", this should be backed up with data.
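
To make (4) concrete, here is a small Python sketch contrasting the two
readings of a quoted value. Both function names are invented for this
illustration. The first decodes a quoted-string the way the HTTP grammar
describes it (a backslash escapes the following character); the second just
takes everything up to the next quote character, as step 5 of the quoted
algorithm does.

```python
def http_quoted_string_value(s: str, start: int) -> str:
    """Decode an HTTP quoted-string whose opening quote is at s[start]."""
    if s[start] != '"':
        raise ValueError("expected an opening double quote")
    out = []
    i = start + 1
    while i < len(s):
        ch = s[i]
        if ch == "\\" and i + 1 < len(s):
            # quoted-pair: the backslash is dropped, the next character kept.
            out.append(s[i + 1])
            i += 2
        elif ch == '"':
            return "".join(out)
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated quoted-string")


def next_quote_value(s: str, start: int) -> str:
    """Return the text between s[start] and the next occurrence of it."""
    end = s.find(s[start], start + 1)
    if end == -1:
        raise ValueError("unmatched quote")
    return s[start + 1:end]


# The parameter as it would appear on the wire: charset="a\"b"
param = 'charset="a\\"b"'
quote_at = param.index('"')

http_value = http_quoted_string_value(param, quote_at)
html5_value = next_quote_value(param, quote_at)
```

With this input the HTTP reading yields the three characters a, ", b, while
the next-quote reading stops at the escaped quote and yields a followed by a
backslash.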


Received on Friday, 30 April 2010 14:49:50 UTC