This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
From http://dev.w3.org/html5/spec/infrastructure.html#content-type-sniffing:

"The algorithm for extracting an encoding from a Content-Type, given a string s, is as follows. It either returns an encoding or nothing.

1. Find the first seven characters in s that are an ASCII case-insensitive match for the word "charset". If no such match is found, return nothing.
2. Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that immediately follow the word "charset" (there might not be any).
3. If the next character is not a U+003D EQUALS SIGN ('='), return nothing and abort these steps.
4. Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that immediately follow the equals sign (there might not be any).
5. Process the next character as follows:

   If it is a U+0022 QUOTATION MARK ('"') and there is a later U+0022 QUOTATION MARK ('"') in s
   If it is a U+0027 APOSTROPHE ("'") and there is a later U+0027 APOSTROPHE ("'") in s
      Return the encoding corresponding to the string between this character and the next earliest occurrence of this character.

   If it is an unmatched U+0022 QUOTATION MARK ('"')
   If it is an unmatched U+0027 APOSTROPHE ("'")
   If there is no next character
      Return nothing.

   Otherwise
      Return the encoding corresponding to the string from this character to the first U+0009, U+000A, U+000C, U+000D, U+0020, or U+003B character or the end of s, whichever comes first.

Note: This requirement is a willful violation of the HTTP specification, motivated by the need for backwards compatibility with legacy content. [HTTP]"

General problems:

(1) This algorithm doesn't seem to be used.

(2) It's VERY unfriendly to the reader to claim that there's a violation of the HTTP spec without saying what it is.

Specific problems:

(3) The algorithm requires allowing single quotes; this is indeed a violation of the HTTP syntax. I just checked with IE8; it doesn't allow single quotes. Thus, the claim "needed for backwards compatibility" appears to be incorrect.
(4) The spec also violates HTTP in that the backslash character inside quoted values isn't treated properly. If this is needed "for compatibility", this should be backed up with data.
(5) The algorithm fails to parse headers that have an additional parameter starting with the letters "charset", such as Content-Type: text/plain; charsetfoo=bar; charset=UTF-8 (or a preceding parameter that happens to have "charset" as value).
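For readers who want to try the cases discussed here, the quoted steps can be sketched in Python roughly as follows. This is only an illustration of the five steps as quoted, not code from any browser or from the spec: the function name `extract_label` is made up, and it returns the raw label string instead of mapping it to an encoding.

```python
import re

# U+0009, U+000A, U+000C, U+000D, U+0020, as listed in steps 2 and 4.
WHITESPACE = "\t\n\x0c\r "
# ASCII case-insensitive match for the word "charset" (step 1).
CHARSET_WORD = re.compile("charset", re.IGNORECASE | re.ASCII)


def extract_label(s):
    """Hypothetical helper: return the charset label string, or None."""
    match = CHARSET_WORD.search(s)          # step 1: first match only
    if match is None:
        return None
    pos = match.end()
    while pos < len(s) and s[pos] in WHITESPACE:   # step 2
        pos += 1
    if pos >= len(s) or s[pos] != "=":             # step 3
        return None
    pos += 1
    while pos < len(s) and s[pos] in WHITESPACE:   # step 4
        pos += 1
    if pos >= len(s):                              # step 5: no next character
        return None
    quote = s[pos]
    if quote in "\"'":
        end = s.find(quote, pos + 1)
        if end == -1:                              # unmatched quote
            return None
        return s[pos + 1:end]
    end = pos
    while end < len(s) and s[end] not in WHITESPACE + ";":
        end += 1
    return s[pos:end]


print(extract_label("text/html; charset=UTF-8"))
print(extract_label("text/plain; charsetfoo=bar; charset=UTF-8"))
```

Under this reading, the second call returns nothing: the first match for "charset" is the start of `charsetfoo`, the next character is not an equals sign, and step 3 aborts before the real `charset` parameter is reached.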
(In reply to comment #0)
> General problems:
>
> (1) This algorithm doesn't seem to be used.

OK, I just noticed that the back ref tool tip only works as expected when using the single-page version. So this is used, and thus fixing the other issues is even more important.
Safari's (closed source) network back-end parses Content-Type for charset in a manner that's different from both RFC2616 and the current HTML5 spec. We also have a separate parser in WebKit that's used in a limited set of circumstances (only for XMLHttpRequest.overrideMimeType and for correcting a MIME type set via XMLHttpRequest.setRequestHeader).

But I am not aware of any evidence against following RFC2616 to the letter when getting charset out of Content-Type.

> Content-Type: text/plain; charsetfoo=bar; charset=UTF-8

Also, this algorithm fails to ignore charset in "Content-Type: text/plain; foocharset=UTF-8".
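To make "following RFC2616 to the letter" concrete, here is a rough Python sketch of parameter-by-parameter parsing based on one reading of the RFC 2616 grammar (token, quoted-string, quoted-pair). It is not Safari's or WebKit's code; the name `charset_param` and the decision to give up on the first malformed parameter are assumptions made for the example.

```python
import re

# RFC 2616 "token": any CHAR except CTLs and separators.
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
MEDIA_TYPE = re.compile(r"[ \t]*" + TOKEN + "/" + TOKEN + r"[ \t]*")
# One ";" attribute "=" value, where value is a token or a quoted-string.
PARAM = re.compile(
    r";[ \t]*(?P<name>" + TOKEN + r")=(?:(?P<token>" + TOKEN
    + r')|"(?P<quoted>(?:[^"\\]|\\.)*)")[ \t]*'
)


def charset_param(value):
    """Hypothetical helper: value of the charset parameter, or None."""
    media = MEDIA_TYPE.match(value)
    if media is None:
        return None
    pos = media.end()
    while pos < len(value):
        param = PARAM.match(value, pos)
        if param is None:
            return None  # assumption: stop at the first malformed parameter
        if param.group("name").lower() == "charset":
            quoted = param.group("quoted")
            if quoted is not None:
                # Undo quoted-pair escaping: backslash + CHAR -> CHAR.
                return re.sub(r"\\(.)", r"\1", quoted)
            return param.group("token")
        pos = param.end()
    return None


print(charset_param("text/plain; charsetfoo=bar; charset=UTF-8"))
print(charset_param("text/plain; foocharset=UTF-8"))
print(charset_param("text/plain; charset='UTF-8'"))
```

Because only whole parameter names are compared, `charsetfoo` and `foocharset` are both skipped. In the last call the apostrophes are ordinary token characters in HTTP, so they come back as part of the value instead of being treated as quotes.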
(1) It's used in several places, click the name of the algorithm for a list.

(2) Isn't the violation pretty obvious? I don't really know how to begin describing it, I mean, the whole algorithm is in its entirety one big violation, no?

(3) It is my understanding that, notwithstanding your experience with IE, there are pages depending on this. However, I may be mistaken. If you have detailed research on this topic, that would be most welcome. I'm happy to change this algorithm in response to such data.

(4, 5, etc) The algorithm violates HTTP in a huge bunch of ways, backslashes aren't the half of it. If I recall correctly, this algorithm was based directly on one of the prominent implementations. I do not recall which. I'm happy to update it if there is clear evidence that the current algorithm is not compatible with legacy content.

As a general note, please file just one bug per issue. Thanks.

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document: http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: Insufficient data.
If you do believe that you need to "willfully" violate the spec, then it's you who needs to back this up with data. I have already shown that one aspect (single quotes) isn't supported in IE8, Alexey has stated that they do not implement this either (more data on this would indeed be useful), and also the algorithm fails to properly parse valid values.
Created attachment 908: content type with encoding in single quotes
"asis" document to be served with httpd/mod_asis

Created attachment 909: content type with encoding in double quotes
"asis" document to be served with httpd/mod_asis
(In reply to comment #4)
> (3) It is my understanding that, notwithstanding your experience with IE, there
> are pages depending on this. However, I may be mistaken. If you have detailed
> research on this topic, that would be most welcome. I'm happy to change this
> algorithm in response to such data.

I just added two test cases that show that IE8 does not parse encoding in single quotes, as required by the spec. Thus, violating the spec is not needed for "web compatibility".

(In reply to comment #8)
> (In reply to comment #4)
> > (3) It is my understanding that, notwithstanding your experience with IE, there
> > are pages depending on this. However, I may be mistaken. If you have detailed
> > research on this topic, that would be most welcome. I'm happy to change this
> > algorithm in response to such data.
>
> I just added two test cases that show that IE8 does not parse encoding in
> single quotes, as required by the spec. Thus, violating the spec is not needed
> for "web compatibility".

By "the spec" I mean RFC 2616.
Created attachment 910: test case with charsetfoo parameter
"asis" document to be served with httpd/mod_asis

This test case shows that Firefox ignores an unknown parameter, even though it starts with the sequence "charset".
None of these tests are testing what the spec is doing here. This has nothing to do with HTTP. If you click the name of the algorithm you will get a list of the places that use this.

Since I was looking at this part of the spec, though, I went ahead and redid a round of testing to see how the browsers were doing today. I didn't have IE handy, but looking at the other major browsers, it looks like all but Firefox skip past unknown parameters, so I've changed the spec to support that.

WebKit and Opera's main sources of bugs are that they do extra trimming that the spec doesn't expect. Both also do weird things with unmatched single quotes that they don't do with unmatched double quotes, which is weird. WebKit also seems to have trouble with """. Gecko doesn't drop charset declarations when they have unmatched quotes, but otherwise seems to match the spec. I haven't changed the spec to match these quirks, since it's not clear whether pages depend on them or not.

My test suite is here, in case anyone can test IE9:
http://www.hixie.ch/tests/adhoc/html/parsing/encoding/all.html

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document: http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: No new information was provided by the reporter since the rejection in comment 4.
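The revised spec text is not quoted in this thread, so the following Python sketch rests on an assumption: that "skip past unknown parameters" means continuing to search for a later "charset" whenever a match is not followed by an equals sign. The function name `extract_label_looping` is invented for the example, and it returns the raw label, not an encoding.

```python
import re

WHITESPACE = "\t\n\x0c\r "
CHARSET_WORD = re.compile("charset", re.IGNORECASE | re.ASCII)


def skip_whitespace(s, pos):
    """Advance pos past any of the five whitespace characters."""
    while pos < len(s) and s[pos] in WHITESPACE:
        pos += 1
    return pos


def extract_label_looping(s):
    """Hypothetical looping variant: return the label string, or None."""
    pos = 0
    while True:
        match = CHARSET_WORD.search(s, pos)
        if match is None:
            return None
        pos = skip_whitespace(s, match.end())
        if pos < len(s) and s[pos] == "=":
            break
        # Assumed change: no "=" here, so look for a later "charset".
    pos = skip_whitespace(s, pos + 1)
    if pos >= len(s):
        return None
    quote = s[pos]
    if quote in "\"'":
        end = s.find(quote, pos + 1)
        return None if end == -1 else s[pos + 1:end]
    end = pos
    while end < len(s) and s[end] not in WHITESPACE + ";":
        end += 1
    return s[pos:end]


print(extract_label_looping("text/plain; charsetfoo=bar; charset=UTF-8"))
print(extract_label_looping("text/plain; foocharset=UTF-8"))
```

With this variant the `charsetfoo` header yields UTF-8. The search is still a substring match, though, so a parameter such as `foocharset=UTF-8` is still picked up.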
Checked in as WHATWG revision r5546. Check-in comment: Update to better match UAs. http://html5.org/tools/web-apps-tracker?from=5545&to=5546
First of all: "Status: rejected" is very very misleading, as you indeed fix one of the problems I reported (the one in comment 1, issue (5)). I'll review the other changes later on, and will probably reopen the bug as you did not address issue (2) (for instance).
(In reply to comment #11)
> My test suite is here, in case anyone can test IE9:
> http://www.hixie.ch/tests/adhoc/html/parsing/encoding/all.html

12. test 012: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta http-equiv="Content-Type " content="text/html; charset=ISO-8859-9">
13. test 013: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta content="text/html; charset=ISO-8859-9" http-equiv="Content-Type ">
14. test 014: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta http-equiv="Content-Type>" content="text/html; charset=ISO-8859-9">
15. test 015: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta content="text/html; charset=ISO-8859-9" http-equiv="Content-Type>">
16. test 016: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta http-equiv="Content-Style-Type" content="text/html; charset=ISO-8859-9">
17. test 017: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta content="text/html; charset=ISO-8859-9" http-equiv="Content-Style-Type">
18. test 018: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta name="Content-Style-Type" content="text/html; charset=ISO-8859-9">
19. test 019: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta content="text/html; charset=ISO-8859-9" name="Content-Style-Type">
20. test 020: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta content="text/html; charset=ISO-8859-9">
21. test 021: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta content=" text/html; charset = ISO-8859-9 ">
22. test 022: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta content=" text/html; charset=ISO-8859-9 ">
26. test 026: expected Windows-1252; used Windows-1254
    <!DOCTYPE HTML> <meta charset=ISO-8859-9"> <p>"</p>
56. test 056: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <script>document.write('<meta charset="ISO-8859-' + '9">')</script>
57. test 057: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <script>var s = '9">'; document.write('<meta charset="ISO-8859-' + s)</script>
58. test 058: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <script>document.write('<meta charset="ISO-8859-9">')</script>
59. test 059: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <script type="text/plain"><meta charset="ISO-8859-9"></script>
60. test 060: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <style type="text/plain"><meta charset="ISO-8859-9"></style>
65. test 065: expected Windows-1252; used Windows-1254
    <!DOCTYPE HTML> <meta charset=" ISO-8859-9 ">
67. test 067: expected Windows-1252; used Windows-1254
    <!DOCTYPE HTML> <meta charset=" ISO-8859-9 ">
69. test 069: expected Windows-1252; used Windows-1254
    <!DOCTYPE HTML> <meta http-equiv="Content-Type" charset=" ISO-8859-9 ">
71. test 071: expected Windows-1252; used Windows-1254
    <!DOCTYPE HTML> <meta http-equiv="Content-Type" charset=" ISO-8859-9 ">
74. test 074: expected Windows-1252; used Windows-1254
    <!DOCTYPE HTML> <meta http-equiv="Content-Type" content="text/html;charset="ISO-8859-9">
75. test 075: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta http-equiv="Content-Type" content="text/html;charset='ISO-8859-9'">
77. test 077: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta http-equiv="Content-Type" content="text/html;charset='ISO-8859-9';">
82. test 082: expected Windows-1252; used Windows-1254
    <!DOCTYPE HTML> <meta http-equiv="Content-Type" content="text/html;charsetx=ISO-8859-9">
86. test 086: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta http-equiv="Content-Type" content="text/html charset=ISO-8859-9">
88. test 088: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta http-equiv="Content-Type" content="text/html charset = ISO-8859-9">
89. test 089: expected Windows-1254; used Windows-1252
    <!DOCTYPE HTML> <meta http-equiv="Content-Type" content="text/html charset = 'ISO-8859-9'">
(In reply to comment #3)
> Safari's (closed source) network back-end parses Content-Type for charset in a
> manner that's different from both RFC2616 and the current HTML5 spec. We also
> have a separate parser in WebKit that's used in a limited set of circumstances
> (only for XMLHttpRequest.overrideMimeType and for correcting a MIME type set
> via XMLHttpRequest.setRequestHeader).
>
> But I am not aware of any evidence against following RFC2616 to the letter when
> getting charset out of Content-Type.
>
> > Content-Type: text/plain; charsetfoo=bar; charset=UTF-8
>
> Also, this algorithm fails to ignore charset in "Content-Type: text/plain;
> foocharset=UTF-8".

Indeed, even after the change. Re-opening.
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document: http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: The spec as written is based on a huge test suite and is intended to most closely match the majority of browsers. If you care more about matching a ridiculously complex spec than about minimising implementation burden then raise an issue.
http://www.w3.org/html/wg/tracker/issues/148
> I have already shown that one aspect (single quotes) isn't supported in IE8, Alexey has stated that they do not implement this either

I've been commenting about HTTP Content-Type header field parsing. WebKit parsing of META is different, and it currently supports single quotes (the code is at <http://trac.webkit.org/browser/trunk/Source/WebCore/html/parser/HTMLMetaCharsetParser.cpp#L57>). I don't know if there is a significant amount of content relying on that.