9628 – Algorithm for detecting the charset="" parameter.

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	pre-LC1 HTML5 spec (editor: Ian Hickson)
Version:	unspecified
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

Keywords:	NE, TrackerIssue

Reported:	2010-04-30 14:49 UTC by Julian Reschke
Modified:	2011-03-11 18:33 UTC
CC List:	7 users

Attachments
content type with encoding in single quotes (62 bytes, text/plain) 2010-08-17 11:39 UTC, Julian Reschke
content type with encoding in double quotes (62 bytes, text/plain) 2010-08-17 11:40 UTC, Julian Reschke
test case with charsetfoo parameter (76 bytes, text/plain) 2010-08-17 12:28 UTC, Julian Reschke

Description Julian Reschke 2010-04-30 14:49:48 UTC

From http://dev.w3.org/html5/spec/infrastructure.html#content-type-sniffing:

"The algorithm for extracting an encoding from a Content-Type, given a string s, is as follows. It either returns an encoding or nothing.

   1. Find the first seven characters in s that are an ASCII case-insensitive match for the word "charset". If no such match is found, return nothing.

   2. Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that immediately follow the word "charset" (there might not be any).

   3. If the next character is not a U+003D EQUALS SIGN ('='), return nothing and abort these steps.

   4. Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that immediately follow the equals sign (there might not be any).

   5. Process the next character as follows:

      If it is a U+0022 QUOTATION MARK ('"') and there is a later U+0022 QUOTATION MARK ('"') in s
      If it is a U+0027 APOSTROPHE ("'") and there is a later U+0027 APOSTROPHE ("'") in s
          Return the encoding corresponding to the string between this character and the next earliest occurrence of this character.
      If it is an unmatched U+0022 QUOTATION MARK ('"')
      If it is an unmatched U+0027 APOSTROPHE ("'")
      If there is no next character
          Return nothing.
      Otherwise
          Return the encoding corresponding to the string from this character to the first U+0009, U+000A, U+000C, U+000D, U+0020, or U+003B character or the end of s, whichever comes first.

Note: This requirement is a willful violation of the HTTP specification, motivated by the need for backwards compatibility with legacy content. [HTTP]"
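
The quoted steps can be sketched in Python roughly as follows. This is a minimal sketch for discussion, not the spec's or any browser's implementation: the names extract_charset, WS, STOP and CHARSET are assumptions, and the function returns the raw label rather than "the encoding corresponding to" it.

```python
import re
from typing import Optional

# Characters skipped in steps 2 and 4: TAB, LF, FF, CR, SPACE.
WS = "\t\n\x0c\r "
# Characters that end an unquoted value in step 5 (whitespace or ';').
STOP = WS + ";"
# re.ASCII keeps the case-insensitive match ASCII-only, as step 1 asks.
CHARSET = re.compile("charset", re.IGNORECASE | re.ASCII)


def extract_charset(s: str) -> Optional[str]:
    """Return the raw encoding label found by the quoted steps, or None."""
    # Step 1: first occurrence of "charset", wherever it appears in s.
    m = CHARSET.search(s)
    if m is None:
        return None
    i = m.end()
    # Step 2: skip whitespace after the word.
    while i < len(s) and s[i] in WS:
        i += 1
    # Step 3: the next character must be '='.
    if i >= len(s) or s[i] != "=":
        return None
    i += 1
    # Step 4: skip whitespace after '='.
    while i < len(s) and s[i] in WS:
        i += 1
    # Step 5: quoted value, unmatched quote, nothing, or unquoted value.
    if i >= len(s):
        return None
    ch = s[i]
    if ch in "\"'":
        end = s.find(ch, i + 1)
        if end == -1:
            return None  # unmatched quote
        return s[i + 1:end]
    j = i
    while j < len(s) and s[j] not in STOP:
        j += 1
    return s[i:j]
```

With this sketch, extract_charset("text/html;charset='ISO-8859-9'") yields the label ISO-8859-9, which is the single-quote behaviour questioned in (3) below.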

General problems:

(1) This algorithm doesn't seem to be used.

(2) It's VERY unfriendly to the reader to claim that there's a violation of the HTTP spec without saying what it is.

Specific problems:

(3) The algorithm requires allowing single quotes; this is indeed a violation of the HTTP syntax. I just checked with IE8; it doesn't allow single quotes. Thus, the claim "needed for backwards compatibility" appears to be incorrect.

(4) The spec also violates HTTP in that the backslash character inside quoted values isn't treated properly. If this is needed "for compatibility", this should be backed up with data.
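
To illustrate (4): in RFC 2616 a quoted-string may contain quoted-pairs, where a backslash escapes the character that follows it. The sketch below shows that unescaping step only; the helper name unquote_rfc2616 is hypothetical and the function does not validate the full grammar. The quoted HTML5 steps, by contrast, take the text between the quotes verbatim, backslashes included.

```python
def unquote_rfc2616(quoted: str) -> str:
    """Strip the surrounding double quotes and resolve backslash escapes.

    Simplified sketch: assumes the input starts and ends with '"' and does
    not check that the closing quote is itself unescaped.
    """
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("not a double-quoted string")
    out = []
    i = 1
    end = len(quoted) - 1  # index of the closing quote
    while i < end:
        # A backslash makes the next character literal (quoted-pair).
        if quoted[i] == "\\" and i + 1 < end:
            i += 1
        out.append(quoted[i])
        i += 1
    return "".join(out)
```

For a parameter written as charset="UTF\-8", this helper gives UTF-8, whereas taking the text between the quotes verbatim gives UTF\-8.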

Comment 1 Julian Reschke 2010-04-30 16:45:16 UTC

(5) The algorithm fails to parse headers that have an additional parameter starting with the letters "charset", such as 

  Content-Type: text/plain; charsetfoo=bar; charset=UTF-8

(or a preceding parameter that happens to have "charset" as value).
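
The failure can be reproduced with a sketch of steps 1 to 3 alone. The function name reaches_equals_sign is an assumption made for this illustration; it reports whether the quoted algorithm gets past step 3 for a given string.

```python
import re

# Characters skipped in step 2: TAB, LF, FF, CR, SPACE.
WS = "\t\n\x0c\r "


def reaches_equals_sign(s: str) -> bool:
    """True if the first "charset" in s is followed (after whitespace) by '='."""
    # Step 1: only the first match is ever considered.
    m = re.search("charset", s, re.IGNORECASE | re.ASCII)
    if m is None:
        return False
    i = m.end()
    # Step 2: skip whitespace.
    while i < len(s) and s[i] in WS:
        i += 1
    # Step 3: anything other than '=' makes the algorithm return nothing.
    return i < len(s) and s[i] == "="
```

For "text/plain; charsetfoo=bar; charset=UTF-8" the first match is inside charsetfoo, the next character is "f", and the sketch returns False, so the later charset=UTF-8 is never examined. The same happens for a value such as "text/plain; foo=charset; charset=UTF-8".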

Comment 2 Julian Reschke 2010-05-07 17:03:55 UTC

(In reply to comment #0)
> General problems:
> 
> (1) This algorithm doesn't seem to be used.

OK, I just noticed that the back ref tooltip only works as expected when using the single-page version. So this is used, and thus fixing the other issues is even more important.

Comment 3 Alexey Proskuryakov 2010-08-11 19:49:27 UTC

Safari's (closed source) network back-end parses Content-Type for charset in a manner that's different from both RFC2616 and the current HTML5 spec. We also have a separate parser in WebKit that's used in a limited set of circumstances (only for XMLHttpRequest.overrideMimeType and for correcting a MIME type set via XMLHttpRequest.setRequestHeader).

But I am not aware of any evidence against following RFC2616 to the letter when getting charset out of Content-Type.

>  Content-Type: text/plain; charsetfoo=bar; charset=UTF-8

Also, this algorithm fails to ignore charset in "Content-Type: text/plain; foocharset=UTF-8".
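
For contrast, a lookup by parameter name would not be misled by either example. The sketch below is hypothetical (the name charset_by_parameter_name is an assumption) and deliberately simplified: it splits the header value on ";" without honoring quoted-strings that contain ";", tolerates whitespace around "=", and does not resolve backslash escapes, so it is not a full RFC 2616 parser.

```python
from typing import Optional


def charset_by_parameter_name(header_value: str) -> Optional[str]:
    """Return the value of the parameter named exactly "charset", or None.

    header_value is the field value only, e.g. "text/plain; charset=UTF-8".
    """
    # Drop the media type itself; keep the parameters after the first ';'.
    _media_type, *params = header_value.split(";")
    for param in params:
        name, sep, value = param.partition("=")
        # Compare the whole parameter name, not a substring of it.
        if sep and name.strip().lower() == "charset":
            return value.strip().strip('"')
    return None
```

Here "text/plain; foocharset=UTF-8" gives None, while "text/plain; charsetfoo=bar; charset=UTF-8" gives UTF-8.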

Comment 4 Ian 'Hixie' Hickson 2010-08-16 21:45:19 UTC

(1) It's used in several places, click the name of the algorithm for a list.

(2) Isn't the violation pretty obvious? I don't really know how to begin describing it, I mean, the whole algorithm is in its entirety one big violation, no?

(3) It is my understanding that, notwithstanding your experience with IE, there are pages depending on this. However, I may be mistaken. If you have detailed research on this topic, that would be most welcome. I'm happy to change this algorithm in response to such data.

(4, 5, etc) The algorithm violates HTTP in a huge bunch of ways, backslashes aren't the half of it.

If I recall correctly, this algorithm was based directly on one of the prominent implementations. I do not recall which. I'm happy to update it if there is clear evidence that the current algorithm is not compatible with legacy content.

As a general note, please file just one bug per issue. Thanks.

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: Insufficient data.

Comment 5 Julian Reschke 2010-08-17 06:21:39 UTC

If you do believe that you need to "willfully" violate the spec, then it's you who needs to back this up with data.

I have already shown that one aspect (single quotes) isn't supported in IE8, Alexey has stated that they do not implement this either (more data on this would indeed be useful), and also the algorithm fails to properly parse valid values.

Comment 6 Julian Reschke 2010-08-17 11:39:35 UTC

Created attachment 908
content type with encoding in single quotes

"asis" document to be served with httpd/mod_asis

Comment 7 Julian Reschke 2010-08-17 11:40:20 UTC

Created attachment 909
content type with encoding in double quotes

"asis" document to be served with httpd/mod_asis

Comment 8 Julian Reschke 2010-08-17 11:41:45 UTC

(In reply to comment #4)
> (3) It is my understanding that, notwithstanding your experience with IE, there
> are pages depending on this. However, I may be mistaken. If you have detailed
> research on this topic, that would be most welcome. I'm happy to change this
> algorithm in response to such data.

I just added two test cases that show that IE8 does not parse encoding in single quotes, as required by the spec. Thus, violating the spec is not needed for "web compatibility".

Comment 9 Julian Reschke 2010-08-17 12:02:41 UTC

(In reply to comment #8)
> (In reply to comment #4)
> > (3) It is my understanding that, notwithstanding your experience with IE, there
> > are pages depending on this. However, I may be mistaken. If you have detailed
> > research on this topic, that would be most welcome. I'm happy to change this
> > algorithm in response to such data.
> 
> I just added two test cases that show that IE8 does not parse encoding in
> single quotes, as required by the spec. Thus, violating the spec is not needed
> for "web compatibility".

Where by "the spec" I mean RFC 2616.

Comment 10 Julian Reschke 2010-08-17 12:28:21 UTC

Created attachment 910
test case with charsetfoo parameter

"asis" document to be served with httpd/mod_asis

This test case shows that Firefox ignores an unknown parameter, even though it starts with the sequence "charset".

Comment 11 Ian 'Hixie' Hickson 2010-09-29 02:00:34 UTC

None of these tests are testing what the spec is doing here. This has nothing to do with HTTP. If you click the name of the algorithm you will get a list of the places that use this.

Since I was looking at this part of the spec, though, I went ahead and redid a round of testing to see how the browsers were doing today. I didn't have IE handy, but looking at the other major browsers, it looks like all but Firefox skip past unknown parameters, so I've changed the spec to support that.
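
One way to read "skip past unknown parameters" is to retry the search whenever a "charset" match is not followed by "=". The sketch below makes that assumption; it is not the text of the revised spec, and the name extract_charset_retrying is hypothetical.

```python
import re
from typing import Optional

WS = "\t\n\x0c\r "   # TAB, LF, FF, CR, SPACE
STOP = WS + ";"      # terminators for an unquoted value
CHARSET = re.compile("charset", re.IGNORECASE | re.ASCII)


def extract_charset_retrying(s: str) -> Optional[str]:
    """Like the original steps, but keeps searching after a failed match."""
    pos = 0
    while True:
        m = CHARSET.search(s, pos)
        if m is None:
            return None
        i = m.end()
        while i < len(s) and s[i] in WS:
            i += 1
        if i < len(s) and s[i] == "=":
            break
        # Assumed change: not followed by '=', so look for a later match.
        pos = m.end()
    i += 1
    while i < len(s) and s[i] in WS:
        i += 1
    if i >= len(s):
        return None
    ch = s[i]
    if ch in "\"'":
        end = s.find(ch, i + 1)
        return None if end == -1 else s[i + 1:end]
    j = i
    while j < len(s) and s[j] not in STOP:
        j += 1
    return s[i:j]
```

Under this reading, "text/plain; charsetfoo=bar; charset=UTF-8" now yields UTF-8, but "text/plain; foocharset=UTF-8" still yields UTF-8 as well, because the search remains substring-based.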

WebKit and Opera's main sources of bugs are that they do extra trimming that the spec doesn't expect. Both also handle unmatched single quotes differently from unmatched double quotes, which is odd. WebKit also seems to have trouble with "&quot;". Gecko doesn't drop charset declarations when they have unmatched quotes, but otherwise seems to match the spec. I haven't changed the spec to match these quirks, since it's not clear whether pages depend on them or not.

My test suite is here, in case anyone can test IE9:
http://www.hixie.ch/tests/adhoc/html/parsing/encoding/all.html

Status: Rejected
Change Description: no spec change
Rationale: No new information was provided by the reporter since the rejection in comment 4.

Comment 12 contributor 2010-09-29 02:03:34 UTC

Checked in as WHATWG revision r5546.
Check-in comment: Update to better match UAs.
http://html5.org/tools/web-apps-tracker?from=5545&to=5546

Comment 13 Julian Reschke 2010-09-29 07:03:54 UTC

First of all: "Status: rejected" is very misleading, as you did in fact fix one of the problems I reported (the one in comment 1, issue (5)).

I'll review the other changes later on, and will probably reopen the bug as you did not address issue (2) (for instance).

Comment 14 Julian Reschke 2010-09-29 09:54:48 UTC

(In reply to comment #11)
> My test suite is here, in case anyone can test IE9:
>    http://www.hixie.ch/tests/adhoc/html/parsing/encoding/all.html

12.test 012: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta http-equiv="Content-Type " content="text/html; charset=ISO-8859-9">
13.test 013: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta content="text/html; charset=ISO-8859-9" http-equiv="Content-Type ">
14.test 014: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta http-equiv="Content-Type>" content="text/html; charset=ISO-8859-9">
15.test 015: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta content="text/html; charset=ISO-8859-9" http-equiv="Content-Type>">
16.test 016: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta http-equiv="Content-Style-Type" content="text/html; charset=ISO-8859-9">
17.test 017: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta content="text/html; charset=ISO-8859-9" http-equiv="Content-Style-Type">
18.test 018: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta name="Content-Style-Type" content="text/html; charset=ISO-8859-9">
19.test 019: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta content="text/html; charset=ISO-8859-9" name="Content-Style-Type">
20.test 020: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta content="text/html; charset=ISO-8859-9">
21.test 021: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta content=" text/html; charset = ISO-8859-9 ">
22.test 022: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta content="
text/html; charset=ISO-8859-9
">
26.test 026: expected Windows-1252; used Windows-1254<!DOCTYPE HTML>
<meta charset=ISO-8859-9">
<p>"</p>
56.test 056: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<script>document.write('<meta charset="ISO-8859-' + '9">')</script>
57.test 057: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<script>var s = '9">'; document.write('<meta charset="ISO-8859-' + s)</script>
58.test 058: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<script>document.write('<meta charset="ISO-8859-9">')</script>
59.test 059: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<script type="text/plain"><meta charset="ISO-8859-9"></script>
60.test 060: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<style type="text/plain"><meta charset="ISO-8859-9"></style>
65.test 065: expected Windows-1252; used Windows-1254<!DOCTYPE HTML>
<meta charset=" 	
ISO-8859-9	 ">
67.test 067: expected Windows-1252; used Windows-1254<!DOCTYPE HTML>
<meta charset=" ISO-8859-9 ">
69.test 069: expected Windows-1252; used Windows-1254<!DOCTYPE HTML>
<meta http-equiv="Content-Type" charset=" 	
ISO-8859-9	 ">
71.test 071: expected Windows-1252; used Windows-1254<!DOCTYPE HTML>
<meta http-equiv="Content-Type" charset=" ISO-8859-9 ">
74.test 074: expected Windows-1252; used Windows-1254<!DOCTYPE HTML>
<meta http-equiv="Content-Type" content="text/html;charset=&quot;ISO-8859-9">
75.test 075: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta http-equiv="Content-Type" content="text/html;charset='ISO-8859-9'">
77.test 077: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta http-equiv="Content-Type" content="text/html;charset='ISO-8859-9';">
82.test 082: expected Windows-1252; used Windows-1254<!DOCTYPE HTML>
<meta http-equiv="Content-Type" content="text/html;charsetx=ISO-8859-9">
86.test 086: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta http-equiv="Content-Type" content="text/html charset=ISO-8859-9">
88.test 088: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta http-equiv="Content-Type" content="text/html charset = ISO-8859-9">
89.test 089: expected Windows-1254; used Windows-1252<!DOCTYPE HTML>
<meta http-equiv="Content-Type" content="text/html charset = 'ISO-8859-9'">

Comment 15 Julian Reschke 2010-11-13 17:34:47 UTC

(In reply to comment #3)
> Safari's (closed source) network back-end parses Content-Type for charset in a
> manner that's different from both RFC2616 and the current HTML5 spec. We also
> have a separate parser in WebKit that's used in a limited set of circumstances
> (only for XMLHttpRequest.overrideMimeType and for correcting a MIME type set
> via XMLHttpRequest.setRequestHeader).
> 
> But I am not aware of any evidence against following RFC2616 to the letter when
> getting charset out of Content-Type.
> 
> >  Content-Type: text/plain; charsetfoo=bar; charset=UTF-8
> 
> Also, this algorithm fails to ignore charset in "Content-Type: text/plain;
> foocharset=UTF-8".

Indeed, even after the change.

Re-opening.

Comment 16 Ian 'Hixie' Hickson 2010-11-13 21:24:11 UTC

Status: Rejected
Change Description: no spec change
Rationale: The spec as written is based on a huge test suite and is intended to most closely match the majority of browsers. If you care more about matching a ridiculously complex spec than about minimising implementation burden then raise an issue.

Comment 17 Sam Ruby 2010-11-29 19:45:00 UTC

http://www.w3.org/html/wg/tracker/issues/148

Comment 18 Alexey Proskuryakov 2011-03-11 18:33:43 UTC

> I have already shown that one aspect (single quotes) isn't supported in IE8, Alexey has stated that they do not implement this either

I've been commenting about HTTP Content-Type header field parsing. WebKit parsing of META is different, and it currently supports single quotes (the code is at <http://trac.webkit.org/browser/trunk/Source/WebCore/html/parser/HTMLMetaCharsetParser.cpp#L57>). I don't know if there is a significant amount of content relying on that.