11973 – HTML Spec confuses character sets with character encodings

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Summary: HTML Spec confuses character sets with character encodings

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	LC1 HTML5 spec
Version:	unspecified
Hardware:	PC other

Importance:	P3 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

URL:	http://dev.w3.org/html5/spec/Overview...

Reported:	2011-02-03 19:51 UTC by Craig S
Modified:	2011-08-04 05:17 UTC
CC List:	6 users

Description Craig S 2011-02-03 19:51:58 UTC

The relevant part of the specification is as follows: "The charset attribute specifies the character encoding used by the document."

This has been a problem with the HTML specification since...well, since at least HTML 3.2, but it became more of a problem in the late '90s with HTML 4.01, the formulation of XML 1.0, and the rising use of Unicode for information interchange.

What I find confusing is the specification's mixing up of the terms character set and character encoding. At one point, the spec is talking about the character set of the document, while at another it is clearly talking about the character encoding of the document--though using the misnomer attribute name @charset.

I recently worked on a project grabbing text from MS Word, storing it in a database, and retrieving that text to display on a web page. As you might have already guessed, I got "funky" characters in the output (usually ?'s, some /'s and a few boxes). This is to be expected, unfortunately.

The problem is that the text was saved in the UTF-8 character encoding, and so the web page was sent with the following Content-Type: "text/html; charset=UTF-8". However, the document is using the windows-1252 character set. Let me rephrase this: the text is encoded with UTF-8 using the windows-1252 character set (which is what MS Word uses).

Now, if I change the http-equiv=Content-Type to the following: "text/html; charset=Windows-1252", then the document displays correctly. Therefore, even though the spec clearly says that "charset...specifies the character encoding used by the document", it should instead read "charset...specifies the character set used by the document."

However, I recognize that it is equally important for a UA to know how the document is encoded, as has been discussed with the potential security implications of UTF-7 over UTF-8, for example. Therefore, I propose that the specification also include an @encoding attribute, perhaps on the META element, much as XML 1.0 has an @encoding attribute. In this way, a UA can unambiguously determine both the encoding used to store the document as bytes, and the set of characters those bytes encode, i.e. the character set. Furthermore, a META element with the @encoding attribute should be mandatory since it is impossible to differentiate, for example, a document that has been stored in UTF-8 vs. ANSI.

With these two pieces of information, a UA now knows how to decode the bytes of a document and which characters those bytes encode.

Comment 1 Tab Atkins Jr. 2011-02-03 20:03:31 UTC

(In reply to comment #0)
> Let me rephrase this: the text is encoded with UTF-8 using the windows-1252
> character set (which is what MS Word uses).

I'm not certain I understand.  Do you mean that the document is using utf-8 encoding, but with the windows-1252 character set masquerading as codepoints?  So that, for example, € is encoded as if its codepoint was 0x80 rather than 0x20ac?

Comment 2 Julian Reschke 2011-02-03 20:05:19 UTC

(In reply to comment #0)
> Let me rephrase this: the text is encoded with UTF-8 using the windows-1252
> character set (which is what MS Word uses).

That doesn't make sense to me.

The character set of HTML (as in: the repertoire of characters that can be used) is fixed to be Unicode.

It is *encoded* in exactly one encoding (well, at least a non-broken document). No matter what the metadata says.

It's unfortunate that some attributes/params say "charset" when they should say "encoding", but that's something that can't be easily changed.

Comment 3 Craig S 2011-02-03 20:16:26 UTC

I was doing a lot of reading, and this is the best I could explain it.

I took a Word document and used the File->Save As feature to save the document as HTML (filtered), which removes all the MS-specific XML namespace stuff and sticks to traditional HTML.

As we all know, Word replaces the standard apostrophe and double-quotes with "curly" versions. Now, when I looked at the hexadecimal value of a right single quote as stored in the document, it had the following value: 0xe2, 0x80, 0x99 (which shows up as lower-case a with a caron, a Euro currency symbol, and the trademark symbol, in the dump viewer). Now, this is a UTF-8 encoding for a right single quote. However, in my web browser (IE9 beta), it shows up as a '?'.

Now, if I actually specify in a META element http-equiv=Content-Type content="text/html; charset=windows-1252", then the page is displayed correctly with the correct character, even though that character is still encoded with the 3 bytes shown above.

Perhaps I misunderstood the problem, however, from what I can see, Word uses the windows-1252 character set, and when I send the charset=windows-1252 over to the UA, it displays correctly. As far as I know, windows-1252 does not necessarily need to be encoded in UTF-8. It could just as easily use ASCII encoding.

Comment 4 Craig S 2011-02-03 20:21:26 UTC

I was mistaken about the lower-case 'a' with caron, it is a â.
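The byte values reported in comment 3 can be checked with a short Python sketch. It assumes only the three bytes quoted there (0xe2, 0x80, 0x99); the variable names are illustrative. It decodes the same bytes once as UTF-8 and once as windows-1252 (Python codec name "cp1252").

```python
# The three bytes observed in the saved document for the curly apostrophe.
raw = bytes([0xE2, 0x80, 0x99])

# Decoded as UTF-8, the three bytes form one character:
# U+2019 RIGHT SINGLE QUOTATION MARK.
as_utf8 = raw.decode("utf-8")

# Decoded as windows-1252, each byte is a separate character:
# 0xE2 -> â, 0x80 -> €, 0x99 -> ™ (what the dump viewer displayed).
as_cp1252 = raw.decode("cp1252")

print(repr(as_utf8), len(as_utf8))      # '’' 1
print(repr(as_cp1252), len(as_cp1252))  # 'â€™' 3
```

The dump viewer's "â€™" is therefore consistent with a viewer that interprets each byte as windows-1252, while the bytes themselves are the UTF-8 encoding of a single right single quote.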

Comment 5 Simon Pieters 2011-02-03 20:54:40 UTC

As far as I can tell, the spec is not confused; it's just as Julian says that some attributes/params have unfortunate names for legacy reasons.

Comment 6 Craig S 2011-02-03 21:07:50 UTC

(In reply to comment #5)
> As far as I can tell, the spec is not confused; it's just as Julian says that
> some attributes/params have unfortunate names for legacy reasons.

I agree. However, I still maintain that the spec really should say that "The charset attribute specifies the character set used by the document.", as this seems to be the way UAs are in fact treating it. This would at least make the spec "definition" align with the attribute name. In addition, perhaps the spec could mandate that all HTML files are to be encoded (read stored or saved) as UTF-8. Then, with the combination of the mandated encoding, and the declaration of the character set, a UA knows how to interpret the document. Also, it preserves 99.999% of all web pages in the wild (since ANSI/ASCII plain is already valid UTF-8). Furthermore, the spec can continue to say that the default character set for HTML is UTF-8 (and should you want anything different, be sure to specify it with the META tag using one of the specified methods).

I found a snippet of text on stackoverflow.com (http://stackoverflow.com/questions/2014069/windows-1252-to-utf-8-encoding) that was interesting:

"While utf8 is valid Win-1252, the reverse is not true: win-1252 is NOT valid UTF-8."

This explains why I see "funky" characters in my HTML page when sent as charset=UTF-8 as opposed to charset=windows-1252 (which displays correctly).
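The second half of the quoted claim can be illustrated with a small Python sketch. The sample string and variable names are assumptions made for the example, not data from the project described above. It encodes a curly apostrophe as windows-1252 and then tries to read those bytes as UTF-8, which is roughly what a UA does when told charset=UTF-8.

```python
# Sample text containing a curly apostrophe (U+2019), as Word would insert.
text = "It\u2019s"

# Stored as windows-1252, the apostrophe is the single byte 0x92.
cp1252_bytes = text.encode("cp1252")     # b'It\x92s'

# A lone 0x92 byte is not a valid UTF-8 sequence, so strict decoding fails.
try:
    cp1252_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# A lenient decoder substitutes U+FFFD (the replacement character),
# which is one source of "funky" characters in the rendered page.
replaced = cp1252_bytes.decode("utf-8", errors="replace")

# Plain ASCII text, by contrast, has identical bytes in ASCII and UTF-8.
ascii_same = "plain text".encode("ascii") == "plain text".encode("utf-8")

print(decodes_as_utf8, repr(replaced), ascii_same)
```

The first half of the quote is only approximately true: a handful of byte values (0x81, for example) are unassigned in windows-1252, and Python's cp1252 codec rejects them, so not every UTF-8 byte sequence decodes cleanly as windows-1252 either.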

Thank you all for your comments.

Comment 7 Anne 2011-02-04 10:40:45 UTC

Craig, your terminology is incorrect. UTF-8 or Windows-1252 are no "character sets". See http://en.wikipedia.org/wiki/Character_encoding for more information.

Comment 8 Craig S 2011-02-04 14:36:09 UTC

(In reply to comment #7)
> Craig, your terminology is incorrect. UTF-8 or Windows-1252 are no "character
> sets". See http://en.wikipedia.org/wiki/Character_encoding for more
> information.

Hmm, I stand corrected. It just goes to show how confusing a seemingly simple concept can be--it's a shame, too, given how important it is in computing for internationalization and information interchange.

Thanks again for the wiki page. Even though I'm wrong with regard to this, I learned something, which is what's most important.

Kindest Regards,

Craig S.

Comment 9 Michael[tm] Smith 2011-08-04 05:17:22 UTC

mass-move component to LC1