Character encoding in HTML
In the beginning the Web had ASCII. And that was good. But then, not really. The Europeans and their strange accents were a bit of a problem.
So then the Web had iso-latin1. And HTML could be assumed to be using that, by default (RFC2854, section 4). And that was good. But then, not really. There was a whole world out there, with a lot of writing systems, tons of different characters. Many different character encodings...
Today we have Unicode, at long last well adopted in most modern computing systems, and a basic building block of a lot of web technologies. And although there are still a lot of different character encodings available for documents on the web, this is not an issue, as there are mechanisms, both in HTTP and within HTML for instance, to declare the encoding used, and to help tools determine how to decode the content.
All is not always rosy, however. The first issue is that there are quite a lot of mechanisms to declare encoding, and that they don't necessarily agree. The second issue is that not everyone can configure a Web server to declare encoding of HTML documents at the HTTP level.
Many sources, One encoding
if the box says "dangerous, do not open", don't peek inside the box...
A long (web) time ago, there was a very serious discussion to try and determine whether a Web resource was supposed to know its encoding best, or whether the Web server should be the authoritative source.
In the "resource" camp, some were pushing the rather logical argument that a specific document surely knew best about its own metadata that a misconfigured Web server. Who cares if the server thinks that all HTML document it serves are iso-8859-1
, when I, as document author, know full well that I am authoring this particular resource as utf-8
?
The other camp had two killer arguments.
- The first, and perhaps the simplest, argument was: what's the point of having user agents sniff garbage in the hope of finding content, and perhaps a character encoding declaration, when the transport protocol has a way of declaring it? This is the basis for the authoritative metadata principle. This principle is also sometimes summarized as: if I want to show an HTML document as plain text source, rather than have it interpreted by browsers, I should be able to do so. I should be able to serve any document as text/plain if that is my choice.
- The second killer argument was transcoding. A lot of proxies, they said, transform the content they proxy, sometimes from one character encoding to another. So even though a document might say "I am encoded in iso-2022-jp", the proxy should be able to say "actually, trust me, the content I am delivering to you is in utf-8".
In the end, the apparent consensus was that the "server knows best" camp had the sound architectural arguments behind them, and so, for everything on the web served via the HTTP protocol, HTTP has precedence over any other method in determining the encoding (and content type, etc.) of resources.
This means that regardless of what is in an (x)html document, if the server says "this is a text/html document encoded as utf-8", user agents should follow that information. Second-guessing is likely to cause more harm than good.
Unlabeled boxes can be full of treasures, or full of trouble
But what if there is no character encoding declared at the HTTP level? This is where it gets tricky.
"Old school" HTML introduced a specific meta
tag for the declaration of the encoding within the document:
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-5">
Over the years, we have seen that this method was plagued by two serious issues:
- Its syntax. Nobody seems to get it right (it is just... too complicated!) and the Web is littered with approximate, sometimes comical, variants of this syntax. This is no laughing matter for user agents, however, which can't even expect to find this encoding declaration properly marked up!
- The meta element has to be within the head of a document, but there is no guarantee that it will be anywhere near the top of the document. The head of a document can have lots of other metadata (title, description, scripts and stylesheets) before declaring the encoding. This means a lot of sniffing and pseudo-parsing of undecoded garbage. In some cases, it can have dreadful consequences, such as security flaws in the approximate sniffing code (a rough sketch of such a prescan follows this list).
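To give an idea of what that sniffing involves, here is a rough and deliberately lenient sketch (Python; the regular expression is far less thorough than what real user agents implement) of a prescan for a meta-style charset declaration in the first bytes of an undecoded document:
import re

# Rough sketch: look for a meta charset declaration in raw, undecoded bytes.
# Real user agents use a much more tolerant state machine than this regexp.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([a-zA-Z0-9._-]+)',
    re.IGNORECASE)

def sniff_meta_charset(raw_bytes, limit=1024):
    # Only inspect the first `limit` bytes, hoping the declaration comes early.
    match = META_CHARSET.search(raw_bytes[:limit])
    return match.group(1).decode('ascii') if match else None

html = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=ISO-8859-5"></head>')
print(sniff_meta_charset(html))  # -> 'ISO-8859-5'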
It is worth noting that current work on HTML5 tries to work around these issues by providing a simpler alternate syntax (<meta charset="utf-8">), and by making sure that the declaration of encoding is present at the very beginning of the head.
XML, on the other hand, had a way to declare encoding at the document level, in the XML declaration. The good thing about it is that this declaration MUST be at the very beginning of the document, which alleviates the pain of having to sniff the content.
<?xml version="1.0" encoding="UTF-8"?>
The XML specification also defines, in its Appendix F, a recommended algorithm for encoding detection.
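For illustration, a much-simplified version of that detection (Python; it only handles the common BOMs and an ASCII-compatible XML declaration, unlike the full Appendix F algorithm) could look like this:
import re

# Much-simplified sketch of XML encoding detection: check for a BOM first,
# then look for an encoding pseudo-attribute in an XML declaration at the
# very start of the document. Appendix F also covers BOM-less UTF-16/UTF-32
# byte patterns, EBCDIC, and more.
BOMS = [
    (b'\xef\xbb\xbf', 'utf-8'),
    (b'\xff\xfe', 'utf-16-le'),
    (b'\xfe\xff', 'utf-16-be'),
]
XML_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

def detect_xml_encoding(raw_bytes):
    for bom, name in BOMS:
        if raw_bytes.startswith(bom):
            return name
    match = XML_DECL.match(raw_bytes)
    if match:
        return match.group(1).decode('ascii')
    return 'utf-8'  # XML's default in the absence of any other information

print(detect_xml_encoding(b'<?xml version="1.0" encoding="UTF-8"?><doc/>'))  # -> 'UTF-8'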
The Recipe
Given all these potential sources for the declaration (or automatic detection) of the document character encoding, all potentially contradicting the others, what should be the recipe to reliably figure out which encoding to use?
- The charset info in the HTTP Content-Type header should have precedence. Always.
- Next in line is the charset information in the XML declaration. Which may be there, or may not. For XHTML documents, and in particular for XHTML documents served as text/html, it is recommended to avoid using an XML declaration. But let's remember: XHTML is XML, and XML requires an XML declaration (or some other method of declaration) for XML documents using encodings other than UTF-8 or UTF-16 (or ASCII, which is a convenient subset...). As a result, anything served as application/xhtml+xml (or text/html and looking a lot like XHTML), with no encoding declaration at either the HTTP level or in an XML declaration, is quite likely to be UTF-8 or UTF-16. Then there is the BOM, a signature for Unicode character encodings.
- Then comes the search for the meta information that might, just might, provide a character encoding declaration.
- Beyond that point, it's the land of defaults and heuristics. You may choose to default to iso-8859-1 for text/html resources and utf-8 for application/xhtml+xml. The rest is heuristics. You could venture towards fallback encodings such as windows-1252, which many consider a safe bet, but a bet nonetheless.
- There are quite a few algorithms to determine the likeliness of one specific encoding based on matching at the byte level. Martin Dürst wrote a regexp to check whether a document will "fit" as utf-8. If you know other reliable algorithms, feel free to mention them in the comments, and I will list them here. A rough sketch of the whole cascade follows after this list.
Does this seem really ugly and complicated to you? You will love the excellent Encoding Divination Flow Chart by Philip Semanchuk, the developer of the Web quality checker "Nikita the spider".
Or, if this is still horribly fuzzy after looking at the flow chart, why not let a tool do that for you? The HTML::Encoding perl module by Björn Höhrmann does just that.
Last word... for HTML authors
If you create content on the Web and never have to read and parse content on the web, and if you have read this far, you are probably considering yourself very lucky right now. But you can make a difference by making sure the content you put on the web uses consistent character encodings, and by declaring them properly. Your job is actually much easier than the tricky, winding road to determining a document's encoding. In the proverbial three steps:
- Use utf-8. Unless you have very specific needs such as very rare character variants in Asian languages, this should be your charset of choice. Most modern text, web or code editors are likely to support UTF-8; some actually only support this encoding. If possible, choose an editor or setup that will not output a BOM in UTF-8 files, as this is known to cause ugly display issues with some agents, and can even break PHP includes (a small check for stray BOMs is sketched after this list).
- If you have access to the configuration of your web server, make sure that it serves HTML as utf-8.
- That's all.
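If you are not sure whether your editor sneaks a BOM into your UTF-8 files, a tiny check (and fix) along these lines can help; this is only a sketch, and 'page.html' is a made-up file name:
# Tiny sketch: detect and strip a UTF-8 BOM from a saved file.
# 'page.html' is just an example file name.
BOM = b'\xef\xbb\xbf'

with open('page.html', 'rb') as f:
    data = f.read()

if data.startswith(BOM):
    with open('page.html', 'wb') as f:
        f.write(data[len(BOM):])
    print('BOM removed')
else:
    print('no BOM found')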
Comments
Regarding the final note to HTML authors: our server serves HTML as iso-8859-1. If I follow rule one and use utf-8, then I get a warning from validation engines that the page's meta charset declaration disagrees with the HTTP header's charset declaration. So to stay clean and valid I have to use iso-8859-1 in my HTML pages, including those I write in HTML5.
The server cannot be changed without potentially breaking the large number of existing HTML4 pages that declare themselves to be iso-8859-1 encoded. My guess is that this situation is fairly common. Thus to use utf-8, I have to code using XML and send pages with a .xhtml extension. My XHTML pages are sent as application/xhtml+xml with utf-8 encoding by our server, for which IE happily shows only the page source.
Thanks for the excellent article in any case!
@ Dana Lee Ling: good point indeed. A web server should either give the content managers the possibility of overriding the default character encoding, or not set a default at all. You may want to point whoever manages your web server to the chips document, particularly the section on character encoding...
On the W3C Web server we solved this issue thanks to dated space URIs. The default used to be iso-8859-1 for all documents, but for anything published into, say, /2007/, the default is utf-8.
The first point of the recipe, a "charset info in the HTTP Content-Type header should have precedence", won't fly if an explicit Latin-1 actually means "dunno", while no charset means default Latin-1, or in practice windows-1252 as far as HTML 5 is concerned. For this part of the madness folks could still try to fix it in the HTTP WG working on 2616bis.
A serious problem is the alleged "rough consensus" for "the server knows best". One of several premises is that there are no other protocols and URI schemes, only HTTP exists.
In the real world many HTTP servers don't know best, have no time to guess, if they try it anyway they will often get it wrong, many users have no way to fix it, and if the server says it is Latin-1 this likely means "dunno", while "dunno" means default Latin-1, see above.
@ Frank: I agree that the default of latin-1 in HTTP is problematic, and it looks like the WG working on HTTPbis is not refusing to look into it, but they can't find a good workaround. You may be able to help by drafting one (or several) replacements and bringing them to their consideration?
Some followup discussion on the comment sent by Frank is taking place on the w3c validators mailing-list, too.
XML still allows infinite whitespace within the XML declaration...
@ Anne: Yes, an xml declaration can have whitespace in it.
I'm not sure I get your point, though… I certainly wouldn't agree that the whitespace within the declaration puts a heavy burden on the parser, and since the spec very clearly forbids having whitespace (or anything, for that matter) before the declaration, I think charset info in an XML declaration is pretty much the easiest in-document encoding declaration, ever.
That is not correct. The article you're referring to in the link talks about entities, by the way.
Anyways, XML documents actually must not use UTF-16 if their MIME type is text/xml. The assumed defaults are UTF-8 for XML documents served with an application/xml type, and ASCII with text/xml (also overriding the HTTP iso-8859-1 default).
@ David Zülke
I think the term “entity” in the XML specification is sometimes confusing…
I am rewording the article to be clearer about the fact that each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, and that in this context, as far as I (and every expert I asked about this) can tell, an entity is a physical building block of an XML document.
I don't think that's true. See the RFC on text/xml.
utf-8 for Asian languages can be a much larger payload than, say, Big5 or a specific encoding for that language. I've seen requirements for lots of websites that specify character encodings specific to the "locale" being displayed.
@ Brian Repko
I see your point here.
As a developer of content, e.g. in Japanese, I could indeed just work with iso-2022-jp or shift-jis, but the fact is I don't know who is going to want to parse/read/use that content.
And in order to keep options open, I think I'd rather bet on internationalized tools that support Unicode – even if, admittedly, some local tools sometimes support only the local encodings, and not utf-8…
I'm glad you like the encoding divination flowchart. It took me a while to puzzle out those rules and writing the article helped to reinforce what I'd learned. Looking at it now with fresh eyes, I see that it's more complicated than it needs to be and I'll simplify it when I get a chance.
I'm curious if you know of any specification that states what precedence a BOM has relative to a META http-equiv or XML encoding declaration. I have not been able to find one. In my flowchart, I gave BOMs second priority after the HTTP Content-Type header because I figured that a BOM written by a text editor was more likely to be correct than a declaration written by the page author.
In "The Recipe" above, BOMs are listed in item #2, but almost as an afterthought. Don't you think they deserve their own item in the list? After all, one can find them in documents that have no pretensions whatsoever to being XML.
"In the beginning the Web had ASCII."
When was that? My recollection was that the Web started with Latin-1 - which was seen as one of its advantages (by European folks) and then became a disadvantage (eg harder to introduce UTF-8).
The "last word" is sound advice. There is very little reason to use anything other than UTF-8 nowadays for any new content.
@Brian: Is the Big5 vs UTF-8 size difference for plain text, or for markup (HTML, SVG, whatever)? Interested to see some stats on that.
Regarding text/* and UTF-16 - yes, actually, the requirements for the text/* top-level type on fallback, and the decision in RFC 3023 that the HTTP charset overrides the character encoding in the content even if the charset is missing, mean that, in theory, UTF-16 content could be displayed as fallback text/plain in US-ASCII with every other character having code point zero. In practice people seem to believe the content, if HTTP supplies no charset.