In the beginning the Web had ASCII. And that was good. But then, not really. The Europeans and their strange accents were a bit of a problem.
So then the Web had iso-latin1. And HTML could be assumed to be using that, by default (RFC2854, section 4). And that was good. But then, not really. There was a whole world out there, with a lot of writing systems, tons of different characters. Many different character encodings…
Today we have Unicode, at long last well adopted in most modern computing systems, and a basic building block of a lot of web technologies. And although there are still a lot of different character encodings available for documents on the web, this is not an issue, as there are mechanisms, both in HTTP and within HTML for instance, to declare the encoding used and help tools determine how to decode the content.
All is not always rosy, however. The first issue is that there are quite a lot of mechanisms to declare encoding, and that they don’t necessarily agree. The second issue is that not everyone can configure a Web server to declare encoding of HTML documents at the HTTP level.
Many sources, One encoding
If the box says “dangerous, do not open”, don’t peek inside the box…
A long (web) time ago, there was a very serious discussion to try and determine whether a Web resource was supposed to know its encoding best, or whether the Web server should be the authoritative source.
In the “resource” camp, some were pushing the rather logical argument that a specific document surely knew better about its own metadata than a misconfigured Web server. Who cares if the server thinks that all HTML documents it serves are iso-8859-1, when I, as document author, know full well that I am authoring this particular resource as
The other camp had two killer arguments.
The first, and perhaps the simplest, argument was: what’s the point of having user agents sniff garbage in the hope of finding content, and perhaps a character encoding declaration, when the transport protocol has a way of declaring it? This is the basis for the authoritative metadata principle. This principle is also sometimes summarized as: if I want to show an HTML document as plain text source, rather than have it interpreted by browsers, I should be able to do so. I should be able to serve any document as text/plain if that is my choice.
The second killer argument was transcoding. A lot of proxies, they said, transform the content they proxy, sometimes from one character encoding to another. So even though a document might say “I am encoded in iso-2022-jp”, the proxy should be able to say “actually, trust me, the content I am delivering to you is in
In the end, the apparent consensus was that the “server knows best” camp had the sound architectural arguments behind them, and so, for everything on the web served via the HTTP protocol, HTTP has precedence over any other method in determining the encoding (and content type, etc.) of resources.
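The transcoding argument is easier to see with bytes in hand. Here is a minimal sketch, assuming a proxy that converts a page from iso-2022-jp to utf-8; the `transcode` helper and the sample page are made up for illustration. Once the proxy has done its work, the declaration inside the document is stale, and only the transport layer can tell the truth.

```python
def transcode(body: bytes, source: str, target: str) -> bytes:
    """Decode with the source encoding, then re-encode with the target one."""
    return body.decode(source).encode(target)


# Hypothetical page whose embedded declaration matches its original encoding.
page = (
    '<meta http-equiv="Content-Type" '
    'content="text/html; charset=iso-2022-jp"><p>日本語</p>'
)
original = page.encode("iso-2022-jp")

# What a transcoding proxy would deliver: the bytes are now UTF-8,
# but the embedded declaration still claims iso-2022-jp.
delivered = transcode(original, "iso-2022-jp", "utf-8")
```

A user agent trusting the embedded declaration over the proxy's HTTP header would decode `delivered` with the wrong encoding.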
This means that regardless of what is in an (x)html document, if the server says “this is a text/html document encoded as utf-8”, user agents should follow that information. Second guessing is likely to cause more harm than good.
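For a tool reading content off the web, following that information starts with pulling the charset parameter out of the Content-Type header. Here is one possible sketch using Python's standard `email.message` module to parse the header value; the function name `charset_from_content_type` is my own, not part of any library.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()
```

With this, `charset_from_content_type("text/html; charset=UTF-8")` gives `"utf-8"`, and a bare `"text/html"` gives `None`, which is the cue to move on to other sources.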
Unlabeled boxes can be full of treasures, or full of trouble
But what if there is no character encoding declared at the HTTP level? This is where it gets tricky.
“Old school” HTML introduced a specific meta tag for the declaration of the encoding within the document:
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-5">
Over the years, we have seen that this method was plagued by two serious issues:
- Nobody seems to get it right (it is just… too complicated!), and the Web is littered with approximate, sometimes comical, variants of this syntax. This is no laughing matter for user agents, however, which can’t even expect to find this encoding declaration properly marked up!
- meta elements have to be within the head of a document, but there is no guarantee that they will be anywhere near the top of the document. The head of a document can hold lots of other metadata (title, description, scripts and stylesheets) before declaring the encoding. This means a lot of sniffing and pseudo-parsing of undecoded garbage. In some cases, it can have dreadful consequences, such as security flaws in the approximate sniffing code.
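To give an idea of what that sniffing looks like, here is a deliberately naive sketch: a regular expression run over the first undecoded bytes of a document. The function name and the 1024-byte limit are assumptions of mine, and real user agents are far more careful than this.

```python
import re
from typing import Optional

# Tolerates missing quotes and extra attributes, since real-world markup is sloppy.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(raw: bytes, limit: int = 1024) -> Optional[str]:
    """Look for a charset declaration in the first `limit` bytes of a document."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None
```

Note that this only works when the document's encoding is ASCII-compatible, and that a declaration buried further down than `limit` bytes is simply missed, which is exactly the second issue above.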
It is worth noting that current work on html5 tries to work around these issues by providing a simpler alternate syntax, and making sure that the declaration of encoding should be present at the very beginning of the
XML, on the other hand, had a way to declare encoding at the document level, in the XML declaration. The good thing is that this declaration MUST be at the very beginning of the document, which alleviates the pain of having to sniff the content.
<?xml version="1.0" encoding="UTF-8"?>
The XML specification also defines, in its Appendix F, a recommended algorithm for the encoding detection.
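The full Appendix F algorithm also deals with encodings that are not ASCII-compatible. As a much simplified sketch, assuming the document is in an ASCII-compatible encoding, reading the declaration can be as small as the following. The function name is hypothetical, and this is not the Appendix F algorithm itself.

```python
import re
from typing import Optional

# Anchored at the very first byte: the XML declaration cannot appear anywhere else.
XML_DECL_ENCODING = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_declared_encoding(raw: bytes) -> Optional[str]:
    """Return the encoding named in a leading XML declaration, if there is one."""
    match = XML_DECL_ENCODING.match(raw)
    return match.group(1).decode("ascii").lower() if match else None
```

Because the pattern is anchored at the start, there is no scanning through the rest of the document: either the declaration is there, or it is not.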
Given all these potential sources for the declaration (or automatic detection) of the document character encoding, all potentially contradicting the others, what should be the recipe to reliably figure out which encoding to use?
The charset info in the HTTP Content-Type header should have precedence. Always.
Next in line is the charset information in the XML declaration. Which may be there, or may not.
For XHTML documents, and in particular for the XHTML documents served as text/html, it is recommended to avoid using an XML declaration.
But let’s remember: XHTML is XML, and XML requires an XML declaration or some other method of declaration for XML documents using encodings other than UTF-8 or UTF-16 (or ASCII, which is a convenient subset…).
As a result, anything served as text/html that looks a lot like XHTML, with an encoding declaration neither at the HTTP level nor in an XML declaration, is quite likely to be UTF-8 or UTF-16.
Then there is the BOM, a signature for Unicode character encodings.
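Checking for that signature is a matter of comparing the first few bytes against known values. Here is a small sketch using the BOM constants from Python's `codecs` module; the function name and the returned labels are my own choice.

```python
import codecs
from typing import Optional

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def encoding_from_bom(raw: bytes) -> Optional[str]:
    """Return the Unicode encoding signalled by a leading BOM, if there is one."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None
```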
Then comes the search for the meta information that might, just might, provide a character encoding declaration.
Beyond that point, it’s the land of defaults and heuristics. You may choose to default to
The rest is heuristics. You could venture towards fallback encodings such as windows-1252, which many consider a safe bet, but a bet nonetheless.
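Put together, the order of precedence in this recipe could be sketched as below. The function assumes each candidate has already been extracted by some other means (or is `None` when that source is silent); its name and parameters are hypothetical, and the `windows-1252` fallback is only one possible bet that the caller is free to change.

```python
from typing import Optional, Tuple


def pick_encoding(
    http_charset: Optional[str] = None,
    xml_declaration: Optional[str] = None,
    bom: Optional[str] = None,
    meta: Optional[str] = None,
    fallback: str = "windows-1252",  # assumption: a heuristic bet, not a rule
) -> Tuple[str, str]:
    """Return (encoding, source) following the precedence order of the recipe."""
    candidates = (
        ("http", http_charset),             # 1. HTTP Content-Type header
        ("xml-declaration", xml_declaration),  # 2. XML declaration
        ("bom", bom),                       # 3. Unicode signature
        ("meta", meta),                     # 4. meta declaration
    )
    for source, value in candidates:
        if value:
            return value.lower(), source
    return fallback, "fallback"
```

Returning the source alongside the encoding makes it easy to report, for instance, that a document's meta declaration disagrees with what the server sent.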
There are quite a few algorithms to determine the likelihood of one specific encoding based on matching at byte level. Martin Dürst wrote a regexp to check whether a document will “fit” as utf-8. If you know other reliable algorithms, feel free to mention them in the comments and I will list them here.
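As one example of such a byte-level check, here is a minimal sketch. To be clear, this is not Martin Dürst's regexp: it swaps in Python's strict UTF-8 decoder to answer the same question, and the function name is my own.

```python
def fits_utf8(raw: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

A positive answer is a strong hint rather than a proof: pure ASCII content, for instance, “fits” as utf-8 and as many other encodings too.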
Does this seem really ugly and complicated to you? You will love the excellent Encoding Divination Flow Chart by Philip Semanchuk, the developer of the Web quality checker “Nikita the spider”.
Or, if this is still horribly fuzzy after looking at the flow chart, why not let a tool do that for you? The HTML::Encoding Perl module by Björn Höhrmann does just that.
Last word… for HTML authors
If you create content on the Web and never have to read and parse content on the web, and if you have read this far, you are probably considering yourself very lucky right now. But you can make a difference by making sure the content you put on the web uses consistent character encodings, and declares them properly. Your job is actually much easier than the tricky winding road to determining a document’s encoding. In the proverbial three steps:
- Use utf-8. Unless you have very specific needs such as very rare character variants in Asian languages, this should be your charset of choice. Most modern text, web or code editors are likely to support UTF-8, and some actually only support this encoding. If possible, choose an editor or setup that will not output a BOM in UTF-8 files, as this is known to cause some ugly display issues with some agents, and can even crash PHP includes.
- If you have access to the configuration of your web server, make sure that it serves HTML as utf-8.
- That’s all.
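What “serving HTML as utf-8” means depends on your server, and each server has its own configuration syntax. As a server-neutral illustration, here is a tiny WSGI application sketch in Python that encodes its output as UTF-8 and says so in the Content-Type header; the `app` name and the sample markup are made up for the example.

```python
def app(environ, start_response):
    """Minimal WSGI application declaring its encoding at the HTTP level."""
    html = "<!DOCTYPE html>\n<title>Caf\u00e9</title>\n<p>D\u00e9j\u00e0 vu</p>\n"
    body = html.encode("utf-8")  # encode without a BOM
    headers = [
        # The declared charset matches the encoding actually used above.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` will do. The point is simply that the bytes sent and the encoding declared agree.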