Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), CSS coders, Web project managers, and anyone who is new to character encodings and needs an introduction to how to choose and apply character encodings.
Updated 2010-08-12 12:56
Which character encoding should I use for my content, and how do I apply it to my content?
Content is composed of a sequence of characters. Characters represent letters of the alphabet, punctuation, etc. But content is stored in a computer as a sequence of bytes, which are numeric values. Sometimes more than one byte is used to represent a single character. Like codes used in espionage, the way that the sequence of bytes is converted to characters depends on what key was used to encode the text. In this context, that key is called a character encoding.
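As a small illustration of the "key" idea, the following Python sketch encodes a short string to bytes and then reads the same bytes back with two different character encodings. The sample word and the choice of ISO-8859-1 as the "wrong" key are assumptions made for the example only.

```python
# One string, stored as bytes, then read back with two different "keys".
text = "café"
data = text.encode("utf-8")            # characters -> bytes

# The é needs two bytes in UTF-8, so 4 characters become 5 bytes.
print(list(data))                      # [99, 97, 102, 195, 169]

# Decoding with the encoding that was used to save the text works.
right_key = data.decode("utf-8")       # 'café'

# Decoding the same bytes with another encoding produces different characters.
wrong_key = data.decode("iso-8859-1")  # 'cafÃ©'

print(right_key, wrong_key)
```

The bytes never change between the two calls; only the interpretation does.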
There are many character encodings to choose from. This article offers simple advice on which character encoding to use for your content, and how to apply it, i.e. how to actually produce a document in that encoding.
If you need to better understand what characters and character encodings are, see the article Character encodings for beginners.
An HTML page can only be in one encoding. You cannot encode different parts of a document in different encodings.
A Unicode-based encoding such as UTF-8 can support many languages and can accommodate pages and forms in any mixture of those languages. Its use also eliminates the need for server-side logic to individually determine the character encoding for each page served or each incoming form submission. This significantly reduces the complexity of dealing with a multilingual site or application.
A Unicode encoding also allows many more languages to be mixed on a single page than almost any other choice of encoding.
Any barriers to using Unicode are very low these days. In fact, in August 2010 Google reported that over 50% of the Web in their sample of several billion pages was now using UTF-8. Add to that the figure for ASCII-only web pages (since ASCII is a subset of UTF-8), and the figure rises to nearly 70%.
There are three different Unicode character encodings: UTF-8, UTF-16 and UTF-32 (see Character sets, coded character sets, and encodings). Of these three, UTF-8 is recommended for use with Web content. In fact the HTML5 specification draft currently says "Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. Authoring tools should default to using UTF-8 for newly-created documents."
Note, in particular, that all ASCII characters in UTF-8 use exactly the same bytes as an ASCII encoding, which often helps with interoperability and backwards compatibility.
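The point about ASCII can be checked directly. This is a minimal Python sketch; the sample string and the helper name `is_ascii_only` are made up for the illustration.

```python
# ASCII text produces identical bytes whether it is saved as ASCII or as UTF-8.
ascii_text = "Hello, world!"
as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")
same_bytes = as_ascii == as_utf8       # True


def is_ascii_only(data: bytes) -> bool:
    """Return True if every byte is below 0x80.

    Such data is valid ASCII and, byte for byte, valid UTF-8 as well.
    """
    return all(b < 0x80 for b in data)
```

A file for which `is_ascii_only` returns True can be relabelled as UTF-8 without changing a single byte.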
Support for a given encoding, even a Unicode encoding, does not necessarily imply that a user agent will correctly display the text. Numerous scripts, such as Arabic and Indic, require additional rules to transform the character sequence in memory to an appropriate sequence of font glyphs for display.
If you don't use a Unicode encoding, select an encoding that maximizes the opportunity to directly represent characters and minimizes the need to represent characters by using character escapes.
Where you have a choice for a particular language, script, or group of languages, select the most commonly supported encoding, and check that user agents adequately support the encoding selected.
Consider a solution that minimizes complexity when dealing with multiple languages and scripts.
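One rough way to compare candidate encodings against this advice is to count which characters of some representative content each encoding cannot represent directly. The sketch below is an illustration only: the helper name `chars_needing_escapes`, the sample text and the list of candidates are assumptions, and it relies on the codecs that ship with Python rather than on what any particular browser supports.

```python
def chars_needing_escapes(text: str, encoding: str) -> set:
    """Return the characters in text that the encoding cannot represent directly."""
    missing = set()
    for ch in set(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.add(ch)
    return missing


# Hypothetical sample of the content you expect to publish.
sample = "Prix : 10 € pour l'œuvre"

for candidate in ("iso-8859-1", "windows-1252", "iso-8859-15", "utf-8"):
    print(candidate, sorted(chars_needing_escapes(sample, candidate)))

# A character that is missing from the encoding has to be written as an escape.
escaped = "€".encode("iso-8859-1", errors="xmlcharrefreplace")  # b'&#8364;'
```

An encoding that returns an empty set for your real content needs no escapes for it; the fewer characters reported, the better the fit.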
The HTML5 specification calls out a number of encodings that you should avoid.
Documents should not use JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), encodings based on ISO-2022, or encodings based on EBCDIC. This is because they allow ASCII code points to represent non-ASCII characters, which poses a security threat.
Documents must not use CESU-8, UTF-7, BOCU-1, or SCSU encodings, since they were never intended for Web content.
The specification also advises against the use of UTF-32.
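The lists above could be turned into a simple check in a publishing script. The following Python sketch is one possible shape for that; the function name, the returned strings and the label matching are assumptions. In particular, it only recognises EBCDIC-based encodings whose label contains "ebcdic", so it will miss others (for example labels of the form "IBM037").

```python
# Labels taken from the lists above, lower-cased for comparison.
SECURITY_RISK = {"jis_c6226-1983", "jis_x0212-1990", "hz-gb-2312", "johab"}
NOT_FOR_WEB = {"cesu-8", "utf-7", "bocu-1", "scsu"}


def encoding_advice(label: str) -> str:
    """Classify an encoding label according to the lists in this article."""
    name = label.strip().lower()
    if name in NOT_FOR_WEB:
        return "must not use"
    # Simplistic matching: prefix for ISO-2022 variants, substring for EBCDIC.
    if name in SECURITY_RISK or name.startswith("iso-2022") or "ebcdic" in name:
        return "should not use"
    if name.startswith("utf-32"):
        return "advised against"
    return "not on the avoid list"


for label in ("UTF-8", "UTF-7", "ISO-2022-JP", "UTF-32"):
    print(label, "->", encoding_advice(label))
```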
As a content author you need to check that your editor or scripts are saving text in the encoding of your choice.
Developers also need to ensure that the various parts of the system can communicate with each other, understand which character encodings are being used, and support all the necessary encodings and characters.
It is important to understand that just declaring an encoding inside a document or on the server using one of the methods described below won't usually change the bytes; you need to save the text in that encoding to apply it to your content. (The declaration just helps the browser interpret the sequences of bytes in which the text is stored.)
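The difference between declaring and saving can be seen in a short Python sketch. The page content and file name are invented for the example, and the `meta charset` form of the declaration is just one of the possible in-document declarations.

```python
import os
import tempfile

# The page declares UTF-8 in both cases below.
page = '<meta charset="utf-8"><p>café</p>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")

    # Saved as ISO-8859-1: the declaration says UTF-8, but the bytes do not agree.
    with open(path, "w", encoding="iso-8859-1") as f:
        f.write(page)
    with open(path, "rb") as f:
        mismatched = f.read()

    # Saved as UTF-8: the bytes now match the declaration.
    with open(path, "w", encoding="utf-8") as f:
        f.write(page)
    with open(path, "rb") as f:
        matched = f.read()

print(b"caf\xe9" in mismatched)      # True: é stored as the single byte 0xE9
print(b"caf\xc3\xa9" in matched)     # True: é stored as the UTF-8 bytes 0xC3 0xA9
```

The text of the declaration is identical in both files; only the `encoding` argument used when saving determines the bytes.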
The article Setting encoding in web authoring applications provides advice on how to set the encoding of a page while saving it, for a number of editing environments.
If you can, it is best to set up an encoding such as UTF-8 as the default for new documents in your editor. In an editor such as Dreamweaver, for example, you would do that in the preferences.
You may also need to check that your server is serving documents with the right HTTP declarations, since any encoding declared in the HTTP header will otherwise override the in-document information (see the next section).
Let's say, for example, that you saved your data as UTF-8. Although you saved your data in the right encoding, and even if you declared in the page that the page encoding is UTF-8, your server may still be serving the page with an accompanying HTTP header that says it is something else.
Any declaration in the HTTP header will override information inside the page, causing problems for your content.
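The precedence described here can be modelled in a few lines of Python. This is a deliberately simplified sketch: the function names are invented, the in-document declaration is found with a rough regular expression rather than a real HTML parser, and other factors that browsers take into account (such as a byte-order mark) are ignored.

```python
import re
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def charset_from_meta(html_head: bytes) -> Optional[str]:
    """Roughly locate a charset declaration in a meta element."""
    match = re.search(
        rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', html_head, re.I
    )
    return match.group(1).decode("ascii").lower() if match else None


def effective_encoding(content_type: Optional[str], html_head: bytes) -> Optional[str]:
    """The HTTP declaration wins; the in-document one is only a fallback."""
    http_charset = charset_from_content_type(content_type) if content_type else None
    return http_charset or charset_from_meta(html_head)


head = b'<meta charset="utf-8">'
print(effective_encoding("text/html; charset=ISO-8859-1", head))  # iso-8859-1
print(effective_encoding("text/html", head))                      # utf-8
```

In the first call the page is correctly saved and declared as UTF-8, yet the header still causes it to be read as ISO-8859-1.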
You may not have control over the declarations that come with the HTTP header, and may have to contact the people who manage the server for help. On the other hand, there are sometimes ways you can fix things yourself if you have limited access to server setup files or are generating pages using scripting languages. For example, see Setting the HTTP charset parameter for more information about how to change the encoding information, either locally for a set of files on a server, or for content generated using a scripting language.
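For content generated by a script, the general idea is to send a Content-Type header whose charset parameter matches the encoding actually used for the response body. The following is a minimal sketch using Python's WSGI interface; the application name `app` and the page content are assumptions, and other scripting languages have their own equivalents.

```python
def app(environ, start_response):
    """Minimal WSGI application that declares and uses UTF-8 consistently."""
    html = '<!DOCTYPE html><meta charset="utf-8"><title>Test</title><p>Žluťoučký kůň</p>'

    # Encode the body with the same encoding that the header declares.
    body = html.encode("utf-8")

    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, the application can be served with `wsgiref.simple_server.make_server` from the Python standard library.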
Typically, before doing so, you need to check whether this is actually the root of the problem or not. You could use the W3C Internationalization Checker to find out what character encoding, if any, is specified in the HTTP header. Alternatively, the article Checking HTTP Headers points to some other tools for checking the encoding information passed by the server.
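If you prefer to check from a script, something along the following lines may be enough. It is a sketch only: the function names are invented, it assumes the server answers HEAD requests, and the URL in the usage comment is a placeholder.

```python
import urllib.request
from http.client import HTTPMessage
from typing import Optional


def declared_charset(headers: HTTPMessage) -> Optional[str]:
    """Return the charset parameter of the Content-Type header, or None."""
    return headers.get_content_charset()


def http_declared_charset(url: str) -> Optional[str]:
    """Request only the headers for url and report the charset the server declares."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return declared_charset(response.headers)


# Usage (placeholder URL, requires network access):
# print(http_declared_charset("https://example.org/"))
```

A result of None means the server sent no charset in the HTTP header, so the in-document declaration is what counts.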
Content first published 2010-08-12. Last substantive update 2010-08-12 12:56 GMT. This version 2012-08-20 18:00 GMT
For the history of document changes, search for qa-choosing-encodings in the i18n blog.
Copyright © 2010-2012 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.