Character encodings

Intended audience: anyone wanting a quick summary of key information related to character encodings in HTML and XML. For more information follow the links in the text or see Further reading.

Updated 2006-07-20 09:00

The Document Character Set

The document character set for XML and HTML 4.0 is Unicode (also known as ISO 10646). This means that HTML browsers and XML processors should behave as if they used Unicode internally. It does not mean that documents have to be transmitted in a Unicode encoding: as long as client and server agree on the encoding, they can use any encoding that can be converted to Unicode. Read more about the document character set.
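To illustrate the point about transmission, here is a minimal Python sketch. The sample string and the three encodings are chosen only for illustration. It shows the same text travelling as three different byte sequences and arriving as the same Unicode string, provided the receiver knows which encoding was used.

```python
# The same text, serialized three different ways for transmission.
text = "Café"
encodings = ["utf-8", "utf-16", "iso-8859-1"]

# What would go over the wire: one byte sequence per encoding.
wire_forms = {name: text.encode(name) for name in encodings}

# What the receiver recovers when it decodes with the agreed encoding.
decoded = {name: data.decode(name) for name, data in wire_forms.items()}

for name in encodings:
    print(name, wire_forms[name].hex(), repr(decoded[name]))
```

The byte sequences differ, but each decodes to the same sequence of Unicode characters, which is what the processor works with internally.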

Declaring encodings

It is very important that the character encoding of any XML or (X)HTML document is clearly labeled, so that clients can easily map these encodings to Unicode. This can be done in the following ways...
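As a rough sketch, the Python below builds three widely used forms of encoding declaration: the charset parameter of the HTTP Content-Type header, the encoding attribute of the XML declaration, and a meta element in HTML. It assumes these three forms; the helper names `declarations`, `charset_from_content_type` and `decode_body` are hypothetical, and the UTF-8 fallback is an assumption made for the example. It also shows how a client might use the HTTP label to map the received bytes to Unicode.

```python
from email.message import Message


def declarations(encoding="utf-8"):
    """Return three common ways of labeling a document's encoding."""
    return {
        "http_header": f"Content-Type: text/html; charset={encoding}",
        "xml_declaration": f'<?xml version="1.0" encoding="{encoding}"?>',
        "html_meta": (
            '<meta http-equiv="Content-Type" '
            f'content="text/html; charset={encoding}">'
        ),
    }


def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lower-cased name, or None if absent


def decode_body(body, content_type, default="utf-8"):
    """Decode bytes using the declared charset (the fallback is an assumption)."""
    charset = charset_from_content_type(content_type) or default
    return body.decode(charset)


for label, line in declarations("utf-8").items():
    print(label, "->", line)
```

For example, `decode_body("é".encode("iso-8859-1"), "text/html; charset=ISO-8859-1")` recovers the original character because the label matches the bytes.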

For a discussion of which approach is best for which type of (X)HTML document, see the tutorial Character sets & encodings in XHTML, HTML and CSS.

The examples above show declarations for UTF-8 encoded content. This is likely to be the best choice of encoding for most purposes, but it is not the only possibility.

If you are not using UTF-8, replace the utf-8 text in the examples above with the name of the encoding you have chosen. You can see the full list of character encoding names registered by IANA (long). In practice, only a few encodings are likely to be used: ISO-8859-1 (Latin-1), US-ASCII, UTF-16, the other encodings in the ISO-8859 series, iso-2022-jp, euc-kr, and so on.
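Before putting an encoding name into a declaration, it can help to check that your tools recognise it. The sketch below uses Python's standard `codecs.lookup` to resolve several of the names mentioned above; the helper `resolve` is hypothetical. Note that Python's codec registry is not the IANA registry: it accepts many IANA names and aliases, but the canonical names it reports are Python's own.

```python
import codecs

# Names taken from the paragraph above, plus one other ISO-8859 member.
names = ["ISO-8859-1", "US-ASCII", "UTF-16", "ISO-8859-15", "iso-2022-jp", "euc-kr"]


def resolve(name):
    """Return Python's canonical codec name, or None if the name is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


resolved = {name: resolve(name) for name in names}

for name, canonical in resolved.items():
    print(f"{name} -> {canonical}")
```

Lookups are case-insensitive and aliases resolve to one codec, so `resolve("latin-1")` and `resolve("ISO-8859-1")` return the same value.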

Ensuring the declaration works

It is important to not only use the encoding declarations above in HTTP or content, but also:

For more information on these topics follow the links in Changing (X)HTML page encoding to UTF-8. Although it is written from a UTF-8 perspective, it applies to whatever encoding you use.
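One check that is easy to automate, offered here as an assumption rather than a summary of the guidance above, is whether the bytes of a page actually decode under the declared encoding. The function name `matches_declaration` is hypothetical.

```python
def matches_declaration(data, declared):
    """Return True if the bytes decode without error under the declared encoding."""
    try:
        data.decode(declared, errors="strict")
        return True
    except (UnicodeDecodeError, LookupError):
        # Either the bytes are invalid for this encoding or the name is unknown.
        return False


saved_as_latin1 = "é".encode("iso-8859-1")
saved_as_utf8 = "é".encode("utf-8")

print(matches_declaration(saved_as_utf8, "utf-8"))
print(matches_declaration(saved_as_latin1, "utf-8"))
```

This only catches some mismatches. Single-byte encodings such as ISO-8859-1 accept every byte sequence, so a page saved as UTF-8 but declared as ISO-8859-1 passes the check and still displays wrongly.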

By the way

Values for the encoding attribute can be found in the IANA registry. Note that these are called charset names, although in reality they refer to the encodings, not the character sets.
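The distinction can be made concrete with a short Python sketch, using the euro sign as an arbitrary example. The character set assigns the character a number (its code point), while the name given in a charset label selects how that character is serialized as bytes.

```python
char = "€"

# Character set: the character's number in Unicode.
code_point = ord(char)

# Encodings: how the same character is written as bytes.
serialized = {
    name: char.encode(name).hex()
    for name in ("utf-8", "utf-16-be", "iso-8859-15")
}

print(f"U+{code_point:04X}")       # U+20AC
for name, hex_bytes in serialized.items():
    print(name, hex_bytes)         # e282ac, 20ac, a4
```

The character has one Unicode code point, but three different byte representations depending on the encoding named.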

If you want in-depth information related to the term 'charset', see an article by Dan Connolly ("Character Set" Considered Harmful) and a response by Glenn Adams (Character Set Terminology, SC2 vs. SC18 vs. Internet Standards).

Historical note: Rick Jelliffe proposed using the SPREAD entities from ERCS.

Further reading

Helpful introductions:

References in specifications:

Other links:

By: Bert Bos, W3C. Changed by: Martin J. Dürst, W3C; Richard Ishida, W3C.

Content first published 1996-05-31. Last substantive update 2006-07-20 09:00 GMT. This version 2011-01-26 20:10 GMT.

For the history of document changes, search for article-O-charset in the i18n blog.