

Introducing Character Sets and Encodings

Intended audience: anyone who is new to internationalization and needs guidance on topics to consider and ways to get into the material on the site.

This page provides some orientation for newcomers to Web internationalization who don't really know where to start.

By listing a number of articles we have created on a particular topic area, we hope to help you see how things fit together, and to give you a starting point for further exploration of the topic and the other related articles on the site.

We also try to indicate, in broad brush strokes, who would be interested in which aspects of the topic.

After reading these resources, you can find more detailed information using the topic index, techniques index or the search box on this page.

What's it about?

A character set is a collection of letters and symbols used in a writing system. For example, the ASCII character set covers letters and symbols for English text, ISO-8859-6 covers letters and symbols needed for many languages based on the Arabic script, and the Unicode character set contains characters for most of the living languages and scripts in the world.

Characters in a character set are stored as one or more bytes in a computer. Each byte or sequence of bytes represents a given character. A character encoding is the key that maps each byte or sequence of bytes to a particular character, which a font then renders as text.
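As a small illustration of the mapping between characters and bytes, the following Python sketch encodes one character under two different encodings. The choice of the character "é" and of the two encodings is ours, purely for illustration.

```python
# The same character maps to different byte sequences under different encodings.
text = "é"  # U+00E9 LATIN SMALL LETTER E WITH ACUTE

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

print(utf8_bytes)    # b'\xc3\xa9'  (two bytes)
print(latin1_bytes)  # b'\xe9'      (one byte)

# Decoding with the encoding that produced the bytes recovers the character.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("iso-8859-1") == text
```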

There are many different character encodings. If the wrong encoding is applied to the bytes in memory, the result will be unintelligible text. It is therefore important, if people are to read your content, that you correctly label the character encoding used.
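To make the effect of a wrong label concrete, here is a minimal Python sketch that stores text as UTF-8 and then reads the same bytes back as ISO-8859-1. The sample word and the pair of encodings are assumptions chosen for the example.

```python
original = "café"
stored = original.encode("utf-8")  # b'caf\xc3\xa9'

# Reading UTF-8 bytes as ISO-8859-1 does not fail; it silently yields garbage.
misread = stored.decode("iso-8859-1")
print(misread)  # cafÃ©

# The reverse mistake is detected, because a lone 0xE9 byte is not valid UTF-8.
latin1_stored = original.encode("iso-8859-1")  # b'caf\xe9'
try:
    latin1_stored.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)  # True
```

Note that only the second mistake raises an error. The first produces plausible-looking but wrong text, which is why a correct label matters.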

Choosing an encoding

Everyone developing content, whether content authors or programmers, must decide what character encoding to use. UTF-8 is a popular recommendation these days, but there may still be things you should consider before using it.

Declaring and applying an encoding

Once an encoding has been chosen, content developers and programmers must ensure that it is declared in the right way.

With a technology such as XHTML, encoding declarations are not always straightforward; they require an understanding of 'standards' vs. 'quirks' modes, and the impact of the XML declaration.

You must also ensure that your data is actually saved in the encoding you have chosen; it is not sufficient just to label it.
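The difference between labelling and saving can be sketched in Python as follows. The file name page.xhtml and the sample markup are hypothetical; the point is only that the encoding passed when writing the file has to agree with the one named in the declaration.

```python
import os
import tempfile

# The XML declaration is only a label; the bytes on disk must match it.
markup = '<?xml version="1.0" encoding="UTF-8"?><p>Zoë</p>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.xhtml")  # hypothetical file name
    # Pass the encoding explicitly rather than relying on the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(markup)
    with open(path, "rb") as f:
        raw = f.read()

# "ë" (U+00EB) is stored as the two bytes C3 AB in UTF-8.
print(b"\xc3\xab" in raw)           # True
print(raw.decode("utf-8") == markup)  # True
```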

Content developers and webmasters may also need to ensure that the server delivers content with the correct character encoding declarations, since server settings can override in-document declarations.
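As a rough sketch of that precedence, the Python below reads the charset parameter from an HTTP Content-Type header value and falls back to the in-document label only when the header carries none. The helper names are hypothetical, and the rule is simplified; real user agents apply more detailed rules.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: Optional[str]) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def effective_charset(http_content_type: Optional[str],
                      in_document: Optional[str]) -> Optional[str]:
    """Hypothetical helper: prefer the HTTP header, else the in-document label."""
    from_header = charset_from_content_type(http_content_type)
    if from_header:
        return from_header
    return in_document.lower() if in_document else None


# The server's declaration wins over the label inside the document.
print(effective_charset("text/html; charset=ISO-8859-1", "utf-8"))  # iso-8859-1
# With no charset in the header, the in-document label is used.
print(effective_charset("text/html", "UTF-8"))                      # utf-8
```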

Escapes

Escapes are a way of representing a character using only ASCII text. They provide a way of representing characters that are not available in the character encoding you are using, or a way of avoiding the use of the character for other reasons (such as when they may conflict with syntax). You should be clear on when and how these escapes should be used.
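Both uses of escapes can be illustrated with Python's standard library. This is a sketch using HTML/XML-style character references; the sample strings are ours.

```python
import html

# 1. Representing a character that the target encoding (here ASCII) lacks:
#    unencodable characters become numeric character references.
ncr = "é".encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ncr)  # &#233;

# 2. Avoiding a conflict with markup syntax: "<" and "&" are escaped.
safe = html.escape("1 < 2 & 3")
print(safe)  # 1 &lt; 2 &amp; 3

# Decimal, hexadecimal and named references all resolve to the same character.
decoded = html.unescape("&#233; &#xE9; &eacute;")
print(decoded)  # é é é
```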

Web addresses

These days Web addresses can also include non-ASCII characters. The user does little other than click on the appropriate link or enter the text as they see it; the heavy lifting is done by the user agent. You may nevertheless be interested to know how this works.
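The kind of conversion a user agent performs can be approximated in Python, as sketched below. The host name and path are made-up examples, and Python's built-in idna codec follows the older IDNA 2003 rules, so treat this as an illustration rather than a description of what any particular browser does.

```python
from urllib.parse import quote

host = "bücher.example"  # hypothetical domain name
path = "/café"           # hypothetical path

# Domain name labels with non-ASCII characters are converted to an ASCII form.
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)  # xn--bcher-kva.example

# Path characters are converted to UTF-8 bytes and then percent-escaped.
ascii_path = quote(path)
print(ascii_path)  # /caf%C3%A9

print("http://" + ascii_host + ascii_path)
```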

Validating and troubleshooting

If you are struggling with an encoding problem, you may find the following resources of use:

Author: Richard Ishida, W3C.

Content first published 2006-01-16. Last substantive update 2006-01-16 18:29 GMT. This version 2007-03-07 12:45 GMT.

For the history of document changes, search for gs-characters in the i18n blog.