What is encoding


Status: Initial draft, i.e. please focus on technical content rather than wordsmithing at this stage.


Author: Richard Ishida

What is character encoding, and why should I care?

First, why should I care?

If you use anything other than the most basic letters and numbers of the English alphabet, people may not be able to read your text unless you say what character encoding you used.

For example, you may intend the text to look like this:

mojibake1.gif

but it may actually display like this:

mojibake2.gif

Not only does inadequate encoding information spoil the readability of displayed text, but it may mean that your data cannot be found by a search, or reliably processed in a number of other ways.

What's a character encoding?

Words and sentences in text are created from characters. Examples of characters include the Latin letter á or the Chinese ideograph 請 or the Devanagari character ह.

Characters are grouped into a character set (also called a repertoire), in which each character is assigned a particular number, called a codepoint. These codepoints are then represented in the computer by one or more bytes.
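As a small illustration, the sketch below looks up the codepoints that the Unicode character set assigns to the three example characters above. It relies on Python's built-in ord() and chr(); the variable names are only illustrative.

```python
# The three example characters, written as escapes so the source file's
# own encoding cannot affect the result.
samples = ["\u00e1", "\u8acb", "\u0939"]  # á, 請, ह

# ord() returns the Unicode codepoint of a single character.
code_points = {ch: ord(ch) for ch in samples}

for ch, cp in code_points.items():
    print(f"{ch}: codepoint {cp} (U+{cp:04X})")

# chr() goes the other way, from codepoint to character.
round_trip = [chr(cp) for cp in code_points.values()]
```

Note that this only shows the numbers assigned to characters. How those numbers are turned into bytes depends on the character encoding.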

Basically, this means that all characters are stored in computers using a code, like the ciphers used in espionage. A character encoding is the key that unlocks (i.e. cracks) the code: a set of mappings between the bytes stored in the computer and the characters they represent. Without the key, the data looks like garbage.

Unfortunately, there are many different character sets and character encodings, i.e. many different ways of mapping between bytes, codepoints and characters.

For example, in the character set called ISO 8859-1 (also known as Latin1) the codepoint value for the letter é is 233. In ISO 8859-5, the same codepoint represents the Cyrillic character щ. These character sets contain fewer than 256 characters and map codepoints to byte values directly, so a codepoint with the value 233 is represented by a single byte with a value of 233. Note, however, that this byte may represent either é or щ, depending on which of the two encodings is used to interpret it.
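The following sketch shows the same single byte being read with each of the two encodings. It uses Python's standard codecs for ISO 8859-1 and ISO 8859-5; the variable names are illustrative.

```python
# One byte with the value 233 (hexadecimal E9).
raw = bytes([233])

# The same byte decoded with two different single-byte encodings.
as_latin1 = raw.decode("iso-8859-1")    # Latin letter
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic letter

print(as_latin1, as_cyrillic)
```

The byte never changes. Only the encoding used to interpret it decides which character appears.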

Other character sets use a more complicated approach. In the Unicode character set, which covers most of the characters you are likely to need in a single set, that same Cyrillic character щ has a codepoint value of 1097. This is too high a number to be represented by a single byte. Most Web pages use the UTF-8 encoding for Unicode text. In that encoding щ is represented by two bytes, but the codepoint value is not simply the value of the two bytes read together: some more complicated decoding is needed. Other Unicode characters map to one, three or four bytes in the UTF-8 encoding.
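To make the "more complicated decoding" concrete, here is a minimal sketch that encodes щ as UTF-8 and then recovers the codepoint by hand from the two bytes. It handles only the two-byte case and is not a general UTF-8 decoder; the variable names are illustrative.

```python
ch = "\u0449"                 # щ, codepoint 1097
encoded = ch.encode("utf-8")  # two bytes: D1 89 in hexadecimal

first, second = encoded       # iterating over bytes yields integers

# A two-byte UTF-8 sequence has the bit layout 110xxxxx 10yyyyyy.
# The codepoint is the five x bits followed by the six y bits.
code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)

# Characters from different ranges need different numbers of bytes:
# a, щ, क and an emoji (U+1F600).
byte_lengths = [len(c.encode("utf-8"))
                for c in ("a", "\u0449", "\u0915", "\U0001F600")]

print(encoded.hex(), code_point, byte_lengths)
```

In practice you would simply call decode("utf-8") and let the library do this work.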

But UTF-8 is only one of the possible ways of encoding Unicode characters. This means that a codepoint in the Unicode character set can be represented by different byte sequences, depending on which encoding was used. The Devanagari character क, with codepoint 2325, can be represented by two bytes (09 15), three bytes (E0 A4 95), or four bytes (00 00 09 15), depending on whether the encoding is UTF-16, UTF-8, or UTF-32 respectively. (The byte values are shown in hexadecimal, and the UTF-16 and UTF-32 sequences are in big-endian byte order.)
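The byte sequences quoted above can be reproduced with a short sketch. It assumes big-endian byte order without a byte-order mark, which is why it uses Python's "utf-16-be" and "utf-32-be" codecs rather than plain "utf-16" and "utf-32".

```python
ka = "\u0915"  # क, codepoint 2325

# Encode the same character with three Unicode encodings.
utf16 = ka.encode("utf-16-be")
utf8 = ka.encode("utf-8")
utf32 = ka.encode("utf-32-be")

# Show each byte sequence in hexadecimal.
for name, data in (("UTF-16", utf16), ("UTF-8", utf8), ("UTF-32", utf32)):
    print(f"{name}: {len(data)} bytes, {data.hex()}")
```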

Most of the time you will not need to understand a character encoding at this level of detail. You will just need to be sure that the application you are working with knows which character encoding is appropriate for the data you are working with, and can handle that encoding.

How do fonts fit into this?

A font is a collection of glyphs (shapes) used to display characters.

Once your application has worked out what characters it is dealing with, it will then look in the font for glyphs in order to display or print those characters. (Of course, if the encoding information was wrong, it will be looking up glyphs for the wrong characters.)

A given font will usually cover a single character set or, in the case of a large character set like Unicode, just a subset of all the characters in the set. When your font doesn't have a glyph for a character, some applications will look for the missing character in other fonts on your system (which means that the glyph will look different from the surrounding text, like a ransom note). Otherwise you will see a square box, a question mark or some other character instead. For example:

mojibake3.gif

How does this affect me?

You need to choose the best encoding for your purposes. Unicode encodings are often a good choice here, since you can use a single encoding to handle pretty much any character you are likely to meet. This greatly simplifies things. Using Unicode throughout your system also removes the need to track and convert various character encodings.
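As a rough illustration of why a Unicode encoding simplifies things, the sketch below tries to encode one string that mixes Latin, Cyrillic and Devanagari characters. The helper can_encode is a hypothetical name introduced here, not a library function.

```python
text = "\u00e9 \u0449 \u0915"  # é щ क in a single string


def can_encode(s, encoding):
    """Return True if every character in s can be stored in this encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# UTF-8 can represent all three characters; each ISO 8859 part covers
# only some of them.
results = {enc: can_encode(text, enc)
           for enc in ("utf-8", "iso-8859-1", "iso-8859-5")}
print(results)
```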

You need to check what encoding your editor or scripts are saving text in, and how to save text in the encoding of your choice. Note, however, that just declaring a different encoding won't change the bytes; you need to save the text in that encoding too.
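The difference between declaring an encoding and actually saving in it can be sketched as follows. The file name sample.txt and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

text = "caf\u00e9"  # café

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Save the text as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Declaring ISO 8859-1 when reading does not change the bytes on disk,
    # so the two UTF-8 bytes of é are misread as two separate characters.
    with open(path, "r", encoding="iso-8859-1") as f:
        misread = f.read()

    # To really switch encodings, save the text again in the new encoding.
    with open(path, "w", encoding="iso-8859-1") as f:
        f.write(text)
    with open(path, "r", encoding="iso-8859-1") as f:
        reread = f.read()

print(misread, reread)
```

Only the second read gives back the original text, because by then the stored bytes and the declared encoding agree.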

You need to find out how to declare the character encoding you used for the document format you are working with. You may also need to check that your server is serving documents with the right HTTP declarations.
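One way to check what a server declares is to look at the charset parameter of the HTTP Content-Type header. The sketch below parses such a header value with Python's standard email.message module; the helper name charset_from_content_type and the sample header values are assumptions made for the example.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lower-cased charset parameter, or None if none is declared."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


declared = charset_from_content_type("text/html; charset=UTF-8")
undeclared = charset_from_content_type("text/html")
print(declared, undeclared)
```

A missing charset parameter means the receiving application has to rely on a declaration inside the document, or guess.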

You need to ensure that the various parts of your system can communicate with each other, understand which character encodings are being used, and support all the necessary encodings and characters.

The links in the next section provide some further reading on these topics.

Further reading