Many character encoding standards, such as the ISO 8859 series, use a single byte for each character, and the encoding is straightforwardly related to the scalar position of the characters in the coded character set. For example, the letter A in the ISO 8859-1 coded character set is at position 65 (counting from zero), and is encoded for representation in the computer using a byte with the value 65. For ISO 8859-1 this never changes.
For Unicode, however, things are not so straightforward. Although the code point for the letter à in the Unicode coded character set is always 224 (in decimal, or U+00E0), it may be represented in the computer by two bytes, for example C3 A0 in the UTF-8 encoding. In other words, there isn't a trivial, one-to-one mapping between the coded character set value and the encoded value for this character.
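The contrast between the two cases can be sketched in a few lines of Python. This is only an illustration, assuming Python 3, where `ord` returns a character's code point and `str.encode` returns its byte representation; the variable names are made up for the example.

```python
# Code point versus encoded bytes (illustrative sketch, assumes Python 3).
letter_a = "A"
a_grave = "\u00e0"  # the letter à

# In ISO 8859-1 the byte value equals the position in the coded character set.
a_latin1 = letter_a.encode("iso-8859-1")
print(ord(letter_a), list(a_latin1))

# The Unicode code point of à is a single number...
code_point = ord(a_grave)
# ...but UTF-8 represents it with two bytes, neither of which equals that number.
a_grave_utf8 = a_grave.encode("utf-8")
print(code_point, list(a_grave_utf8))
```

The first line printed pairs the code point of A with an identical byte value; the second shows one code point for à alongside two different byte values.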
In addition, in Unicode there are a number of ways of encoding the same character. For example, the letter à can be represented by two bytes in one encoding and four bytes in another. The encoding forms that can be used with Unicode are called UTF-8, UTF-16, and UTF-32.
UTF-8 uses 1 byte to represent characters in the ASCII set, 2 bytes for characters in several more alphabetic blocks, and 3 bytes for the rest of the Basic Multilingual Plane (BMP). Supplementary characters use 4 bytes.
UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.
UTF-32 uses 4 bytes for all characters.
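The size rules above can be written down as a small Python sketch and checked against Python's built-in codecs. The function names `expected_sizes` and `actual_sizes` are hypothetical helpers introduced for this example. The thresholds 0x80, 0x800, and 0x10000 are the code point values at which UTF-8 moves to 2, 3, and 4 bytes.

```python
def expected_sizes(code_point: int) -> dict:
    """Bytes per character in each encoding form, following the rules above."""
    if code_point < 0x80:        # ASCII
        utf8 = 1
    elif code_point < 0x800:     # several more alphabetic blocks
        utf8 = 2
    elif code_point < 0x10000:   # rest of the BMP
        utf8 = 3
    else:                        # supplementary characters
        utf8 = 4
    utf16 = 2 if code_point < 0x10000 else 4
    return {"utf-8": utf8, "utf-16": utf16, "utf-32": 4}


def actual_sizes(char: str) -> dict:
    """Measure the encoded length of a single character with Python's codecs."""
    # The "-be" codecs are used so that no byte order mark is prepended.
    return {
        "utf-8": len(char.encode("utf-8")),
        "utf-16": len(char.encode("utf-16-be")),
        "utf-32": len(char.encode("utf-32-be")),
    }


# One character from each size class: ASCII, Hebrew, CJK in the BMP, supplementary.
for ch in ["A", "\u05d0", "\u597d", "\U000233b4"]:
    assert expected_sizes(ord(ch)) == actual_sizes(ch)
    print(f"U+{ord(ch):04X}", actual_sizes(ch))
```

The assertion inside the loop only confirms that the rule-based prediction agrees with what the codecs produce for these four sample characters.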
In the following chart, the first line of numbers represents the position of each character in the Unicode coded character set. The other lines show the byte values, in hexadecimal, used to represent that character in a particular character encoding. The UTF-16 and UTF-32 bytes are shown in big-endian order.
| | A | א | 好 | 𣎴 |
|---|---|---|---|---|
| Code point | U+0041 | U+05D0 | U+597D | U+233B4 |
| UTF-8 | 41 | D7 90 | E5 A5 BD | F0 A3 8E B4 |
| UTF-16 | 00 41 | 05 D0 | 59 7D | D8 4C DF B4 |
| UTF-32 | 00 00 00 41 | 00 00 05 D0 | 00 00 59 7D | 00 02 33 B4 |
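The byte values in the chart can be reproduced with a short Python sketch. It assumes the big-endian codecs `utf-16-be` and `utf-32-be`, which match the byte order shown in the chart and add no byte order mark. The helpers `hex_bytes` and `surrogate_pair` are hypothetical names introduced here; the second one shows the arithmetic behind the 4-byte UTF-16 form of U+233B4.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs, e.g. 'D7 90'."""
    return " ".join(f"{b:02X}" for b in data)


def surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# The four characters from the chart, written as escapes to avoid font issues.
chars = ["A", "\u05d0", "\u597d", "\U000233b4"]
codecs = [("UTF-8", "utf-8"), ("UTF-16", "utf-16-be"), ("UTF-32", "utf-32-be")]

# One list of hex strings per encoding form, in the same column order as the chart.
rows = {name: [hex_bytes(c.encode(codec)) for c in chars] for name, codec in codecs}
for name, cells in rows.items():
    print(name, "|", " | ".join(cells))

# The UTF-16 cell for U+233B4 is the surrogate pair written out as four bytes.
high, low = surrogate_pair(0x233B4)
print(f"{high:04X} {low:04X}")
```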
Version: $Id: Slide0070.html,v 1.2 2006/02/02 07:54:31 rishida Exp $