
One character set, multiple encodings

Many character encoding standards, such as the ISO 8859 series, use a single byte for each character, and the encoded value is directly related to the position of the character in the coded character set. For example, the letter A in the ISO 8859-1 coded character set is at position 65 (counting from zero), and is represented in the computer by a byte with the value 65. For ISO 8859-1 this never changes.
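As a small illustration of that one-to-one relationship, the following Python sketch compares the position of A with its encoded byte. Python and its built-in `iso-8859-1` codec are used here only for illustration; note that this codec assigns a character to all 256 byte values, which is an assumption of the sketch rather than something the notes state.

```python
# In ISO 8859-1 the encoded byte value equals the character's position
# in the coded character set.
code_point = ord("A")                   # position in the coded character set
encoded = "A".encode("iso-8859-1")      # bytes stored in the computer

print(code_point)       # 65
print(list(encoded))    # [65]

# Check every byte value: decoding byte n gives the character at
# position n, and encoding that character gives byte n back.
all_match = all(
    ord(bytes([n]).decode("iso-8859-1")) == n
    and chr(n).encode("iso-8859-1") == bytes([n])
    for n in range(256)
)
print(all_match)        # True
```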

For Unicode, however, things are not so straightforward. Although the code point for the letter à in the Unicode coded character set is always 224 in decimal (U+00E0), in UTF-8, for example, it is represented in the computer by two bytes, C3 A0. In other words, there isn't a trivial, one-to-one mapping between the coded character set value and the encoded value for this character.

In addition, Unicode provides several ways of encoding the same character. The encoding forms that can be used with Unicode are called UTF-8, UTF-16, and UTF-32. For example, the letter à is represented by two bytes in UTF-8 or UTF-16, and by four bytes in UTF-32.
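The difference between the encoding forms can be seen by encoding the letter à with each of them. This is a minimal Python sketch; the big-endian codec names (`utf-16-be`, `utf-32-be`) are chosen here so that no byte order mark is added to the output, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
ch = "\u00e0"  # à, LATIN SMALL LETTER A WITH GRAVE

# Encode the same character with each Unicode encoding form.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    # Show the number of bytes and their hexadecimal values.
    print(name, len(data), data.hex(" ").upper())

# utf-8 gives 2 bytes (C3 A0), utf-16-be gives 2 bytes (00 E0),
# and utf-32-be gives 4 bytes (00 00 00 E0).
```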

UTF-8 uses one byte for characters in the ASCII set (U+0000 to U+007F), two bytes for characters in several more alphabetic blocks (U+0080 to U+07FF), and three bytes for the rest of the Basic Multilingual Plane (BMP, up to U+FFFF). Supplementary characters (U+10000 and above) use four bytes.

UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.

UTF-32 uses 4 bytes for all characters.
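The byte counts described above depend only on the range in which a code point falls. The sketch below expresses those rules as three small Python functions and cross-checks them against Python's own codecs. The function names `utf8_len`, `utf16_len`, and `utf32_len` are hypothetical helpers invented for this example, and surrogate code points (U+D800 to U+DFFF), which cannot be encoded on their own, are not handled.

```python
def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 needs for one code point (hypothetical helper)."""
    if cp <= 0x7F:        # ASCII
        return 1
    if cp <= 0x7FF:       # several more alphabetic blocks
        return 2
    if cp <= 0xFFFF:      # rest of the BMP
        return 3
    return 4              # supplementary characters


def utf16_len(cp: int) -> int:
    """Two bytes in the BMP, four bytes (a surrogate pair) above it."""
    return 2 if cp <= 0xFFFF else 4


def utf32_len(cp: int) -> int:
    """UTF-32 always uses four bytes."""
    return 4


samples = [0x0041, 0x05D0, 0x597D, 0x233B4]
widths = {cp: (utf8_len(cp), utf16_len(cp), utf32_len(cp)) for cp in samples}

for cp, (w8, w16, w32) in widths.items():
    ch = chr(cp)
    # Cross-check the rules against the built-in codecs.
    assert w8 == len(ch.encode("utf-8"))
    assert w16 == len(ch.encode("utf-16-be"))
    assert w32 == len(ch.encode("utf-32-be"))
    print(f"U+{cp:04X}: UTF-8 {w8}, UTF-16 {w16}, UTF-32 {w32}")
```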

In the following chart, the first line of numbers represents the position of the characters in the Unicode coded character set. The other lines show the byte values, in hexadecimal, used to represent each character in a particular character encoding. The UTF-16 and UTF-32 bytes are shown in big-endian order.

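| | A | א | 好 | 𣎴 (Chinese ideograph meaning 'stump of tree') |
|---|---|---|---|---|
| Code point | U+0041 | U+05D0 | U+597D | U+233B4 |
| UTF-8 | 41 | D7 90 | E5 A5 BD | F0 A3 8E B4 |
| UTF-16 | 00 41 | 05 D0 | 59 7D | D8 4C DF B4 |
| UTF-32 | 00 00 00 41 | 00 00 05 D0 | 00 00 59 7D | 00 02 33 B4 |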
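The values in the chart can be reproduced with a few lines of Python. This is only a sketch: it relies on Python's big-endian codecs to match the byte order shown in the chart, and `surrogate_pair` is a hypothetical helper written for this example to show how the two UTF-16 code units for U+233B4 are derived.

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    offset = cp - 0x10000               # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


codecs_used = {"UTF-8": "utf-8", "UTF-16": "utf-16-be", "UTF-32": "utf-32-be"}
chars = ["\u0041", "\u05D0", "\u597D", "\U000233B4"]

# Build one chart column per character, keyed by its code point.
chart = {}
for ch in chars:
    label = f"U+{ord(ch):04X}"
    chart[label] = {
        form: ch.encode(codec).hex(" ").upper()
        for form, codec in codecs_used.items()
    }

for label, row in chart.items():
    print(label, row)

# The UTF-16 bytes for the supplementary character are its surrogate pair.
high, low = surrogate_pair(0x233B4)
print(f"{high:04X} {low:04X}")          # D84C DFB4
```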

Version: $Id: Slide0070.html,v 1.2 2006/02/02 07:54:31 rishida Exp $