Character Encoding and Unicode

WWW2005 Tutorial: Internationalizing Web Content and Web Technology

10 May 2005, Makuhari, Chiba, Japan

Martin J. Dürst (duerst@it.aoyama.ac.jp)

Department of Integrated Information Technology
College of Science and Engineering
Aoyama Gakuin University
Tokyo/Sagamihara, Japan


© 2005 Martin J. Dürst Aoyama Gakuin University

Character encoding is a very central and basic necessity for internationalization. For computer communication, characters have to be encoded into bytes. There are very simple encodings, but also more complicated ones. Over the years and around the world, a long list of corporate, national, and regional encodings has developed, which cover different sets of characters. The most complicated and the largest character encodings have been developed and are in use in Asia.

Unicode/ISO 10646 is steadily replacing these encodings in more and more places. Unicode is a single, large set of characters including all presently used scripts of the world, with remaining historic scripts being added. Unicode comes with two main encodings, UTF-8 and UTF-16, both very well designed for specific purposes. Because Unicode includes all the characters of all the well-used legacy encodings, mapping from older encodings to Unicode is usually not a problem, although there are some issues where care is necessary in particular for East Asian character encodings.
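As a concrete sketch (Python, with Shift_JIS as an illustrative legacy encoding), the mapping from legacy-encoded bytes to Unicode and on to UTF-8 looks like this:

```python
legacy = "日本語".encode("shift_jis")   # bytes in a legacy encoding
assert len(legacy) == 6                 # two bytes per ideograph in Shift_JIS

text = legacy.decode("shift_jis")       # map to Unicode code points
utf8 = text.encode("utf-8")             # re-encode as UTF-8
assert len(utf8) == 9                   # three bytes per ideograph in UTF-8
```

Note that the byte lengths differ between the legacy encoding and UTF-8 even though the text is identical; only the code points are encoding-independent.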

Character Encoding Basics

In general, character encoding deals with how to denote characters by more basic or primitive elements, such as numbers, bytes (octets) or bits. This includes a number of separable decisions, and a number of abstract layers of representation, which we will look at in greater detail later. For the moment, we will use the term encoding somewhat loosely.

Short History of Character Encodings

General developments:

The history of character encodings contains many ingenious designs, but also quite a few accidental developments. The search for the best encoding was always to some extent in conflict with the need to use a common encoding that met many needs, even if somewhat incompletely.

A brief history of character encoding is provided in Richard Gillam, Unicode Demystified, pp. 25-59.

More and More Characters

One tendency that can clearly be identified in the history of character encodings is the increase in the number of characters in a typical encoding. The small numbers in early encodings were mainly due to the strong limitations of memory and display/printing capabilities of early technology; as these limitations disappeared, encodings grew.

Complicated Encoding Schemes

The Fewer Encodings, the Better

In most cases, a large variety of character encodings developed at first, but they converged sooner or later.

Unicode to the Rescue

Basic idea: A single encoding for the whole world

Originally two separate projects:

  1. Unicode, developed by an industry consortium
  2. ISO/IEC 10646, developed within ISO/IEC

Merged between 1991 and 1993 to avoid two global encodings.

For simplicity, this talk uses the term Unicode for the common product unless it is ambiguous.

What is Unicode?

  1. Two standards, very closely in sync
  2. A set of characters
  3. A table of characters, with a number for each character
  4. Three encodings
  5. A lot of help for working with characters

We use the term Unicode here for what is essentially two standards.

Unicode: Two Standards

Unicode: A Set of Characters

The term character set has been used in various ways in the industry. Here we are talking about a set in the mathematical sense. To avoid confusion, this is often also called a character repertoire.

Han Unification

For CJKV ideographs, a more systematic approach is needed:

Unicode: A Table of Characters

Blocks

Identifying the characters to encode is the first step; the next step is to give each character a number and to arrange the characters in a table.

Code Points and Special Characters

The term code point is also used for numbers in the table that do not denote actual characters.

When emphasizing the numbers or positions in the table, which may or may not be occupied by real characters, the term code point is often used.
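In most programming languages, code points appear directly as integers; a small Python sketch:

```python
# A code point is simply a number in the Unicode table.
assert ord("あ") == 0x3042          # HIRAGANA LETTER A
assert chr(0x3042) == "あ"

# Code points exist whether or not a real character is assigned there:
# U+FFFF is a noncharacter, yet it is still a valid code point.
assert chr(0xFFFF) == "\uffff"
```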

Unicode: Three Encodings

It would have been nice if a single encoding for Unicode addressed all encoding needs. Unfortunately, due to conflicting requirements such as ASCII compatibility, processing efficiency, and storage size, this is not (yet?) the case. Unicode defines three encodings with different code unit sizes for different purposes.

UTF-8 Structure

from      to          usage               byte 1      byte 2      byte 3      byte 4
U+0000    U+007F      US-ASCII            0xxx xxxx   -           -           -
U+0080    U+07FF      Latin, ..., Arabic  110x xxxx   10xx xxxx   -           -
U+0800    U+FFFF      rest of BMP         1110 xxxx   10xx xxxx   10xx xxxx   -
U+10000   U+10FFFF    non-BMP             1111 0xxx   10xx xxxx   10xx xxxx   10xx xxxx

Only shortest encoding allowed.
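The structure above, including the shortest-form rule, can be checked directly in Python (a sketch; the sample characters are illustrative, one from each length class):

```python
# One sample character from each UTF-8 length class in the table above.
for ch in ("A", "é", "あ", "𐍈"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s)")

# Shortest-form rule: an "overlong" two-byte encoding of U+0041
# (0xC1 0x81) must be rejected by a conforming decoder.
try:
    b"\xc1\x81".decode("utf-8")
except UnicodeDecodeError:
    print("overlong sequence rejected")
```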

UTF-8 Properties

See The Properties and Promises of UTF-8
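Two of the properties referred to above, ASCII transparency and self-synchronization, can be sketched like this:

```python
# ASCII bytes (0x00-0x7F) never occur inside a multi-byte UTF-8 sequence,
# so byte-oriented processing of ASCII delimiters remains safe.
data = "a/あ/b".encode("utf-8")
parts = data.split(b"/")
assert parts == [b"a", "あ".encode("utf-8"), b"b"]

# Self-synchronization: a continuation byte (10xx xxxx) can never be
# mistaken for the start of a character.
assert all((b >> 6) == 0b10 for b in "あ".encode("utf-8")[1:])
```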

UTF-8 Usage

UTF-16 Structure

The term code unit was specially created to deal with the fact that encoding characters with UTF-16 is a three-step process, with code units as an additional step between code points and bytes.
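A sketch of the code point → code unit → byte steps for a non-BMP character (the character chosen here is illustrative):

```python
ch = "𐍈"                              # U+10348, outside the BMP
units = ch.encode("utf-16-be")        # big-endian, no BOM
assert len(units) == 4                # two 16-bit code units

high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")

# The two code units form a surrogate pair.
assert 0xD800 <= high <= 0xDBFF      # high (leading) surrogate
assert 0xDC00 <= low <= 0xDFFF       # low (trailing) surrogate

# Reconstructing the code point from the pair:
cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
assert cp == ord(ch)
```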

UTF-16 Properties and Usage

UTF-32 Structure and Properties
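The defining property of UTF-32 is its fixed width; a short sketch:

```python
s = "Aあ𐍈"                     # 1-, 3-, and 4-byte characters in UTF-8...
data = s.encode("utf-32-be")    # ...but every code point is 4 bytes here
assert len(data) == 4 * len(s)

# Direct indexing by code point becomes trivial:
second = int.from_bytes(data[4:8], "big")
assert second == ord("あ")
```

The price for this simplicity is size: for most real-world text, UTF-32 is considerably larger than UTF-8 or UTF-16.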

Unicode: A Lot of Help for Working with Characters

Only in the Unicode Standard, not part of ISO/IEC 10646
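Much of this help takes the form of machine-readable data in the Unicode Character Database; Python's unicodedata module exposes some of it (a sketch):

```python
import unicodedata

# Character names, general categories, and East Asian width are all
# defined by the Unicode Standard and its data files.
assert unicodedata.name("あ") == "HIRAGANA LETTER A"
assert unicodedata.category("A") == "Lu"            # uppercase letter
assert unicodedata.east_asian_width("あ") == "W"    # wide

# Normalization is also specified by Unicode:
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"  # e + ́  -> é
```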

Unicode and Legacy Encodings
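As a sketch of where care is needed (Shift_JIS again serves as an illustrative legacy encoding): mapping legacy data into Unicode is usually lossless, but the reverse direction can fail for characters the legacy encoding lacks.

```python
# Legacy -> Unicode: well-defined for all well-used legacy encodings.
text = b"\x93\xfa\x96\x7b".decode("shift_jis")   # "日本" in Shift_JIS
assert text == "日本"

# Unicode -> legacy: not every Unicode character has a mapping.
try:
    "€".encode("shift_jis")                      # U+20AC not in JIS X 0208
except UnicodeEncodeError:
    print("not representable in Shift_JIS")
```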

Conclusion and Questions

[and a break!]