Web Internationalization
Web Text Processing Methods
May 18, 2004
Martin J. Dürst
Internationalization
Activity Lead, W3C
[this document is at http://www.w3.org/People/D%c3%bcrst/SFC/2004/0418Hagino.html]
Overview
- Localization: Adapting software (or hardware, standards,...) to local
conditions (e.g. 日本語化, Japanese localization)
- Internationalization (国際化, I18N: I, followed by 18
letters, followed by N): Adapting software (or ...) to deal with varied
conditions worldwide
- Main issues:
- Character encoding/input/rendering/processing
- Language information/negotiation
- Conventions for representing numbers/dates/addresses/...
- Style differences
- Cultural differences (privacy, acceptable content,...)
- Data is represented in computers, on storage, and on networks in units
called bytes
- Virtually all computers, storage devices, and networks these days have
8 bits/byte
- Hexadecimal representation:

  decimal |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9 |   10 |   11 |   12 |   13 |   14 |   15
  bits    | 0000 | 0001 | 0010 | 0011 | 0100 | 0101 | 0110 | 0111 | 1000 | 1001 | 1010 | 1011 | 1100 | 1101 | 1110 | 1111
  hex     |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9 |    A |    B |    C |    D |    E |    F
- Two hexadecimal digits are needed for one byte. Notations include
0x4B, \x4B, <4B>,...
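As a quick check of the notation, hexadecimal literals and formatting are available directly in Python (used here only as a calculator; not part of the original slides):

```python
# 0x4B is the hexadecimal literal for decimal 75
value = 0x4B
print(value)            # → 75
print(f"{value:02X}")   # format back as two hex digits → 4B
print(bytes([value]))   # the single byte <4B> is ASCII 'K' → b'K'
```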
- Order of bytes is different with Intel CPUs (little-endian) and with
most others (big-endian)
- Assume a number that is represented in two bytes (C short): decimal
10000, hexadecimal 2710
- big-endian, this is <27><10>
- little-endian, this is <10><27>
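The two byte orders above can be checked with Python's `struct` module (a small sketch, not part of the original slides):

```python
import struct

# decimal 10000 is hexadecimal 0x2710
n = 10000

big = struct.pack(">H", n)     # big-endian unsigned short
little = struct.pack("<H", n)  # little-endian unsigned short

print(big.hex())     # → 2710
print(little.hex())  # → 1027
```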
- Bytes can represent anything (numbers,...)
- Computers also easily manipulate 2 bytes (16 bits), 4 bytes (32 bits),
and some of them 8 bytes (64 bits)
- A convention is needed for how to represent text
- Simplest solution: Use one byte per character
- Allows representing 256 different characters (including control
characters)
- Example: <4B> is 'K', <65> is 'e', <69> is 'i',
<6F> is 'o', thus <4B><65><69><6F>
stands for 'Keio' (A)
- Another example: <D2> is 'K', <85> is 'e', <89>
is 'i', <96> is 'o', thus
<D2><85><89><96> stands for 'Keio' (E)
- Standard for character encoding is indispensable to make storage and
communication work
- The above examples use ASCII (A, most computers) and EBCDIC (E; IBM
mainframes)
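Both byte sequences decode with Python's standard codecs; the slides do not name a specific EBCDIC code page, so `cp500` is an assumption here (one common EBCDIC variant):

```python
ascii_bytes = bytes([0x4B, 0x65, 0x69, 0x6F])   # ASCII bytes
ebcdic_bytes = bytes([0xD2, 0x85, 0x89, 0x96])  # EBCDIC bytes

print(ascii_bytes.decode("ascii"))   # → Keio
print(ebcdic_bytes.decode("cp500"))  # → Keio
```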
- For 'basic English', 7 bits would be enough
- 8 bits are enough for various European regions/languages
- Some languages need more than 256 characters
- One byte is not enough, need more than one byte
- Two general approaches: wide characters (fixed width) and multibyte
(variable width)
- Select characters to encode:
- coverage: languages (言語), scripts
(文字種)
- characters to include (frequency, special characters)
- unify or distinguish (characters are not glyphs)
- result: character repertoire
- Assign each character a number (result: coded character
set)
- Define encoding from number to bit/byte pattern (result: character
encoding)
- Define an identifier:
charset
(HTTP/HTML) or
encoding
(XML) parameter, registered with IANA
(Internet Assigned Numbers Authority)
- Coded Character Sets:
- JIS X 0201: Variant of ASCII (tilde/overbar, backslash/yen), can be
used together with half-width kana
- JIS X 0208: ca. 6000 Kanji, basic Japanese CCS
- JIS X 0212: Additional Kanji, not very widely supported
- JIS X 0213: Finalized recently, also not very widely supported
- Character Encodings:
- Each character encoding uses one byte for JIS X 0201/ASCII, and two
bytes for JIS X 0208
- JIS (charset=iso-2022-jp):
- Switching between JIS X 0201 and JIS X 0208 using escape
sequences
- Used in internet mail (originally, internet mail could not reliably
carry 8-bit data)
- EUC (charset=euc-jp):
- Switching using the high bit of a byte (high bit set => JIS
208)
- Can also include half-width Kana and JIS X 0212, but rarely
used
- Used mainly on Unix systems (EUC: Extended Unix Code)
- Shift-JIS (charset=shift_jis):
- Switching using the high bit of a byte (high bit of first byte
set => JIS 208)
- Complicated transformation of numbers to byte values
- Frequent use of half-width Kana
- Used mainly on PCs (Windows/Mac)
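The three encodings can be compared with Python's standard codecs (a sketch, not part of the original slides):

```python
text = "慶應"

jis = text.encode("iso-2022-jp")
euc = text.encode("euc-jp")
sjis = text.encode("shift_jis")

# iso-2022-jp stays 7-bit: escape sequences switch character sets
print(all(b < 0x80 for b in jis))         # → True
# euc-jp marks JIS X 0208 bytes with the high bit
print(all(b >= 0x80 for b in euc))        # → True
# all three decode back to the same text
print(sjis.decode("shift_jis") == text)   # → True
```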
- Problems:
- Confusion because more than one encoding is used (users see
mojibake/garbage, or programs must rely on autodetection)
- Programming is difficult because strings always have to be scanned
from the start
- National/regional encodings are difficult to combine
- Better idea: Create a new single encoding for the whole world
- Created in the 1990s: ISO/IEC 10646 / Unicode (also adopted as JIS X
0221)
- Code structure: 17 planes of 256×256 code points each
- Character examples:
- 'K': LATIN CAPITAL LETTER K: U+004B
- 'ü': LATIN SMALL LETTER U WITH DIAERESIS: U+00FC
- 'ε': GREEK SMALL LETTER EPSILON: U+03B5
- '慶': U+6176
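The U+XXXX numbers can be looked up directly with Python's `ord` (a quick check, not part of the original slides):

```python
# print each character with its Unicode code point in U+XXXX notation
for ch in "Küε慶":
    print(f"U+{ord(ch):04X}", ch)
```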
Unicode: One Code Table, Several Encodings
- UTF-8: Variable width encoding, very useful
properties
- UTF-16: Mostly one short (16 bits) per character, can represent up to
about 1M (17×64K) characters via surrogate pairs
- UTF-32: One word (32 bits) per character, restricted to 17*64k
characters
- [UCS-4: One word (32 bits) per character]
- [UCS-2: One short (16 bits) per character, only 64K characters]
- UTF-16,... have endianness problems when sent over 8-bit byte networks
=> UTF-16BE, UTF-16LE
- Usage: UTF-8 (and UTF-16) for interchange, UTF-16 (and UTF-8 or UCS-4)
for processing
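The endianness difference is easy to see by encoding one character both ways (a sketch, not part of the original slides):

```python
ch = "慶"  # U+6176

print(ch.encode("utf-16-be").hex())  # → 6176
print(ch.encode("utf-16-le").hex())  # → 7661
print(ch.encode("utf-8").hex())      # → e685b6
```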
UTF-8 Patterns
  bytes | 1st byte  | 2nd byte  | 3rd byte  | 4th byte  | payload bits
  1     | 0xxx xxxx |           |           |           | 7
  2     | 110x xxxx | 10xx xxxx |           |           | 8-11
  3     | 1110 xxxx | 10xx xxxx | 10xx xxxx |           | 12-16
  4     | 1111 0xxx | 10xx xxxx | 10xx xxxx | 10xx xxxx | 17-21
No overlong encodings! (security problems)
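For example, <C0><AF> would be an overlong two-byte form of '/' (U+002F); a conforming decoder such as Python's must reject it (a sketch, not part of the original slides):

```python
overlong = b"\xC0\xAF"  # overlong two-byte encoding of '/' (U+002F)

try:
    overlong.decode("utf-8")
    print("accepted (non-conforming!)")
except UnicodeDecodeError:
    print("rejected")  # → rejected
```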
UTF-8 Example
- '慶': U+6176
- In bits: 0110 0001 0111 0110
- We need 15 payload bits, therefore 3 bytes (the leading zero pads the
value to the 16 payload positions)
- Split into 4 + 6 + 6 payload bits: 0110 | 000101 | 110110
- Fill into the 3-byte pattern:
  1110 xxxx 10xx xxxx 10xx xxxx
  1110 0110 1000 0101 1011 0110
- <E6><85><B6>
- Exercise: Look up your name in Unicode, and convert to UTF-8 (help)
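The pattern table can be turned into a small encoder to check the exercise; a minimal sketch (the function name and structure are mine, not from the slides):

```python
def to_utf8(cp: int) -> bytes:
    """Encode one code point following the UTF-8 bit patterns above."""
    if cp < 0x80:                      # 1 byte, 7 payload bits
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes, 8-11 payload bits
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                   # 3 bytes, 12-16 payload bits
        return bytes([0xE0 | cp >> 12,
                      0x80 | (cp >> 6) & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,     # 4 bytes, 17-21 payload bits
                  0x80 | (cp >> 12) & 0x3F,
                  0x80 | (cp >> 6) & 0x3F,
                  0x80 | cp & 0x3F])

print(to_utf8(0x6176).hex())                    # → e685b6
print(to_utf8(0x6176) == "慶".encode("utf-8"))  # → True
```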
HTML Internationalization
- Started with RFC 2070, later integrated into HTML 4.0
- Logically, a document is in Unicode/ISO 10646
- Implementations have to work "as if they used Unicode
internally"
- Escape syntax based on Unicode: e.g. &#x6176;&#x61C9;
always means 慶應, regardless of the document's encoding
- Physically, many character encodings can be used. The encoding has to
be clearly indicated
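Python's standard library resolves such Unicode-based numeric character references (a quick check, not part of the original slides):

```python
from html import unescape

# hexadecimal numeric character references for U+6176 and U+61C9
print(unescape("&#x6176;&#x61C9;"))  # → 慶應
```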
Indicating Character Encoding on the Web
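The charset/encoding identifier introduced earlier is declared at three layers; typical forms (illustrative values, utf-8 shown as an example):

```
HTTP header:       Content-Type: text/html; charset=utf-8
HTML (in <head>):  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
XML declaration:   <?xml version="1.0" encoding="utf-8"?>
```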
Kanji/Han Unification
- 'Same' Kanji from Japanese, Chinese (PRC and Taiwan), Korean, and
Vietnamese (+Hong Kong,...) usage are unified: they get only one number
- This is similar to unification in other scripts
- This is similar to unification for characters in local encodings (e.g.
JIS)
- Some unification is always necessary (no unification, no
communication)
- The borderline for Kanji unification is less obvious than for small
scripts
Kanji Unification Guidelines
- XYZ-Model (developed in Japan for JIS 208)
- X-Axis: Semantic difference (e.g. 机: Japanese: table;
Chinese: simplification of 機)
- Y-Axis: Abstract shape (Japanese: 字体; e.g. 体
vs. 體, 應 vs. 応)
- Z-Axis: Concrete shape, typeface (Japanese: 字形,
'small' difference; e.g. one or two dots in
辻、逗、通,...)
- Basic unification rule: X-Axis and Z-Axis differences are unified,
Y-Axis differences are not unified
- Additional unification rules:
- Small shape differences are not unified if there is also a semantic
difference (example: 士/土,
大/太/犬)
- No unification if codepoints are separated in original source
standards (e.g. JIS 208/212 for Japan)
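Illustrating the Y-axis rule with Python: the abstract-shape variants cited above each keep their own code point (a quick check, not part of the original slides):

```python
# Y-axis pairs: 体/體 and 応/應 are encoded separately
for ch in "体體応應":
    print(ch, f"U+{ord(ch):04X}")
```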
General Criticism of Unicode
- Unicode and ISO 10646 are different: Not true, they just define
different aspects of the same coded character set. They are code point by
code point identical
- Unicode can only represent 64K characters: Not (anymore) true, over 1
million characters can be represented with UTF-16
- Unicode contains a lot of useless stuff: True, this is for
backwards/round-trip compatibility, but these characters are clearly
marked
- UTF-8 or UTF-16 are as complicated as Shift_JIS: No, much better, based
on experience with Shift_JIS and others.
Kanji-related Criticism of Unicode (history?)
- Unification of Chinese (simplified and traditional), Japanese, and
Korean Kanji is not appropriate (examples: bone,...)
- Duplication of codepoints for well-known Kanji would lead to great
confusion
- Unification has been done very carefully, close to where an
'average person' would see the same character
- The examples that opponents of unification bring up are few, and
mostly inappropriate
- Readability is never a problem, more codepoints would lead to more
unavailable glyphs
- Contrary to Latin/Greek, which split up more than 2000 years ago,
characters and words have been exchanged between Japan and China
until very recently
- Font differences exist, but which font to choose does not only
depend on language
- Cases where different codepoints might be helpful appear only in
descriptions of the problem (meta-level)
- There are not enough Kanji
- Unicode covers more Kanji for Japan than any other standard
- More are being added (V2.0: about 20'000, V3.0: about 27'000)
- Problem is not number, but:
- Flexibility for various purposes (cannot be provided on the
level of character encoding)
- Additional information (pronunciation, meaning, glyphs,...)
Maybe the
Web can offer a solution?