Web Internationalization
Web Text Processing Methods
May 18, 2004
Martin J. Dürst
Internationalization
Activity Lead, W3C
[this document is at http://www.w3.org/People/D%c3%bcrst/SFC/2004/0418Hagino.html]
Overview
- Localization: Adapting software (or hardware, standards,...) to local
conditions (e.g. 日本語化, Japanese localization)
- Internationalization (国際化, I18N: I, followed by 18
letters, followed by N): Adapting software (or ...) to deal with varied
conditions worldwide
- Main issues:
- Character encoding/input/rendering/processing
- Language information/negotiation
- Conventions for representing numbers/dates/addresses/...
- Style differences
- Cultural differences (privacy, acceptable content,...)
- Data is represented in computers, on storage, and on networks in units
called bytes
- Virtually all computers, storage devices, and networks these days have
8 bits/byte
- Hexadecimal representation:

  decimal |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9 |   10 |   11 |   12 |   13 |   14 |   15
  bits    | 0000 | 0001 | 0010 | 0011 | 0100 | 0101 | 0110 | 0111 | 1000 | 1001 | 1010 | 1011 | 1100 | 1101 | 1110 | 1111
  hex     |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9 |    A |    B |    C |    D |    E |    F
- Two hexadecimal digits are needed for one byte. Notations include
0x4B, \x4B, <4B>,...
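As a quick check of the notation, hexadecimal literals and formatting are available directly in Python (used here only as a calculator; not part of the original slides):

```python
# 0x4B is the hexadecimal literal for decimal 75
value = 0x4B
print(value)            # → 75
print(f"{value:02X}")   # format back as two hex digits → 4B
print(bytes([value]))   # the single byte <4B> is ASCII 'K' → b'K'
```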
- Order of bytes is different with Intel CPUs (little-endian) and with
most others (big-endian)
- Assume a number that is represented in two bytes (C short): decimal
10000, hexadecimal 2710
- big-endian, this is <27><10>
- little-endian, this is <10><27>
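The two byte orders above can be checked with Python's `struct` module (a small sketch, not part of the original slides):

```python
import struct

# decimal 10000 is hexadecimal 0x2710
n = 10000

big = struct.pack(">H", n)     # big-endian unsigned short
little = struct.pack("<H", n)  # little-endian unsigned short

print(big.hex())     # → 2710
print(little.hex())  # → 1027
```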
- Bytes can represent anything (numbers,...)
- Computers also easily manipulate 2 bytes (16 bits), 4 bytes (32 bits),
and some of them 8 bytes (64 bits)
- A convention is needed for how to represent text
- Simplest solution: Use one byte per character
- Allows representing 256 different characters (including control
characters)
- Example: <4B> is 'K', <65> is 'e', <69> is 'i',
<6F> is 'o', thus <4B><65><69><6F>
stands for 'Keio' (A)
- Another example: <D2> is 'K', <85> is 'e', <89>
is 'i', <96> is 'o', thus
<D2><85><89><96> stands for 'Keio' (E)
- Standard for character encoding is indispensable to make storage and
communication work
- The above examples use ASCII (A, most computers) and EBCDIC (E; IBM
mainframes)
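Both byte sequences decode with Python's standard codecs; the slides do not name a specific EBCDIC code page, so `cp500` is an assumption here (one common EBCDIC variant):

```python
ascii_bytes = bytes([0x4B, 0x65, 0x69, 0x6F])   # ASCII bytes
ebcdic_bytes = bytes([0xD2, 0x85, 0x89, 0x96])  # EBCDIC bytes

print(ascii_bytes.decode("ascii"))   # → Keio
print(ebcdic_bytes.decode("cp500"))  # → Keio
```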
- For 'basic English', 7 bits would be enough
- 8 bits are enough for various European regions/languages
- Some languages need more than 256 characters
- One byte is not enough, need more than one byte
- Two general approaches: wide characters (fixed width) and multibyte
(variable width)
- Select characters to encode:
- coverage: languages (言語), scripts
(文字種)
- characters to include (frequency, special characters)
- unify or distinguish (characters are not glyphs)
- result: character repertoire
- Assign each character a number (result: coded character
set)
- Define encoding from number to bit/byte pattern (result: character
encoding)
- Define an identifier:
charset
(HTTP/HTML) or
encoding
(XML) parameter, registered with IANA
(Internet Assigned Numbers Authority)
- Coded Character Sets:
- JIS X 0201: Variant of ASCII (tilde/overbar, backslash/yen), can be
used together with half-width kana
- JIS X 0208: ca. 6000 Kanji, basic Japanese CCS
- JIS X 0212: Additional Kanji, not very widely supported
- JIS X 0213: Finalized recently, also not very widely supported
- Character Encodings:
- Each character encoding uses one byte for JIS X 0201/ASCII, and two
bytes for JIS X 0208
- JIS (charset=iso-2022-jp):
- Switching between JIS X 0201 and JIS X 0208 using escape
sequences
- Used in internet mail (originally, internet mail could not reliably
carry 8-bit data)
- EUC (charset=euc-jp):
- Switching using the high bit of a byte (high bit set => JIS
208)
- Can also include half-width Kana and JIS X 0212, but rarely
used
- Used mainly on Unix systems (EUC: Extended Unix Code)
- Shift-JIS (charset=shift_jis):
- Switching using the high bit of a byte (high bit of first byte
set => JIS 208)
- Complicated transformation of numbers to byte values
- Frequent use of half-width Kana
- Used mainly on PCs (Windows/Mac)
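The three encodings can be compared with Python's standard codecs (a sketch, not part of the original slides):

```python
text = "慶應"

jis = text.encode("iso-2022-jp")
euc = text.encode("euc-jp")
sjis = text.encode("shift_jis")

# iso-2022-jp stays 7-bit: escape sequences switch character sets
print(all(b < 0x80 for b in jis))         # → True
# euc-jp marks JIS X 0208 bytes with the high bit
print(all(b >= 0x80 for b in euc))        # → True
# all three decode back to the same text
print(sjis.decode("shift_jis") == text)   # → True
```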
- Problems:
- Confusion because more than one encoding is used (users see
mojibake/garbage, or programs must rely on autodetection)
- Programming is difficult because strings always have to be scanned
from the start
- National/regional encodings are difficult to combine
- Better idea: Create a new single encoding for the whole world
- Created in the 1990s: ISO/IEC 10646 / Unicode (also adopted as JIS X
0221)
- Code structure: 17 planes of 256×256 code points each
- Character examples:
- 'K': LATIN CAPITAL LETTER K: U+004B
- 'ü': LATIN SMALL LETTER U WITH DIAERESIS: U+00FC
- 'ε': GREEK SMALL LETTER EPSILON: U+03B5
- '慶': U+6176
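The U+XXXX numbers can be looked up directly with Python's `ord` (a quick check, not part of the original slides):

```python
# print each character with its Unicode code point in U+XXXX notation
for ch in "Küε慶":
    print(f"U+{ord(ch):04X}", ch)
```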
Unicode: One Code Table, Several Encodings
- UTF-8: Variable width encoding, very useful
properties
- UTF-16: Mostly one short (16 bits) per character, can represent up to
about 1M (17×64K) characters via surrogate pairs
- UTF-32: One word (32 bits) per character, restricted to 17*64k
characters
- [UCS-4: One word (32 bits) per character]
- [UCS-2: One short (16 bits) per character, only 64K characters]
- UTF-16,... have endianness problems when sent over 8-bit byte networks
=> UTF-16BE, UTF-16LE
- Usage: UTF-8 (and UTF-16) for interchange, UTF-16 (and UTF-8 or UCS-4)
for processing
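The endianness difference is easy to see by encoding one character both ways (a sketch, not part of the original slides):

```python
ch = "慶"  # U+6176

print(ch.encode("utf-16-be").hex())  # → 6176
print(ch.encode("utf-16-le").hex())  # → 7661
print(ch.encode("utf-8").hex())      # → e685b6
```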
UTF-8 Patterns
  bytes | 1st byte  | 2nd byte  | 3rd byte  | 4th byte  | payload bits
  1     | 0xxx xxxx |           |           |           | 7
  2     | 110x xxxx | 10xx xxxx |           |           | 8-11
  3     | 1110 xxxx | 10xx xxxx | 10xx xxxx |           | 12-16
  4     | 1111 0xxx | 10xx xxxx | 10xx xxxx | 10xx xxxx | 17-21
No overlong encodings! (security problems)
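For example, <C0><AF> would be an overlong two-byte form of '/' (U+002F); a conforming decoder such as Python's must reject it (a sketch, not part of the original slides):

```python
overlong = b"\xC0\xAF"  # overlong two-byte encoding of '/' (U+002F)

try:
    overlong.decode("utf-8")
    print("accepted (non-conforming!)")
except UnicodeDecodeError:
    print("rejected")  # → rejected
```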
UTF-8 Example
- '慶': U+6176
- In bits: 0110 0001 0111 0110
- We need 15 payload bits, therefore 3 bytes (the leading zero pads the
value to the 16 payload positions)
- Split into 4 + 6 + 6 payload bits: 0110 | 000101 | 110110
- Fill into the 3-byte pattern:
  1110 xxxx 10xx xxxx 10xx xxxx
  1110 0110 1000 0101 1011 0110
- <E6><85><B6>
- Exercise: Look up your name in Unicode, and convert to UTF-8 (help)
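The pattern table can be turned into a small encoder to check the exercise; a minimal sketch (the function name and structure are mine, not from the slides):

```python
def to_utf8(cp: int) -> bytes:
    """Encode one code point following the UTF-8 bit patterns above."""
    if cp < 0x80:                      # 1 byte, 7 payload bits
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes, 8-11 payload bits
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                   # 3 bytes, 12-16 payload bits
        return bytes([0xE0 | cp >> 12,
                      0x80 | (cp >> 6) & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,     # 4 bytes, 17-21 payload bits
                  0x80 | (cp >> 12) & 0x3F,
                  0x80 | (cp >> 6) & 0x3F,
                  0x80 | cp & 0x3F])

print(to_utf8(0x6176).hex())                    # → e685b6
print(to_utf8(0x6176) == "慶".encode("utf-8"))  # → True
```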
HTML Internationalization
- Started with RFC 2070, later integrated into HTML 4.0
- Logically, a document is in Unicode/ISO 10646
- Implementations have to work "as if they used Unicode
internally"
- Escape syntax based on Unicode: e.g. &#x6176;&#x61C9;
always means 慶應, regardless of the document's encoding
- Physically, many character encodings can be used. The encoding has to
be clearly indicated
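Python's standard library resolves such Unicode-based numeric character references (a quick check, not part of the original slides):

```python
from html import unescape

# hexadecimal numeric character references for U+6176 and U+61C9
print(unescape("&#x6176;&#x61C9;"))  # → 慶應
```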
Indicating Character Encoding on the Web
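The charset/encoding identifier introduced earlier is declared at three layers; typical forms (illustrative values, utf-8 shown as an example):

```
HTTP header:       Content-Type: text/html; charset=utf-8
HTML (in <head>):  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
XML declaration:   <?xml version="1.0" encoding="utf-8"?>
```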
Kanji/Han Unification
- 'Same' Kanji from Japanese, Chinese (PRC and Taiwan), Korean, and
Vietnamese (+Hong Kong,...) usage are unified: they get only one number
- This is similar to unification in other scripts
- This is similar to unification for characters in local encodings (e.g.
JIS)
- Some unification is always necessary (no unification, no
communication)
- The borderline for Kanji unification is less obvious than for small
scripts
Kanji Unification Guidelines
- XYZ-Model (developed in Japan for JIS 208)
- X-Axis: Semantic difference (e.g. 机: Japanese: table;
Chinese: simplification of 機)
- Y-Axis: Abstract shape (Japanese: 字体; e.g. 体
vs. 體, 應 vs. 応)
- Z-Axis: Concrete shape, typeface (Japanese: 字形,
'small' difference; e.g. one or two dots in
辻、逗、通,...)
- Basic unification rule: X-Axis and Z-Axis differences are unified,
Y-Axis differences are not unified
- Additional unification rules:
- Small shape differences are not unified if there is also a semantic
difference (example: 士/土,
大/太/犬)
- No unification if codepoints are separated in original source
standards (e.g. JIS 208/212 for Japan)
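Illustrating the Y-axis rule with Python: the abstract-shape variants cited above each keep their own code point (a quick check, not part of the original slides):

```python
# Y-axis pairs: 体/體 and 応/應 are encoded separately
for ch in "体體応應":
    print(ch, f"U+{ord(ch):04X}")
```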
General Criticism of Unicode
- Unicode and ISO 10646 are different: Not true, they just define
different aspects of the same coded character set. They are code point by
code point identical
- Unicode can only represent 64K characters: Not (anymore) true, over 1
million characters can be represented with UTF-16
- Unicode contains a lot of useless stuff: True, this is for
backwards/round-trip compatibility, but these characters are clearly
marked
- UTF-8 or UTF-16 are as complicated as Shift_JIS: No, much better, based
on experience with Shift_JIS and others.
Kanji-related Criticism of Unicode (history?)
- Unification of Chinese (simplified and traditional), Japanese, and
Korean Kanji is not appropriate (examples: bone,...)
- Duplication of codepoints for well-known Kanji would lead to great
confusion
- Unification has been done very carefully, close to where an
'average person' would see the same character
- The examples that opponents of unification bring up are few, and
mostly inappropriate
- Readability is never a problem, more codepoints would lead to more
unavailable glyphs
- Contrary to Latin/Greek, which split up more than 2000 years ago,
characters and words have been exchanged between Japan and China
until very recently
- Font differences exist, but which font to choose does not only
depend on language
- Cases where different codepoints might be helpful appear only in
descriptions of the problem (meta-level)
- There are not enough Kanji
- Unicode covers more Kanji for Japan than any other standard
- More are being added (V2.0: about 20'000, V3.0: about 27'000)
- Problem is not number, but:
- Flexibility for various purposes (cannot be provided on the
level of character encoding)
- Additional information (pronunciation, meaning, glyphs,...)
Maybe the
Web can offer a solution?