XML Japanese Profile addresses the issues of using Japanese characters in XML documents. In particular, ambiguities in converting existing Japanese charsets to Unicode are clearly pointed out.

Status of this document

By publishing this document, W3C acknowledges that Panasonic, Toshiba, Academia Sinica, Alis Technologies, and Sun Microsystems have made a formal submission to W3C for discussion. Publication of this document by W3C indicates no endorsement of its content by W3C, nor that W3C has, is, or will be allocating any resources to the issues addressed by it. This document is not the product of a chartered W3C group, but is published as potential input to the W3C Process. Publication of acknowledged Member Submissions at the W3C site is one of the benefits of W3C Membership. Please consult the requirements associated with Member Submissions of section 3.3 of the W3C Patent Policy. Please consult the complete list of acknowledged W3C Member Submissions. See also Submission request and Team Comment.

A list of current W3C technical documents can be found at the Technical Reports page.

This specification is technically identical to [JIS TR X 0015]. The XML SWG intends to keep this document and [JIS TR X 0015] in sync.

2. Normative References

Each JIS listed below refers to its latest version. If an RFC is superseded by another RFC, the new RFC is referenced.

IANA Official Names for Character Sets: IANA (Internet Assigned Numbers Authority). Official Names for Character Sets, ed. Keld Simonsen et al. http://www.iana.org/assignments/character-sets
IETF RFC 1468: IETF (Internet Engineering Task Force). RFC 1468: Japanese character encoding for Internet messages, ed. J. Murai, M. Crispin, and E. van der Poel. 1993.
IETF RFC 2045: IETF (Internet Engineering Task Force). RFC 2045: Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies, ed. N. Freed and N. Borenstein. 1996.
IETF RFC 2046: IETF (Internet Engineering Task Force). RFC 2046: Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types, ed. N. Freed and N. Borenstein. 1996.
IETF RFC 2130: IETF (Internet Engineering Task Force). RFC 2130: The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996, C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R. Atkinson, M. Crispin, P. Svanberg. 1997.
IETF RFC 2277: IETF (Internet Engineering Task Force). RFC 2277: IETF policy on character sets and languages, ed. H. Alvestrand. 1998.
IETF RFC 2278: IETF (Internet Engineering Task Force). RFC 2278: IANA Charset Registration Procedures, ed. N. Freed and J. Postel. 1998.
IETF RFC 2279: IETF (Internet Engineering Task Force). RFC 2279: UTF-8, a transformation format of ISO 10646, ed. F. Yergeau. 1998.
IETF RFC 2396: IETF (Internet Engineering Task Force). RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax, ed. T. Berners-Lee, R. Fielding, and L. Masinter. 1998.
IETF RFC 2616: IETF (Internet Engineering Task Force). RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1, ed. R. Fielding, J. Gettys, J. Mogul, H. Frystyk, and T. Berners-Lee. 1999.
IETF RFC 2781: IETF (Internet Engineering Task Force). RFC 2781: UTF-16, an encoding of ISO 10646, ed. P. Hoffman and F. Yergeau. 2000.
IETF RFC 3023: IETF (Internet Engineering Task Force). RFC 3023: XML media types, ed. M. Murata, S. St.Laurent, and D. Kohn. 2001.
IETF RFC 3066: IETF (Internet Engineering Task Force). RFC 3066: Tags for the Identification of Languages, ed. H. Alvestrand. 2001.
ISO/IEC 646: International Organization for Standardization. Information technology — ISO 7-bit coded character set for information interchange
ISO/IEC10646 (all parts): International Organization for Standardization. Information technology — Universal Multiple-Octet Coded Character Set (UCS)
JIS TR X 0015: Japanese Industrial Standards Committee. XML Japanese Profile
JIS X 0201: Japanese Industrial Standards Committee. 7-bit and 8-bit coded character sets for information interchange
JIS X 0208: Japanese Industrial Standards Committee. 7-bit and 8-bit double byte coded KANJI sets for information interchange
JIS X 0212: Japanese Industrial Standards Committee. Code of the supplementary Japanese graphic character set for information interchange
JIS X 0221-1: Japanese Industrial Standards Committee. Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane
JIS X 4159: Japanese Industrial Standards Committee. Extensible Markup Language (XML) 1.0
TOG/JVC ucs-conv: TOG/JVC CDE/Motif Technical WG. Problems and Solutions for Unicode and User/Vendor Defined Characters, OSF Japanese Vendors Council (OSF/JVC), http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html , 1996.
Unicode 3.2: The Unicode Consortium. The Unicode Standard, Version 3.2, defined by: The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #28: Unicode 3.1 (http://www.unicode.org/reports/tr28 ).
UNIX International: UNIX International. UNIX SYSTEM V Release 4 Nihongo Kankyou Kyoutuu Kiyaku (Common specifications for the Japanese computing environment), Toppan, 1992.
US-ASCII: American National Standards Institute, Coded character set — 7-bit American National Standard Code for Information Interchange, ANSI X3.4-1986.
XML: W3C (World Wide Web Consortium). XML (Extensible Markup Language) 1.0 (Second Edition), W3C Recommendation, http://www.w3.org/TR/REC-xml , 1998.

NOTE: [JIS X 0221-1] corresponds to Part 1 of [ISO/IEC10646 (all parts)].

NOTE: [JIS X 4159] corresponds to [XML].

5. Character Encoding Schemes

5.1 UTF-16

This technical report recommends the use of UTF-16 (or UTF-8, as described in 5.2 UTF-8).

NOTE: UTF-16 is a CES that represents all characters contained in the first 17 planes specified by [ISO/IEC10646 (all parts)] or [Unicode 3.2]. UTF-16 is specified by [Unicode 3.2]. Also refer to [IETF RFC 2781] for an explanation of UTF-16.

NOTE: [XML] requires any XML processor to read entities in UTF-16 (or UTF-8, as described in 5.2 UTF-8).

NOTE: The charset name for UTF-16 is "utf-16".

NOTE: UTF-16 as specified in [XML] is the charset "utf-16" rather than the charsets "utf-16le" and "utf-16be", where "utf-16be" is Big Endian UTF-16 without the BOM and "utf-16le" is Little Endian UTF-16 without the BOM. These three charsets are registered at [IANA Official Names for Character Sets].
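
As a non-normative illustration, the following Python sketch shows how the three charsets differ in their use of the BOM. It assumes that Python's built-in codecs "utf-16", "utf-16-be", and "utf-16-le" correspond to the charsets "utf-16", "utf-16be", and "utf-16le"; the helper name has_bom is hypothetical and not part of this technical report.

```python
import codecs


def has_bom(octets: bytes) -> bool:
    """Return True if the octets begin with a UTF-16 byte order mark."""
    return octets.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))


text = "日本語"

# Python's "utf-16" codec prepends a BOM when encoding (host byte order).
with_bom = text.encode("utf-16")

# The "utf-16-be" and "utf-16-le" codecs fix the byte order and write no BOM.
big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")

# A decoder for "utf-16" uses the BOM to select the byte order.
decoded = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")
```

Under these assumptions, only the octets produced for "utf-16" carry a BOM, which is what allows a receiver to determine the byte order without outside information.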

5.2 UTF-8

This technical report recommends the use of UTF-8 (or UTF-16, as described in 5.1 UTF-16).

NOTE: UTF-8 is a CES which covers all of the characters specified by [ISO/IEC10646 (all parts)] or [Unicode 3.2]. UTF-8 is specified by [Unicode 3.2]. See also [IETF RFC 2279] for another explanation of UTF-8.

NOTE: [XML] requires any XML processor to read entities in UTF-8.

NOTE: The charset name for UTF-8 is "utf-8".

NOTE: Charset name "utf-8" is registered at [IANA Official Names for Character Sets].
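
As a non-normative illustration, the sketch below reads a UTF-8 entity whose element names and content are Japanese. It assumes Python's standard xml.etree.ElementTree module as the XML processor; the element names 文書 and 題名 are made up for this example.

```python
import xml.etree.ElementTree as ET

# An XML entity with Japanese element names and content, declared as UTF-8.
document = '<?xml version="1.0" encoding="utf-8"?><文書><題名>日本語</題名></文書>'
octets = document.encode("utf-8")

# Each of these three Japanese characters occupies three octets in UTF-8.
title_octets = "日本語".encode("utf-8")

# The processor reads the octets and reports names and content as characters.
root = ET.fromstring(octets)
title = root.find("題名").text
```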

5.3 Shift-JIS

This technical report and [XML] treat Shift-JIS, an ordinary Japanese charset, as a CES that represents Japanese characters and [US-ASCII] characters in [ISO/IEC10646 (all parts)] or [Unicode 3.2]. For full interoperability in the Internet, migration from Shift-JIS to UTF-8/UTF-16 is highly recommended.

NOTE: The CCS for [XML] is always [ISO/IEC10646 (all parts)] or [Unicode 3.2]. Thus, Shift-JIS cannot be handled as a CES of some other CCS.

NOTE: Japanese characters here include Halfwidth Katakana, the yen sign and the overline described by [JIS X 0201].

There are four major conversion tables from Shift-JIS to [ISO/IEC10646 (all parts)] or [Unicode 3.2]. This technical report names them x-sjis-unicode-0_9, x-sjis-jisx0221-1995, windows-31J, and x-sjis-jdk1_1_7, respectively. These conversion tables are not identical to each other.

NOTE: Other conversion tables are also in use.

X-sjis-unicode-0_9 is published by Unicode Consortium as the conversion table for Shift-JIS (version 0.9). X-sjis-jisx0221-1995 is a conversion table derived from the conversion table in [JIS X 0221-1] Appendix 2 (including Appendix 2.2) for the shift encoding which is specified in [JIS X 0208] Appendix 1. Windows-31J is published by Unicode Consortium as the conversion table for Microsoft CP932. X-sjis-jdk1_1_7 is the conversion table used for the encoding named SJIS in JDK 1.1.7.

NOTE: Conversion tables from Unicode Consortium are available at http://www.unicode.org/Public/MAPPINGS/ .

NOTE: Use of Shift-JIS cannot provide interoperability in information interchange, since any of the above-mentioned conversion tables or some other conversion tables might be used.

NOTE: It is generally assumed that Shift-JIS uses [JIS X 0201] rather than [US-ASCII]. This assumption applies to all of these conversion tables except for x-sjis-jdk1_1_7. [IETF RFC 2046] deprecates the use of any national or application-oriented version of [ISO/IEC 646] in Internet mail, except when it is completely identical to US-ASCII.

NOTE: Although Microsoft CP932 is said to be based on [JIS X 0201], the conversion table for Microsoft CP932 maps 0x5C to U+005C (REVERSE SOLIDUS).

NOTE: Other than Japanese characters, CP932 contains NEC special characters, NEC-selected IBM extended characters, IBM extension characters, and user-defined characters.

NOTE: Appendix C, Ambiguities in conversion from Shift-JIS to Unicode, provides further information on the yen sign.
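
As a non-normative illustration of why the choice of conversion table matters, the sketch below decodes the same Shift-JIS octets with two of Python's built-in codecs. This is an assumption-laden example: Python's "shift_jis" and "cp932" codecs are not the conversion tables named in this technical report, although "cp932" is assumed here to behave like windows-31J for the octets shown. The helper names decode_with and code_points are hypothetical.

```python
def decode_with(octets: bytes, codec_names=("shift_jis", "cp932")) -> dict:
    """Decode the same Shift-JIS octets with each codec and collect the results."""
    return {name: octets.decode(name) for name in codec_names}


def code_points(results: dict) -> dict:
    """Render each single decoded character as a U+XXXX code point."""
    return {name: f"U+{ord(ch):04X}" for name, ch in results.items()}


# 0x8160 (WAVE DASH) and 0x817C (MINUS SIGN) are octets on which tables disagree.
wave_dash = code_points(decode_with(b"\x81\x60"))
minus_sign = code_points(decode_with(b"\x81\x7c"))
```

Two XML processors that make different choices here would report different characters for the same octets, which is the interoperability problem described above.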

5.4 Japanese EUC (Compressed)

This technical report and [XML] treat Japanese EUC (Compressed) [UNIX International], an ordinary Japanese charset in Japan, as a CES that represents Japanese characters and [US-ASCII] characters in [ISO/IEC10646 (all parts)] or [Unicode 3.2]. For full interoperability in the Internet, migration from Japanese EUC (Compressed) to UTF-8/UTF-16 is highly recommended.

NOTE: The coded character set for [XML] is always [ISO/IEC10646 (all parts)] or [Unicode 3.2]. Thus, Japanese EUC (Compressed) cannot be handled as a CES of some other CCS.

There are five major conversion tables from Japanese EUC (Compressed) to [ISO/IEC10646 (all parts)] or [Unicode 3.2]. This technical report names them x-eucjp-unicode-0_9, x-eucjp-jisx0221-1995, x-eucjp-open-19970715-ms, x-eucjp-open-19970715-0201 and x-eucjp-open-19970715-ascii, respectively. These conversion tables are not identical to each other.

X-eucjp-unicode-0_9 is derived from the conversion table published by Unicode Consortium for conversion from [JIS X 0208] to [ISO/IEC10646 (all parts)] or [Unicode 3.2]. X-eucjp-jisx0221-1995 is derived from the one in Appendix 2 (including Appendix 2.2) of [JIS X 0221-1]. X-eucjp-open-19970715-ms, x-eucjp-open-19970715-0201 and x-eucjp-open-19970715-ascii are conversion tables defined by OSF Japanese Vendors Council (OSF/JVC) and they are named eucJP-ms, eucJP-0201 and eucJP-ascii by Appendix of [TOG/JVC ucs-conv].

NOTE: Use of Japanese EUC cannot provide interoperability in information interchange, since any of these conversion tables or some other conversion table might be used.

NOTE: X-eucjp-open-19970715-ms, x-eucjp-open-19970715-0201 and x-eucjp-open-19970715-ascii contain NEC special characters, NEC-selected IBM extended characters, IBM extension characters, and user-defined characters.

5.5 ISO-2022-JP

This technical report and [XML] assume ISO-2022-JP [IETF RFC 1468], an ordinary Japanese charset in Japan, as a CES that represents Japanese characters and [US-ASCII] characters in [ISO/IEC10646 (all parts)] or [Unicode 3.2]. For full interoperability in the Internet, migration from ISO-2022-JP to UTF-8/UTF-16 is highly recommended.

NOTE: The coded character set for [XML] is always [ISO/IEC10646 (all parts)] or [Unicode 3.2]. Thus, ISO-2022-JP cannot be handled as a CES of some other CCS.

This technical report defines conversion from ISO-2022-JP to [ISO/IEC10646 (all parts)] or [Unicode 3.2] via Shift-JIS or Japanese EUC; that is, ISO-2022-JP is first converted to Shift-JIS or Japanese EUC and then converted by one of the tables shown in 5.3 Shift-JIS or 5.4 Japanese EUC (Compressed). Therefore, for each conversion table from Japanese EUC or Shift-JIS, one conversion table from ISO-2022-JP is constructed.

NOTE: ISO-2022-JP is a charset designed for message transmission such as EMAIL. One can thus safely assume that information in ISO-2022-JP was temporarily converted from Shift-JIS or Japanese EUC for message transmission. Therefore, it is reasonable to convert ISO-2022-JP to [ISO/IEC10646 (all parts)] or [Unicode 3.2] via Shift-JIS or Japanese EUC.
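
The first step of this two-step conversion can be sketched as follows. This is a minimal, non-normative Python sketch that assumes well-formed input using only the escape sequences of [IETF RFC 1468] and the commonly used arithmetic for re-coding a [JIS X 0208] octet pair as Shift-JIS; the function names jis_to_sjis and iso2022jp_to_sjis are hypothetical. The resulting Shift-JIS octets would then be converted by one of the Shift-JIS conversion tables.

```python
ESC = 0x1B

# Escape sequences of RFC 1468 and the character set each one designates.
DESIGNATIONS = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "jisx0201-roman",
    b"\x1b$@": "jisx0208",
    b"\x1b$B": "jisx0208",
}


def jis_to_sjis(j1: int, j2: int) -> bytes:
    """Re-code one JIS X 0208 octet pair (0x21-0x7E each) as Shift-JIS."""
    s1 = (j1 + 1) // 2 + (0x70 if j1 < 0x5F else 0xB0)
    if j1 % 2:
        s2 = j2 + (0x1F if j2 < 0x60 else 0x20)
    else:
        s2 = j2 + 0x7E
    return bytes([s1, s2])


def iso2022jp_to_sjis(data: bytes) -> bytes:
    """Convert ISO-2022-JP octets to Shift-JIS octets (no Unicode involved)."""
    out = bytearray()
    mode = "ascii"
    i = 0
    while i < len(data):
        if data[i] == ESC:
            sequence = data[i:i + 3]
            if sequence not in DESIGNATIONS:
                raise ValueError(f"unsupported escape sequence: {sequence!r}")
            mode = DESIGNATIONS[sequence]
            i += 3
        elif mode == "jisx0208":
            if i + 1 >= len(data):
                raise ValueError("truncated two-octet character")
            out += jis_to_sjis(data[i], data[i + 1])
            i += 2
        else:
            # ASCII and JIS X 0201 Roman octets are carried over unchanged.
            out.append(data[i])
            i += 1
    return bytes(out)
```

Because this step only re-codes octets, all of the ambiguity is deferred to the Shift-JIS conversion table chosen for the second step.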

NOTE: [IETF RFC 1468] allows those characters in [US-ASCII], [JIS X 0201], and [JIS X 0208] only. However, some implementations represent Microsoft CP932 in ISO-2022-JP by allowing NEC special characters, NEC-selected IBM extended characters, IBM extension characters, and user-defined characters. x-iso2022jp-cp932 refers to this variation of ISO-2022-JP.

After omitting identical conversion tables, five conversion tables are obtained. This technical report names them x-iso2022jp-unicode-0_9, x-iso2022jp-jisx0221-1995, x-iso2022jp-cp932, x-iso2022jp-jdk1_1_7, and x-iso2022jp-19970715-ascii. These conversion tables are not identical to each other.

Correspondences between ISO-2022-JP and Shift-JIS or Japanese EUC
	Conversion from Shift-JIS or Japanese EUC	Conversion from ISO-2022-JP
Shift-JIS	x-sjis-jdk1_1_7	x-iso2022jp-jdk1_1_7
	x-sjis-unicode-0_9	x-iso2022jp-unicode-0_9
	x-sjis-jisx0221-1995	x-iso2022jp-jisx0221-1995
	windows-31J	x-iso2022jp-cp932
Japanese EUC	x-eucjp-open-19970715-ms	x-iso2022jp-cp932
	x-eucjp-open-19970715-0201	x-iso2022jp-jisx0221-1995
	x-eucjp-open-19970715-ascii	x-iso2022jp-19970715-ascii
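
The correspondence table above can be expressed as a simple lookup. The following non-normative Python sketch copies the rows of the table as given; the names ISO2022JP_TABLE_FOR and iso2022jp_table are hypothetical.

```python
# Rows of the correspondence table: source table name -> ISO-2022-JP table name.
ISO2022JP_TABLE_FOR = {
    # Conversion from Shift-JIS
    "x-sjis-jdk1_1_7": "x-iso2022jp-jdk1_1_7",
    "x-sjis-unicode-0_9": "x-iso2022jp-unicode-0_9",
    "x-sjis-jisx0221-1995": "x-iso2022jp-jisx0221-1995",
    "windows-31J": "x-iso2022jp-cp932",
    # Conversion from Japanese EUC
    "x-eucjp-open-19970715-ms": "x-iso2022jp-cp932",
    "x-eucjp-open-19970715-0201": "x-iso2022jp-jisx0221-1995",
    "x-eucjp-open-19970715-ascii": "x-iso2022jp-19970715-ascii",
}


def iso2022jp_table(source_table: str) -> str:
    """Return the ISO-2022-JP conversion table constructed from a source table."""
    return ISO2022JP_TABLE_FOR[source_table]


# Identical tables collapse to a single name.
distinct_tables = sorted(set(ISO2022JP_TABLE_FOR.values()))
```

Counting the distinct values reproduces the five conversion tables named above.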

An escape sequence of [IETF RFC 1468] (1B 24 40, 1B 24 42, 1B 28 42, or 1B 28 4A) is an error if it occurs before the end of an encoding declaration in an ISO-2022-JP XML constituent. Results are undefined.

NOTE: [XML] specifies that an XML processor must report an error and stop normal processing when it is unable to process the employed CES. Most XML processors cannot handle occurrences of bit combination 1B before the end of an encoding declaration. Such an occurrence is defined as an error (whose results are undefined) rather than a fatal error (which requires suspension) in order not to void XML processors based on an existing code conversion library. If it is defined as a fatal error, such XML processors might become non-conformant.
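
A check for this condition might look like the following non-normative Python sketch. As a simplifying assumption, it treats the end of the XML declaration ("?>") as the end of the encoding declaration and looks for bit combination 1B before it; the function name escape_precedes_declaration_end is hypothetical.

```python
ESC = b"\x1b"


def escape_precedes_declaration_end(octets: bytes) -> bool:
    """Report whether bit combination 1B occurs before the XML declaration closes."""
    end = octets.find(b"?>")
    if not octets.startswith(b"<?xml") or end == -1:
        # No XML declaration was found, so there is nothing to inspect.
        return False
    return ESC in octets[:end]


# Escape sequences appear only after the declaration: not an error.
acceptable = '<?xml version="1.0" encoding="iso-2022-jp"?><文書>日本語</文書>'.encode(
    "iso2022_jp"
)

# An escape sequence inside the declaration: an error with undefined results.
erroneous = b'<?xml version="1.0" \x1b$B\x1b(Bencoding="iso-2022-jp"?><a/>'
```

Since the results of the error are undefined, a processor built on an existing code conversion library remains free to either report the condition or continue.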

NOTE: Use of ISO-2022-JP cannot provide interoperability in information interchange, since any of these conversion tables or some other conversion tables might be used.

Appendices

A Name Characters (Non-Normative)

All graphic characters contained in 10th area and later of [JIS X 0208] and 5th area and later of [JIS X 0212] are name start characters.

Table 1 lists name start characters in the 1st through 9th areas of [JIS X 0208]. Tables 2 and 3 list the other name characters in these areas and the name characters in the 1st through 4th areas of [JIS X 0212], respectively. All name characters in [JIS X 0212] are name start characters.

NOTE: In [XML], name characters are characters which can be used as part of element and attribute names, and name start characters are those name characters with which names may begin.

NOTE: This chapter merely adapts [XML] name characters to [JIS X 0208] and [JIS X 0212].

Name start characters: 1st through 9th areas of [JIS X 0208]
	KUTEN code	UCS code points
仝	01-24	U+4edd
〇	01-27	U+3007
Å	02-82	U+212b
ぁ	04-01	U+3041
あ	04-02	U+3042
ぃ	04-03	U+3043
い	04-04	U+3044
ぅ	04-05	U+3045
う	04-06	U+3046
ぇ	04-07	U+3047
え	04-08	U+3048
ぉ	04-09	U+3049
お	04-10	U+304a
か	04-11	U+304b
が	04-12	U+304c
き	04-13	U+304d
ぎ	04-14	U+304e
く	04-15	U+304f
ぐ	04-16	U+3050
け	04-17	U+3051
げ	04-18	U+3052
こ	04-19	U+3053
ご	04-20	U+3054
さ	04-21	U+3055
ざ	04-22	U+3056
し	04-23	U+3057
じ	04-24	U+3058
す	04-25	U+3059
ず	04-26	U+305a
せ	04-27	U+305b
ぜ	04-28	U+305c
そ	04-29	U+305d
ぞ	04-30	U+305e
た	04-31	U+305f
だ	04-32	U+3060
ち	04-33	U+3061
ぢ	04-34	U+3062
っ	04-35	U+3063
つ	04-36	U+3064
づ	04-37	U+3065
て	04-38	U+3066
で	04-39	U+3067
と	04-40	U+3068
ど	04-41	U+3069
な	04-42	U+306a
に	04-43	U+306b
ぬ	04-44	U+306c
ね	04-45	U+306d
の	04-46	U+306e
は	04-47	U+306f
ば	04-48	U+3070
ぱ	04-49	U+3071
ひ	04-50	U+3072
び	04-51	U+3073
ぴ	04-52	U+3074
ふ	04-53	U+3075
ぶ	04-54	U+3076
ぷ	04-55	U+3077
へ	04-56	U+3078
べ	04-57	U+3079
ぺ	04-58	U+307a
ほ	04-59	U+307b
ぼ	04-60	U+307c
ぽ	04-61	U+307d
ま	04-62	U+307e
み	04-63	U+307f
む	04-64	U+3080
め	04-65	U+3081
も	04-66	U+3082
ゃ	04-67	U+3083
や	04-68	U+3084
ゅ	04-69	U+3085
ゆ	04-70	U+3086
ょ	04-71	U+3087
よ	04-72	U+3088
ら	04-73	U+3089
り	04-74	U+308a
る	04-75	U+308b
れ	04-76	U+308c
ろ	04-77	U+308d
ゎ	04-78	U+308e
わ	04-79	U+308f
ゐ	04-80	U+3090
ゑ	04-81	U+3091
を	04-82	U+3092
ん	04-83	U+3093
ァ	05-01	U+30a1
ア	05-02	U+30a2
ィ	05-03	U+30a3
イ	05-04	U+30a4
ゥ	05-05	U+30a5
ウ	05-06	U+30a6
ェ	05-07	U+30a7
エ	05-08	U+30a8
ォ	05-09	U+30a9
オ	05-10	U+30aa
カ	05-11	U+30ab
ガ	05-12	U+30ac
キ	05-13	U+30ad
ギ	05-14	U+30ae
ク	05-15	U+30af
グ	05-16	U+30b0
ケ	05-17	U+30b1
ゲ	05-18	U+30b2
コ	05-19	U+30b3
ゴ	05-20	U+30b4
サ	05-21	U+30b5
ザ	05-22	U+30b6
シ	05-23	U+30b7
ジ	05-24	U+30b8
ス	05-25	U+30b9
ズ	05-26	U+30ba
セ	05-27	U+30bb
ゼ	05-28	U+30bc
ソ	05-29	U+30bd
ゾ	05-30	U+30be
タ	05-31	U+30bf
ダ	05-32	U+30c0
チ	05-33	U+30c1
ヂ	05-34	U+30c2
ッ	05-35	U+30c3
ツ	05-36	U+30c4
ヅ	05-37	U+30c5
テ	05-38	U+30c6
デ	05-39	U+30c7
ト	05-40	U+30c8
ド	05-41	U+30c9
ナ	05-42	U+30ca
ニ	05-43	U+30cb
ヌ	05-44	U+30cc
ネ	05-45	U+30cd
ノ	05-46	U+30ce
ハ	05-47	U+30cf
バ	05-48	U+30d0
パ	05-49	U+30d1
ヒ	05-50	U+30d2
ビ	05-51	U+30d3
ピ	05-52	U+30d4
フ	05-53	U+30d5
ブ	05-54	U+30d6
プ	05-55	U+30d7
ヘ	05-56	U+30d8
ベ	05-57	U+30d9
ペ	05-58	U+30da
ホ	05-59	U+30db
ボ	05-60	U+30dc
ポ	05-61	U+30dd
マ	05-62	U+30de
ミ	05-63	U+30df
ム	05-64	U+30e0
メ	05-65	U+30e1
モ	05-66	U+30e2
ャ	05-67	U+30e3
ヤ	05-68	U+30e4
ュ	05-69	U+30e5
ユ	05-70	U+30e6
ョ	05-71	U+30e7
ヨ	05-72	U+30e8
ラ	05-73	U+30e9
リ	05-74	U+30ea
ル	05-75	U+30eb
レ	05-76	U+30ec
ロ	05-77	U+30ed
ヮ	05-78	U+30ee
ワ	05-79	U+30ef
ヰ	05-80	U+30f0
ヱ	05-81	U+30f1
ヲ	05-82	U+30f2
ン	05-83	U+30f3
ヴ	05-84	U+30f4
ヵ	05-85	U+30f5
ヶ	05-86	U+30f6
Α	06-01	U+0391
Β	06-02	U+0392
Γ	06-03	U+0393
Δ	06-04	U+0394
Ε	06-05	U+0395
Ζ	06-06	U+0396
Η	06-07	U+0397
Θ	06-08	U+0398
Ι	06-09	U+0399
Κ	06-10	U+039a
Λ	06-11	U+039b
Μ	06-12	U+039c
Ν	06-13	U+039d
Ξ	06-14	U+039e
Ο	06-15	U+039f
Π	06-16	U+03a0
Ρ	06-17	U+03a1
Σ	06-18	U+03a3
Τ	06-19	U+03a4
Υ	06-20	U+03a5
Φ	06-21	U+03a6
Χ	06-22	U+03a7
Ψ	06-23	U+03a8
Ω	06-24	U+03a9
α	06-33	U+03b1
β	06-34	U+03b2
γ	06-35	U+03b3
δ	06-36	U+03b4
ε	06-37	U+03b5
ζ	06-38	U+03b6
η	06-39	U+03b7
θ	06-40	U+03b8
ι	06-41	U+03b9
κ	06-42	U+03ba
λ	06-43	U+03bb
μ	06-44	U+03bc
ν	06-45	U+03bd
ξ	06-46	U+03be
ο	06-47	U+03bf
π	06-48	U+03c0
ρ	06-49	U+03c1
σ	06-50	U+03c3
τ	06-51	U+03c4
υ	06-52	U+03c5
φ	06-53	U+03c6
χ	06-54	U+03c7
ψ	06-55	U+03c8
ω	06-56	U+03c9
А	07-01	U+0410
Б	07-02	U+0411
В	07-03	U+0412
Г	07-04	U+0413
Д	07-05	U+0414
Е	07-06	U+0415
Ё	07-07	U+0401
Ж	07-08	U+0416
З	07-09	U+0417
И	07-10	U+0418
Й	07-11	U+0419
К	07-12	U+041a
Л	07-13	U+041b
М	07-14	U+041c
Н	07-15	U+041d
О	07-16	U+041e
П	07-17	U+041f
Р	07-18	U+0420
С	07-19	U+0421
Т	07-20	U+0422
У	07-21	U+0423
Ф	07-22	U+0424
Х	07-23	U+0425
Ц	07-24	U+0426
Ч	07-25	U+0427
Ш	07-26	U+0428
Щ	07-27	U+0429
Ъ	07-28	U+042a
Ы	07-29	U+042b
Ь	07-30	U+042c
Э	07-31	U+042d
Ю	07-32	U+042e
Я	07-33	U+042f
а	07-49	U+0430
б	07-50	U+0431
в	07-51	U+0432
г	07-52	U+0433
д	07-53	U+0434
е	07-54	U+0435
ё	07-55	U+0451
ж	07-56	U+0436
з	07-57	U+0437
и	07-58	U+0438
й	07-59	U+0439
к	07-60	U+043a
л	07-61	U+043b
м	07-62	U+043c
н	07-63	U+043d
о	07-64	U+043e
п	07-65	U+043f
р	07-66	U+0440
с	07-67	U+0441
т	07-68	U+0442
у	07-69	U+0443
ф	07-70	U+0444
х	07-71	U+0445
ц	07-72	U+0446
ч	07-73	U+0447
ш	07-74	U+0448
щ	07-75	U+0449
ъ	07-76	U+044a
ы	07-77	U+044b
ь	07-78	U+044c
э	07-79	U+044d
ю	07-80	U+044e
я	07-81	U+044f

Name characters other than name start characters: 1st through 9th areas of [JIS X 0208]
	KUTEN code	UCS code points
ヽ	01-19	U+30fd
ヾ	01-20	U+30fe
ゝ	01-21	U+309d
ゞ	01-22	U+309e
々	01-25	U+3005
ー	01-28	U+30fc
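
The relationship between the KUTEN codes and the UCS code points in the two [JIS X 0208] tables above can be checked mechanically. The following non-normative Python sketch assumes Python's built-in "euc_jp" codec as the conversion table, which is not one of the tables named in this technical report; the function names kuten_to_character and ucs_code_point are hypothetical.

```python
def kuten_to_character(ku: int, ten: int) -> str:
    """Convert a [JIS X 0208] KUTEN code to a character via its EUC-JP octets."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in the range 1-94")
    # In Japanese EUC, each octet of a JIS X 0208 character is 0xA0 plus ku or ten.
    return bytes([0xA0 + ku, 0xA0 + ten]).decode("euc_jp")


def ucs_code_point(ku: int, ten: int) -> str:
    """Format the UCS code point for a KUTEN code in the style of the tables."""
    return f"U+{ord(kuten_to_character(ku, ten)):04x}"
```

For example, ucs_code_point(4, 2) corresponds to the row for KUTEN code 04-02.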

Name start characters: 1st through 4th areas of [JIS X 0212]
	KUTEN code	UCS code points
Ά	06-65	U+0386
Έ	06-66	U+0388
Ή	06-67	U+0389
Ί	06-68	U+038a
Ϊ	06-69	U+03aa
Ό	06-71	U+038c
Ύ	06-73	U+038e
Ϋ	06-74	U+03ab
Ώ	06-76	U+038f
ά	06-81	U+03ac
έ	06-82	U+03ad
ή	06-83	U+03ae
ί	06-84	U+03af
ϊ	06-85	U+03ca
ΐ	06-86	U+0390
ό	06-87	U+03cc
ς	06-88	U+03c2
ύ	06-89	U+03cd
ϋ	06-90	U+03cb
ΰ	06-91	U+03b0
ώ	06-92	U+03ce
Ђ	07-34	U+0402
Ѓ	07-35	U+0403
Є	07-36	U+0404
Ѕ	07-37	U+0405
І	07-38	U+0406
Ї	07-39	U+0407
Ј	07-40	U+0408
Љ	07-41	U+0409
Њ	07-42	U+040a
Ћ	07-43	U+040b
Ќ	07-44	U+040c
Ў	07-45	U+040e
Џ	07-46	U+040f
ђ	07-82	U+0452
ѓ	07-83	U+0453
є	07-84	U+0454
ѕ	07-85	U+0455
і	07-86	U+0456
ї	07-87	U+0457
ј	07-88	U+0458
љ	07-89	U+0459
њ	07-90	U+045a
ћ	07-91	U+045b
ќ	07-92	U+045c
ў	07-93	U+045e
џ	07-94	U+045f
Æ	09-01	U+00c6
Đ	09-02	U+0110
Ħ	09-04	U+0126
Ł	09-08	U+0141
Ŋ	09-11	U+014a
Ø	09-12	U+00d8
Œ	09-13	U+0152
Ŧ	09-15	U+0166
Þ	09-16	U+00de
æ	09-33	U+00e6
đ	09-34	U+0111
ð	09-35	U+00f0
ħ	09-36	U+0127
ı	09-37	U+0131
ĸ	09-39	U+0138
ł	09-40	U+0142
ŋ	09-43	U+014b
ø	09-44	U+00f8
œ	09-45	U+0153
ß	09-46	U+00df
ŧ	09-47	U+0167
þ	09-48	U+00fe
Á	10-01	U+00c1
À	10-02	U+00c0
Ä	10-03	U+00c4
Â	10-04	U+00c2
Ă	10-05	U+0102
Ǎ	10-06	U+01cd
Ā	10-07	U+0100
Ą	10-08	U+0104
Å	10-09	U+00c5
Ã	10-10	U+00c3
Ć	10-11	U+0106
Ĉ	10-12	U+0108
Č	10-13	U+010c
Ç	10-14	U+00c7
Ċ	10-15	U+010a
Ď	10-16	U+010e
É	10-17	U+00c9
È	10-18	U+00c8
Ë	10-19	U+00cb
Ê	10-20	U+00ca
Ě	10-21	U+011a
Ė	10-22	U+0116
Ē	10-23	U+0112
Ę	10-24	U+0118
Ĝ	10-26	U+011c
Ğ	10-27	U+011e
Ģ	10-28	U+0122
Ġ	10-29	U+0120
Ĥ	10-30	U+0124
Í	10-31	U+00cd
Ì	10-32	U+00cc
Ï	10-33	U+00cf
Î	10-34	U+00ce
Ǐ	10-35	U+01cf
İ	10-36	U+0130
Ī	10-37	U+012a
Į	10-38	U+012e
Ĩ	10-39	U+0128
Ĵ	10-40	U+0134
Ķ	10-41	U+0136
Ĺ	10-42	U+0139
Ľ	10-43	U+013d
Ļ	10-44	U+013b
Ń	10-45	U+0143
Ň	10-46	U+0147
Ņ	10-47	U+0145
Ñ	10-48	U+00d1
Ó	10-49	U+00d3
Ò	10-50	U+00d2
Ö	10-51	U+00d6
Ô	10-52	U+00d4
Ǒ	10-53	U+01d1
Ő	10-54	U+0150
Ō	10-55	U+014c
Õ	10-56	U+00d5
Ŕ	10-57	U+0154
Ř	10-58	U+0158
Ŗ	10-59	U+0156
Ś	10-60	U+015a
Ŝ	10-61	U+015c
Š	10-62	U+0160
Ş	10-63	U+015e
Ť	10-64	U+0164
Ţ	10-65	U+0162
Ú	10-66	U+00da
Ù	10-67	U+00d9
Ü	10-68	U+00dc
Û	10-69	U+00db
Ŭ	10-70	U+016c
Ǔ	10-71	U+01d3
Ű	10-72	U+0170
Ū	10-73	U+016a
Ų	10-74	U+0172
Ů	10-75	U+016e
Ũ	10-76	U+0168
Ǘ	10-77	U+01d7
Ǜ	10-78	U+01db
Ǚ	10-79	U+01d9
Ǖ	10-80	U+01d5
Ŵ	10-81	U+0174
Ý	10-82	U+00dd
Ÿ	10-83	U+0178
Ŷ	10-84	U+0176
Ź	10-85	U+0179
Ž	10-86	U+017d
Ż	10-87	U+017b
á	11-01	U+00e1
à	11-02	U+00e0
ä	11-03	U+00e4
â	11-04	U+00e2
ă	11-05	U+0103
ǎ	11-06	U+01ce
ā	11-07	U+0101
ą	11-08	U+0105
å	11-09	U+00e5
ã	11-10	U+00e3
ć	11-11	U+0107
ĉ	11-12	U+0109
č	11-13	U+010d
ç	11-14	U+00e7
ċ	11-15	U+010b
ď	11-16	U+010f
é	11-17	U+00e9
è	11-18	U+00e8
ë	11-19	U+00eb
ê	11-20	U+00ea
ě	11-21	U+011b
ė	11-22	U+0117
ē	11-23	U+0113
ę	11-24	U+0119
ǵ	11-25	U+01f5
ĝ	11-26	U+011d
ğ	11-27	U+011f
ġ	11-29	U+0121
ĥ	11-30	U+0125
í	11-31	U+00ed
ì	11-32	U+00ec
ï	11-33	U+00ef
î	11-34	U+00ee
ǐ	11-35	U+01d0
ī	11-37	U+012b
į	11-38	U+012f
ĩ	11-39	U+0129
ĵ	11-40	U+0135
ķ	11-41	U+0137
ĺ	11-42	U+013a
ľ	11-43	U+013e
ļ	11-44	U+013c
ń	11-45	U+0144
ň	11-46	U+0148
ņ	11-47	U+0146
ñ	11-48	U+00f1
ó	11-49	U+00f3
ò	11-50	U+00f2
ö	11-51	U+00f6
ô	11-52	U+00f4
ǒ	11-53	U+01d2
ő	11-54	U+0151
ō	11-55	U+014d
õ	11-56	U+00f5
ŕ	11-57	U+0155
ř	11-58	U+0159
ŗ	11-59	U+0157
ś	11-60	U+015b
ŝ	11-61	U+015d
š	11-62	U+0161
ş	11-63	U+015f
ť	11-64	U+0165
ţ	11-65	U+0163
ú	11-66	U+00fa
ù	11-67	U+00f9
ü	11-68	U+00fc
û	11-69	U+00fb
ŭ	11-70	U+016d
ǔ	11-71	U+01d4
ű	11-72	U+0171
ū	11-73	U+016b
ų	11-74	U+0173
ů	11-75	U+016f
ũ	11-76	U+0169
ǘ	11-77	U+01d8
ǜ	11-78	U+01dc
ǚ	11-79	U+01da
ǖ	11-80	U+01d6
ŵ	11-81	U+0175
ý	11-82	U+00fd
ÿ	11-83	U+00ff
ŷ	11-84	U+0177
ź	11-85	U+017a
ž	11-86	U+017e
ż	11-87	U+017c

B Needs for Japanese XML profile (Non-Normative)

[XML] adopts [ISO/IEC10646 (all parts)] or [Unicode 3.2] as the CCS, which contains all Japanese characters. UTF-8 and UTF-16 are the recommended CESs, and implementations are required to support them. Other existing CESs are optionally allowed, as long as they represent characters in [ISO/IEC10646 (all parts)] or [Unicode 3.2] only.

[XML], however, provides little information on existing CESs already in use for the interchange of Japanese characters. Such CESs are allowed as mere options among many others. Furthermore, [XML] says nothing about the appropriate CESs for each protocol (e.g. SMTP or HTTP) and those for information exchange files.

The mapping between such existing CESs and [ISO/IEC10646 (all parts)]/[Unicode 3.2] is not specified either. Some mutually different conversions are in use, and thus different XML processors may emit different outputs.

This technical report addresses existing CESs and clarifies open issues. Although problems with the use of such CESs are not solved, the nature of these problems has become clear.

C Ambiguities in conversion from Shift-JIS to Unicode (Non-Normative)

There are four main ambiguities in conversion from Shift-JIS to [ISO/IEC10646 (all parts)] and [Unicode 3.2].

First, 0x5C and 0x7E are respectively converted to the yen sign and the overline by x-sjis-unicode-0_9 and x-sjis-jisx0221-1995, but respectively converted to backslash and tilde by windows-31J and x-sjis-jdk1_1_7.

Second, windows-31J is the only conversion table which provides peculiar mapping of 0x8160(WAVE DASH), 0x8161(DOUBLE VERTICAL LINE), 0x817C(MINUS SIGN), 0x8191(CENT SIGN), 0x8192(POUND SIGN) and 0x81CA(NOT SIGN).

Third, x-sjis-jisx0221-1995 is the only conversion table which maps 0x815C(FULLWIDTH EM DASH) to U+2014(EM DASH); the other conversion tables map it to U+2015(HORIZONTAL BAR).

Fourth, windows-31J is the only conversion table which contains NEC special characters, NEC-selected IBM extended characters, IBM extension characters, and user-defined characters.

Ambiguities in conversion from Shift-JIS to [Unicode 3.2]/[ISO/IEC10646 (all parts)]
Octets in Shift-JIS	x-sjis-jdk1_1_7	x-sjis-unicode-0_9	x-sjis-jisx0221-1995	windows-31J
0x5C(YEN SIGN)	U+005C(REVERSE SOLIDUS)	U+00A5(YEN SIGN)	U+00A5(YEN SIGN)	U+005C(REVERSE SOLIDUS)
0x7E(OVERLINE)	U+007E(TILDE)	U+203E(OVERLINE)	U+203E(OVERLINE)	U+007E(TILDE)
0x815C(FULLWIDTH EM DASH)	U+2015(HORIZONTAL BAR)	U+2015(HORIZONTAL BAR)	U+2014(EM DASH)	U+2015(HORIZONTAL BAR)
0x815F(REVERSE SOLIDUS)	U+005C(REVERSE SOLIDUS)	U+005C(REVERSE SOLIDUS)	U+005C(REVERSE SOLIDUS)	U+FF3C(FULLWIDTH REVERSE SOLIDUS)
0x8160(WAVE DASH)	U+301C(WAVE DASH)	U+301C(WAVE DASH)	U+301C(WAVE DASH)	U+FF5E(FULLWIDTH TILDE)
0x8161(DOUBLE VERTICAL LINE)	U+2016(DOUBLE VERTICAL LINE)	U+2016(DOUBLE VERTICAL LINE)	U+2016(DOUBLE VERTICAL LINE)	U+2225(PARALLEL TO)
0x817C(MINUS SIGN)	U+2212(MINUS SIGN)	U+2212(MINUS SIGN)	U+2212(MINUS SIGN)	U+FF0D(FULLWIDTH HYPHEN-MINUS)
0x8191(CENT SIGN)	U+00A2(CENT SIGN)	U+00A2(CENT SIGN)	U+00A2(CENT SIGN)	U+FFE0(FULLWIDTH CENT SIGN)
0x8192(POUND SIGN)	U+00A3(POUND SIGN)	U+00A3(POUND SIGN)	U+00A3(POUND SIGN)	U+FFE1(FULLWIDTH POUND SIGN)
0x81CA(NOT SIGN)	U+00AC(NOT SIGN)	U+00AC(NOT SIGN)	U+00AC(NOT SIGN)	U+FFE2(FULLWIDTH NOT SIGN)
Extended characters in 13th, 89th-92nd, and 115th-119th rows	None	None	None	Included
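
To make the table above easier to query, the following non-normative Python sketch encodes a few of its rows as data and reports which conversion table, if any, stands alone for given octets. The values are copied from the table; the names TABLES, MAPPINGS, and outliers are hypothetical.

```python
TABLES = (
    "x-sjis-jdk1_1_7",
    "x-sjis-unicode-0_9",
    "x-sjis-jisx0221-1995",
    "windows-31J",
)

# Selected rows: Shift-JIS octets -> code point under each table, in TABLES order.
MAPPINGS = {
    0x5C: (0x005C, 0x00A5, 0x00A5, 0x005C),
    0x7E: (0x007E, 0x203E, 0x203E, 0x007E),
    0x815C: (0x2015, 0x2015, 0x2014, 0x2015),
    0x8160: (0x301C, 0x301C, 0x301C, 0xFF5E),
    0x81CA: (0x00AC, 0x00AC, 0x00AC, 0xFFE2),
}


def outliers(octets: int) -> list:
    """Return the tables whose mapping for these octets is shared by no other table."""
    row = MAPPINGS[octets]
    return [name for name, cp in zip(TABLES, row) if row.count(cp) == 1]
```

For 0x5C and 0x7E the tables split two against two, so no single table is an outlier, whereas the double-octet rows each have exactly one.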

D Ambiguities in conversion from Japanese EUC to Unicode (Non-Normative)

X-eucjp-unicode-0_9 and x-eucjp-jisx0221-1995 cover only the characters defined by [US-ASCII], [JIS X 0208], and [JIS X 0212] and Halfwidth Katakana defined by [JIS X 0201]. In addition to these characters, conversion tables defined by OSF Japanese Vendors Council (OSF/JVC) [TOG/JVC ucs-conv], namely x-eucjp-open-19970715-ms, x-eucjp-open-19970715-0201, and x-eucjp-open-19970715-ascii, cover IBM extension characters (0x8FF3F3-0x8FF4FE), NEC special characters (0xADA1-0xADFC), and user-defined characters (0xF5A1-0xFEFE and 0x8FF5A1-0x8FFEFE).

D.1 Range 0x20-0x7E ([US-ASCII] or [JIS X 0201])

As defined in the original specification of the Japanese EUC, the range 0x20-0x7E of x-eucjp-unicode-0_9, x-eucjp-jisx0221-1995, x-eucjp-open-19970715-ms, and x-eucjp-open-19970715-ascii is assumed to be [US-ASCII]. Charset x-eucjp-open-19970715-0201 is an exception, since this range is assumed to be [JIS X 0201] and converted as below:

Conversion in x-eucjp-open-19970715-0201
Octets in EUC	UCS code point
0x5C(REVERSE SOLIDUS)	U+00A5(YEN SIGN)
0x7E(TILDE)	U+203E(OVERLINE)
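
The exception can be sketched in a few lines. The following non-normative Python example assumes that octets in the range 0x20-0x7E are decoded as [US-ASCII] and that the two code points listed above are then substituted; the names JISX0201_ROMAN_OVERRIDES and decode_range_as_jisx0201 are hypothetical.

```python
# The two octets whose conversion differs from US-ASCII, as listed in the table above.
JISX0201_ROMAN_OVERRIDES = {0x5C: 0x00A5, 0x7E: 0x203E}


def decode_range_as_jisx0201(octets: bytes) -> str:
    """Decode octets in the range 0x20-0x7E the way x-eucjp-open-19970715-0201 does."""
    if any(not 0x20 <= b <= 0x7E for b in octets):
        raise ValueError("only octets in the range 0x20-0x7E are handled")
    # Decode as US-ASCII, then replace REVERSE SOLIDUS and TILDE.
    return octets.decode("ascii").translate(JISX0201_ROMAN_OVERRIDES)
```

With this table, the octets for a path such as C:\temp would be reported with yen signs in place of backslashes, while the other four tables leave them unchanged.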

D.2 Range 0x8EA1-0x8EDF(Halfwidth Katakana)

Halfwidth Katakana in [JIS X 0201] are allocated to the range 0x8EA1-0x8EDF. These code points are converted to the range U+FF61-U+FF9F in the compatibility area of [ISO/IEC10646 (all parts)] and [Unicode 3.2] by any of the conversion tables.
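
Because both ranges are contiguous and of equal length, this conversion is a fixed offset. The following non-normative Python sketch assumes that the octets are mapped in order, so that 0x8EA1 corresponds to U+FF61 and 0x8EDF to U+FF9F; the function name halfwidth_katakana_code_point is hypothetical.

```python
SS2 = 0x8E  # single-shift octet that introduces Halfwidth Katakana in Japanese EUC


def halfwidth_katakana_code_point(octets: bytes) -> int:
    """Map two Japanese EUC octets in 0x8EA1-0x8EDF to a UCS code point."""
    if len(octets) != 2 or octets[0] != SS2 or not 0xA1 <= octets[1] <= 0xDF:
        raise ValueError("expected two octets in the range 0x8EA1-0x8EDF")
    # The second octet is offset from 0xA1 exactly as the result is from U+FF61.
    return 0xFF61 + (octets[1] - 0xA1)
```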

D.3 Ambiguities in encoding the characters of [JIS X 0208] and [JIS X 0212]

As for those characters specified by [JIS X 0208] and [JIS X 0212], charsets x-eucjp-jisx0221-1995 and x-eucjp-open-19970715-0201 provide an identical mapping.

X-eucjp-unicode-0_9 is different from these two charsets only in the mapping of 0xA1BD (EM DASH/ DASH(FULLWIDTH)).

Charset x-eucjp-open-19970715-ascii is different in the mapping of four code points, namely 0xA1B1(OVERLINE/NEGATION SIGN), 0xA1C0(REVERSE SOLIDUS), 0xA1EF(YEN SIGN), and 0x8FA2B7(TILDE).

Charset x-eucjp-open-19970715-ms is different in the mapping of 0xA1BD(FULLWIDTH EM DASH), 0xA1C0(REVERSE SOLIDUS), 0xA1C1(WAVE DASH), 0xA1C2(DOUBLE VERTICAL LINE), 0xA1DD(MINUS SIGN), 0xA1F1(CENT SIGN), 0xA1F2(POUND SIGN), 0xA2CC(NOT SIGN), 0x8FA2B7(TILDE), and 0x8FA2C3(BROKEN BAR).

D.4 Ambiguities in IBM Extended Characters, NEC Special Characters, and User-defined Characters

Charsets x-eucjp-open-19970715-ms, x-eucjp-open-19970715-0201, and x-eucjp-open-19970715-ascii contain IBM extended characters, NEC special characters, and (the area of) user-defined characters. These charsets provide an identical mapping of these characters. Charsets x-eucjp-unicode-0_9 and x-eucjp-jisx0221-1995, on the other hand, do not contain these characters.

Ambiguities in conversion from Japanese EUC to [Unicode 3.2]/[ISO/IEC10646 (all parts)](1)
Octets in Japanese EUC	x-eucjp-unicode-0_9	x-eucjp-jisx0221-1995	x-eucjp-open-19970715-ms
0x5C(REVERSE SOLIDUS)	U+005C(REVERSE SOLIDUS)	U+005C(REVERSE SOLIDUS)	U+005C(REVERSE SOLIDUS)
0x7E(TILDE)	U+007E(TILDE)	U+007E(TILDE)	U+007E(TILDE)
0xA1B1(OVERLINE/NEGATION SIGN)	U+FFE3(FULLWIDTH MACRON)	U+FFE3(FULLWIDTH MACRON)	U+FFE3(FULLWIDTH MACRON)
0xA1BD(FULLWIDTH EM DASH)	U+2015(HORIZONTAL BAR)	U+2014(EM DASH)	U+2015(HORIZONTAL BAR)
0xA1C0(REVERSE SOLIDUS)	U+005C(REVERSE SOLIDUS)	U+005C(REVERSE SOLIDUS)	U+FF3C(FULLWIDTH REVERSE SOLIDUS)
0xA1C1(WAVE DASH)	U+301C(WAVE DASH)	U+301C(WAVE DASH)	U+FF5E(FULLWIDTH TILDE)
0xA1C2(DOUBLE VERTICAL LINE)	U+2016(DOUBLE VERTICAL LINE)	U+2016(DOUBLE VERTICAL LINE)	U+2225(PARALLEL TO)
0xA1DD(MINUS SIGN)	U+2212(MINUS SIGN)	U+2212(MINUS SIGN)	U+FF0D(FULLWIDTH HYPHEN-MINUS)
0xA1EF(YEN SIGN)	U+FFE5(FULLWIDTH YEN SIGN)	U+FFE5(FULLWIDTH YEN SIGN)	U+FFE5(FULLWIDTH YEN SIGN)
0xA1F1(CENT SIGN)	U+00A2(CENT SIGN)	U+00A2(CENT SIGN)	U+FFE0(FULLWIDTH CENT SIGN)
0xA1F2(POUND SIGN)	U+00A3(POUND SIGN)	U+00A3(POUND SIGN)	U+FFE1(FULLWIDTH POUND SIGN)
0xA2CC(NOT SIGN)	U+00AC(NOT SIGN)	U+00AC(NOT SIGN)	U+FFE2(FULLWIDTH NOT SIGN)
0x8FA2B7(TILDE)	U+007E(TILDE)	U+007E(TILDE)	U+FF5E(FULLWIDTH TILDE)
0x8FA2C3(BROKEN BAR)	U+00A6(BROKEN BAR)	U+00A6(BROKEN BAR)	U+FFE4(FULLWIDTH BROKEN BAR)

Ambiguities in conversion from Japanese EUC to [Unicode 3.2]/[ISO/IEC10646 (all parts)](2)
Octets in Japanese EUC	x-eucjp-open-19970715-0201	x-eucjp-open-19970715-ascii
0x5C(REVERSE SOLIDUS)	U+00A5(YEN SIGN)	U+005C(REVERSE SOLIDUS)
0x7E(TILDE)	U+203E(OVERLINE)	U+007E(TILDE)
0xA1B1(OVERLINE/NEGATION SIGN)	U+FFE3(FULLWIDTH MACRON)	U+203E(OVERLINE)
0xA1BD(FULLWIDTH EM DASH)	U+2014(EM DASH)	U+2014(EM DASH)
0xA1C0(REVERSE SOLIDUS)	U+005C(REVERSE SOLIDUS)	U+FF3C(FULLWIDTH REVERSE SOLIDUS)
0xA1C1(WAVE DASH)	U+301C(WAVE DASH)	U+301C(WAVE DASH)
0xA1C2(DOUBLE VERTICAL LINE)	U+2016(DOUBLE VERTICAL LINE)	U+2016(DOUBLE VERTICAL LINE)
0xA1DD(MINUS SIGN)	U+2212(MINUS SIGN)	U+2212(MINUS SIGN)
0xA1EF(YEN SIGN)	U+FFE5(FULLWIDTH YEN SIGN)	U+00A5(YEN SIGN)
0xA1F1(CENT SIGN)	U+00A2(CENT SIGN)	U+00A2(CENT SIGN)
0xA1F2(POUND SIGN)	U+00A3(POUND SIGN)	U+00A3(POUND SIGN)
0xA2CC(NOT SIGN)	U+00AC(NOT SIGN)	U+00AC(NOT SIGN)
0x8FA2B7(TILDE)	U+007E(TILDE)	U+FF5E(FULLWIDTH TILDE)
0x8FA2C3(BROKEN BAR)	U+00A6(BROKEN BAR)	U+00A6(BROKEN BAR)

E Conversion tables for Shift-JIS and Japanese EUC (Non-Normative)

Conversion tables represented in the format of [Character Mapping Markup Language] are referenced from this appendix.

E.1 Shift-JIS

E.1.1 Conversion table fragments

E.1.2 Conversion tables for Shift-JIS

E.2 EUC-JP

E.2.1 Conversion table fragments

G Non-normative references (Non-Normative)

Character Mapping Markup Language: The Unicode Consortium. Character Mapping Markup Language, Unicode Technical Report 22, http://www.unicode.org/unicode/reports/tr22/
Murata et al.: Murata, Dürst, Nicol. Recommendation for the charset parameter: a mechanism for specifying character encoding schemes, http://www.fxis.co.jp/DMS/sgml/xml/saloon/html_correct_charset.html , 1998

H Changes from the first edition W3C Note 14 April 2000 (Non-Normative)

References are updated.
The charset "windows-31j" is used instead of "X-sjis-cp932".
Fixed syntactically incorrect charset names in the previous version.

Unicode	utf-16
Unicode	utf-8
Shift-JIS	x-sjis-unicode-0_9
	x-sjis-jisx0221-1995
	windows-31J
	x-sjis-jdk1_1_7
Japanese EUC (Compressed)	x-eucjp-unicode-0_9
	x-eucjp-jisx0221-1995
	x-eucjp-open-19970715-ms
	x-eucjp-open-19970715-0201
	x-eucjp-open-19970715-ascii
ISO-2022-JP	x-iso2022jp-unicode-0_9
	x-iso2022jp-jisx0221-1995
	x-iso2022jp-cp932
	x-iso2022jp-jdk1_1_7
	x-iso2022jp-19970715-ascii

XML Japanese Profile (Second Edition)

W3C Member Submission 24 March 2005

Abstract

Status of this document

Table of Contents

1. Scope

2. Normative References

3. Definitions

3.1 Japanese Characters

3.2 Coded Character Sets

3.3 Character Encoding Schemes

3.4 Charsets

3.5 XML Constituents

4. Coded Character Sets

4.1 JIS X 0208:1978

4.2 Compatibility Characters

5. Character Encoding Schemes

5.1 UTF-16

5.2 UTF-8

5.3 Shift-JIS

5.4 Japanese EUC (Compressed)

5.5 ISO-2022-JP

6. Charset Names

6.1 Charsets for an XML document containing Japanese characters

6.2 Code Conversion during Transmission

6.3 Storing transmitted XML constituents to files for information interchange

7. XML Constituents in Files for Information Interchange

8. Delivering XML Constituents by HTTP 1.1

9. Delivering XML Constituents via EMAIL or NEWS

10. Avoiding Conversion Ambiguities

11. The xml:lang Attribute

Appendices

A Name Characters (Non-Normative)

B Needs for Japanese XML profile (Non-Normative)

C Ambiguities in conversion from Shift-JIS to Unicode (Non-Normative)

D Ambiguities in conversion from Japanese EUC to Unicode (Non-Normative)

D.1 Range 0x20-0x7E ([US-ASCII] or [JIS X 0201])

D.2 Range 0x8EA1-0x8EDF(Halfwidth Katakana)

D.3 Ambiguities in encoding the characters of [JIS X 0208] and [JIS X 0212]

D.4 Ambiguities in IBM Extended Characters, NEC Special Characters, and User-defined Characters

E Conversion tables for Shift-JIS and Japanese EUC (Non-Normative)

E.1 Shift-JIS

E.1.1 Conversion table fragments

E.1.2 Conversion tables for Shift-JIS

E.2 EUC-JP

E.2.1 Conversion table fragments

E.2.2 Conversion tables for Japanese EUC

F Charset parameters in HTTP1.1 (Non-Normative)

G Non-normative references (Non-Normative)

H Changes from the first edition W3C Note 14 April 2000 (Non-Normative)