<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE spec SYSTEM "xmlspec-CM.dtd">
<?xml-stylesheet type="text/xsl" href="xmlspec-CM.xsl"?>
<spec w3c-doctype="wd">
  <header>
   <title>Character Model for the World Wide Web 1.0</title>
	 <w3c-designation>WD</w3c-designation>
   <w3c-doctype>W3C Working Draft</w3c-doctype>
	 <pubdate><day>30</day><month>April</month><year>2002</year></pubdate>
	 <publoc>
    <loc href="http://www.w3.org/TR/2002/WD-charmod-20020430/">http://www.w3.org/TR/2002/WD-charmod-20020430</loc>
	 (available in <loc href="http://www.w3.org/TR/2002/WD-charmod-20020430/Overview.xml">XML</loc>, <loc href="http://www.w3.org/TR/2002/WD-charmod-20020430/">HTML</loc>, and as a <loc href="http://www.w3.org/TR/2002/WD-charmod-20020430/charmod.zip">Zip archive</loc>)</publoc>
	 <latestloc>
    <loc href="http://www.w3.org/TR/charmod/">http://www.w3.org/TR/charmod</loc>
	 </latestloc>
	 <prevlocs>
    <loc href="http://www.w3.org/TR/2002/WD-charmod-20020220/">http://www.w3.org/TR/2002/WD-charmod-20020220</loc></prevlocs>
	 <authlist>
		<author><name>Martin J. Dürst</name><affiliation>W3C</affiliation><email href="mailto:duerst@w3.org">duerst@w3.org</email>
		</author>
		<author><name>François Yergeau</name><affiliation>Alis
			 Technologies</affiliation>
		</author>
		<author><name>Richard Ishida</name><affiliation>Xerox Global Services</affiliation><email href="mailto:richard.ishida@gbr.xerox.com">richard.ishida@gbr.xerox.com</email>

		</author>
		<author><name>Misha Wolf</name><affiliation>Reuters
			 Ltd.</affiliation><email href="mailto:misha.wolf@reuters.com">misha.wolf@reuters.com</email>
		</author>
		<author><name>Asmus Freytag</name><affiliation>ASMUS,
			 Inc.</affiliation><email href="mailto:asmus@unicode.org">asmus@unicode.org</email>
		</author>
		<author><name>Tex Texin</name><affiliation>Progress Software
			 Corp.</affiliation><email href="mailto:texin@progress.com">texin@progress.com</email>
		</author>
	 </authlist>
	 <abstract id="abstract">
		<p>This Architectural Specification provides authors of specifications,
		  software developers, and content developers with a common reference for
		  interoperable text manipulation on the World Wide Web. Topics addressed include
		  encoding identification, early uniform normalization, string identity matching,
		  string indexing, and URI conventions, building on the Universal Character Set,
		  defined jointly by Unicode and ISO/IEC 10646. Some introductory material on
		  characters and character encodings is also provided.</p>
	 </abstract>
	 <status id="status">
		<p><emph>This section describes the status of this document at the time
		  of its publication. Other documents may supersede this document. The latest
		  status of this series of documents is maintained at the W3C.</emph></p>
		<p>This is a second Last Call Working Draft for review by W3C Members and 
other interested parties. The Last Call period begins 30 April 2002 and 
ends 31 May 2002.</p><p>This working draft attempts to address review comments that were received 
during the <xspecref href="http://www.w3.org/TR/2001/WD-charmod-20010126/">initial Last Call</xspecref> period, which started 26 January 2001, and 
also incorporates other modifications resulting from continuing 
collaboration with other working groups and continuing work within the <loc href="http://www.w3.org/International/Group/">W3C 
Internationalization Working Group (I18N WG)</loc> (Members only). A
<loc href="http://www.w3.org/International/Group/charmod-lc/">list of comments</loc>
(Members only) with their status is available.</p><p>The I18N WG invites comments on this specification. Due to the 
architectural nature of this document, it affects not only a large number of W3C 
Working Groups, but also software developers, content developers, and 
writers and users of specifications outside the W3C that have to interface 
with W3C specifications.
Because review comments play an important role in ensuring a high-quality specification, we encourage 
readers to review this Last Call Working Draft carefully. Comments 
should preferably be submitted via
the <loc href="http://www.w3.org/2002/05/charmod/LastCall">Last Call Comment Form</loc> (http://www.w3.org/2002/05/charmod/LastCall). Comments may alternatively be submitted by email to <loc href="mailto:www-i18n-comments@w3.org">www-i18n-comments@w3.org</loc> (<loc href="http://lists.w3.org/Archives/Public/www-i18n-comments/">public archive</loc>). In this case, please send one email per comment where possible; otherwise, number the comments clearly.</p><p>This document is published as part of the <loc href="http://www.w3.org/International/Activity">W3C Internationalization Activity</loc> 
by the Internationalization Working Group, with the help of the 
Internationalization Interest Group. The Internationalization Working Group 
will not allow early implementation to constrain its ability to make 
changes to this specification prior to final release. Publication as a 
Working Draft does not imply endorsement by the W3C Membership. It is 
inappropriate to use W3C Working Drafts as reference material or to cite 
them as other than "work in progress".  A
		  list of current
		  <loc href="http://www.w3.org/TR/">W3C Recommendations and other technical
			 documents</loc> can be found at <loc href="http://www.w3.org/TR/">http://www.w3.org/TR/</loc>.</p>

	 </status><langusage><language id="en">en</language></langusage>
	 <revisiondesc>
    <p>$Id: Overview.xml,v 1.3 2002/04/30 13:07:04 duerst Exp $</p>
	 </revisiondesc></header>
  <body>
	 <div1 id="sec-Intro"><head>Introduction</head>
		<div2 id="sec-GoalsScope">
		  <head>Goals and Scope</head>
		  <p>The goal of this document is to facilitate use of the Web by all
			 people, regardless of their language, script, writing system, and cultural
			 conventions, in accordance with the
			 <titleref href="http://www.w3.org/Consortium/#goals">W3C goal of universal
			 access</titleref>. One basic prerequisite to achieve this goal is to be able to
			 transmit and process the characters used around the world in a well-defined and
			 well-understood way.</p>
		  <p>The main target audience of this document is W3C specification
			 developers. This document defines conformance requirements for other W3C
			 specifications. This document and parts of it can also be referenced from other
			 W3C specifications.</p>
		  <p>Other audiences of this document include software developers,
			 content developers, and authors of specifications outside the W3C. Software
			 developers and content developers implement and use W3C specifications; this
			 document defines some conformance requirements for them, and helps them to
			 understand the character-related provisions in other W3C specifications.</p>
		  <p>The character model described in this document provides authors of
			 specifications, software developers, and content developers with a common
			 reference for consistent, interoperable text manipulation on the World Wide
			 Web. Working together, these three groups can build a more international
			 Web.</p>
		  <p>Topics addressed include encoding identification, early uniform
			 normalization, string identity matching, string indexing, and URI conventions.
			 Some introductory material on characters and character encodings is also
			 provided.</p>
		  <p>Topics not addressed or barely touched include collation (sorting),
			 fuzzy matching and language tagging. Some of these topics may be addressed in a
			 future version of this specification.</p>
		  <p>At the core of the model is the Universal Character Set (UCS),
			 defined jointly by The Unicode Standard <bibref ref="unicode"/> and ISO/IEC
			 10646 <bibref ref="iso10646"/>. In this document, <term>Unicode</term> is used
			 as a synonym for the Universal Character Set. The model will allow Web
			 documents authored in the world's scripts (and on different platforms) to be
			 exchanged, read, and searched by Web users around the world.</p>
		  <p>All W3C specifications must conform to this document (see section
			 <specref ref="sec-Conformance"/>). Authors of other specifications (for
			 example, IETF specifications) are strongly encouraged to take guidance from
			 it.</p>
		  <p>Since other W3C specifications will be based on some of the
			 provisions of this document, without repeating them, software developers
			 implementing W3C specifications must conform to these provisions.</p>
		</div2>
		<div2 id="sec-Background">
		  <head>Background</head>
		  <p>This section provides some historical background on the topics
			 addressed in this document.</p>
		  <p>Starting with <titleref>Internationalization of the Hypertext Markup
			 Language</titleref> <bibref ref="rfc2070"/>, the Web community has recognized
			 the need for a character model for the World Wide Web. The first step towards
			 building this model was the adoption of Unicode as the document character set
			 for HTML.</p>
		  <p>The choice of Unicode was motivated by the fact that Unicode:
			 <ulist>
				<item>
				  <p>is the only universal character repertoire available,</p>
				</item>
				<item>
				  <p>covers the widest possible range,</p>
				</item>
				<item>
				  <p>provides a way of referencing characters independent of the
					 encoding of a resource,</p>
				</item>
				<item>
				  <p>is being updated/completed carefully,</p>
				</item>
				<item>
				  <p>is widely accepted and implemented by industry.</p>
				</item>
			 </ulist></p>
		  <p>W3C adopted Unicode as the document character set for HTML in
			 <bibref ref="html40"/>. The same approach was later used for specifications
			 such as XML 1.0 <bibref ref="xml10"/> and CSS2 <bibref ref="css2"/>. Unicode
			 now serves as a common reference for W3C specifications and applications.</p>
		  <p>The IETF has adopted some policies on the use of character sets on
			 the Internet (see <bibref ref="rfc2277"/>).</p>
		  <p>While data transfer on the Web remained mostly unidirectional (from
			 server to browser), and the main purpose was to render documents, the use
			 of Unicode without specifying additional details was sufficient. However, the
			 Web has grown:
			 <ulist>
				<item>
				  <p>Data transfers among servers, proxies, and clients, in all
					 directions, have increased.</p>
				</item>
				<item>
				  <p>Non-ASCII characters <bibref ref="MIME"/> are being used in
					 more and more places.</p>
				</item>
				<item>
				  <p>Data transfers between different protocol/format elements
					 (such as element/attribute names, URI components, and textual content) have
					 increased.</p>
				</item>
				<item>
				  <p>More and more APIs are defined, not just protocols and
					 formats.</p>
				</item>
			 </ulist></p>
		  <p>In short, the Web may be seen as a single, very large application
			 (see <bibref ref="Nicol"/>), rather than as a collection of small independent
			 applications.</p>
		  <p>While these developments strengthen the requirement that Unicode be
			 the basis of a character model for the Web, they also create the need for
			 additional specifications on the application of Unicode to the Web. Some
			 aspects of Unicode that require additional specification for the Web include:
			 <ulist>
				<item>
				  <p>Choice of encoding forms (UTF-8, UTF-16, UTF-32).</p>
				</item>
				<item>
				  <p>Counting characters, measuring string length in the presence
					 of variable-length encodings and combining characters.</p>
				</item>
				<item>
				  <p>Duplicate encodings (e.g. precomposed vs decomposed).</p>
				</item>
				<item>
				  <p>Use of control codes for various purposes (e.g.
					 bidirectionality control, symmetric swapping).</p>
				</item>
			 </ulist></p>
		  <p>It should be noted that such properties also exist in legacy
			 encodings (where <term>legacy encoding</term> is taken to mean any character
			 encoding not based on Unicode), and in many cases have been inherited by
			 Unicode in one way or another from such legacy encodings.</p>
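		  <p>As a rough illustration of the duplicate-encoding and counting issues
			 listed above, the following Python sketch compares a precomposed and a
			 decomposed spelling of the same letter. The helper names code_points and
			 equivalent are hypothetical, and the use of Normalization Form C for the
			 comparison is an assumption made for this example only.</p>
<eg xml:space="preserve"><![CDATA[
```python
import unicodedata

# Two spellings of the same letter (illustrative values).
precomposed = "\u00E9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def code_points(text):
    """List the code points of text in U+hhhh notation."""
    return ["U+{:04X}".format(ord(ch)) for ch in text]


def equivalent(a, b):
    """Compare two strings after normalizing both to NFC (assumed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(code_points(precomposed), code_points(decomposed))
print(precomposed == decomposed, equivalent(precomposed, decomposed))
```
]]></eg>
		  <p>The two strings differ in length and fail a code point by code point
			 comparison, yet they compare as equal once both are normalized.</p>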
		  <p>The remainder of this document presents additional specifications
			 and requirements to ensure an interoperable character model for the Web, taking
			 into account earlier work (from W3C, ISO and IETF).</p><p>For information about the requirements that informed the development
		  of important parts of this specification, see <titleref>Requirements for String
		  Identity Matching and String Indexing</titleref> <bibref ref="CharReq"/>.</p>
		</div2>
		<div2 id="sec-Notation"><head>Terminology and Notation</head>
		  <p id="def-recipient-producer">For the purpose of this specification, the <term>producer</term> of
			 text data is the sender of the data in the case of protocols, and the tool that
			 produces the data in the case of formats. The <term>recipient</term> of text
			 data is the software module that receives the data.</p>
		  <note>
			 <p>A software module may be both a recipient and a producer.</p>
		  </note>
		  <p>Unicode code points are denoted as U+hhhh, where "hhhh" is a
			 sequence of at least four, and at most six hexadecimal digits.</p>
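		  <p>The notation can be sketched in Python as follows. The function names
			 format_code_point and notate are hypothetical, and the use of upper-case
			 hexadecimal digits is an assumption based on the code points written out
			 elsewhere in this document.</p>
<eg xml:space="preserve"><![CDATA[
```python
def format_code_point(cp):
    """Format an integer code point as U+hhhh, using four to six hex digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: {!r}".format(cp))
    # At least four digits, zero-padded; larger values use five or six digits.
    return "U+{:04X}".format(cp)


def notate(text):
    """Write out every code point of text, separated by spaces."""
    return " ".join(format_code_point(ord(ch)) for ch in text)


print(notate("\u0E44\u0E01"))
```
]]></eg>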
		</div2>
	 </div1>
	 <div1 id="sec-Conformance"><head>Conformance</head>
		<p>In this document, requirements are expressed using the key words
		  "<rfc2119>MUST</rfc2119>", "<rfc2119>MUST NOT</rfc2119>",
		  "<rfc2119>REQUIRED</rfc2119>", "<rfc2119>SHALL</rfc2119>" and "<rfc2119>SHALL
		  NOT</rfc2119>". Recommendations are expressed using the key words
		  "<rfc2119>SHOULD</rfc2119>", "<rfc2119>SHOULD NOT</rfc2119>" and
		  "<rfc2119>RECOMMENDED</rfc2119>" (see the note below). "<rfc2119>MAY</rfc2119>" and
		  "<rfc2119>OPTIONAL</rfc2119>" are used to indicate optional features or
		  behaviour. These keywords are used in accordance with RFC 2119
		  <bibref ref="rfc2119"/>.</p>
		<note>
			<p>RFC 2119 makes it clear that requirements that use <rfc2119>SHOULD</rfc2119> are not optional and should be complied with unless there are specific reasons not to: <quote>This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.</quote></p></note>
	<p>This specification places conformance requirements on specifications,
		  on software and on Web content. To aid the reader, all requirements are
		  preceded by <qterm>[X]</qterm> where <qchar>X</qchar> is one of <qchar>S</qchar> for specifications, <qchar>I</qchar> for software
		  implementations, and <qchar>C</qchar> for Web content. These markers indicate the relevance
		  of the requirement and allow the reader to quickly locate relevant requirements
		  using the browser's search function.
		  <req><req-type>S</req-type><req-type>I</req-type><req-type>C</req-type><req-text>In
		  order to conform to this document, specifications <rfc2119>MUST NOT</rfc2119>
		  violate any requirements preceded by [S], software <rfc2119>MUST NOT</rfc2119>
		  violate any requirements preceded by [I], and content <rfc2119>MUST
		  NOT</rfc2119> violate any requirements preceded by [C].</req-text></req></p>
		<p><req><req-type>S</req-type><req-text>Every W3C specification
		  <rfc2119>MUST</rfc2119>:</req-text>
		  <olist>
			 <item>
				<p>conform to the requirements applicable to specifications,</p>
			 </item>
			 <item>
				<p>specify that implementations <rfc2119>MUST</rfc2119> conform to
				  the requirements applicable to software, and</p>
			 </item>
			 <item>
				<p>specify that content created according to that specification
				  <rfc2119>MUST</rfc2119> conform to the requirements applicable to content.</p>
			 </item>
		  </olist></req> </p>
		<p><req><req-type>S</req-type><req-text>If an existing W3C specification
		  does not conform to the requirements in this document, then the next version of
		  that specification <rfc2119>SHOULD</rfc2119> be modified in order to
		  conform.</req-text></req></p>
		<p><req><req-type>I</req-type><req-text>Where this specification contains
		  a procedural description, it <rfc2119>MUST</rfc2119> be understood as a way to
		  specify the desired external behavior. Implementations <rfc2119>MAY</rfc2119>
		  use other ways of achieving the same results, as long as observable behavior is
		  not affected.</req-text></req></p>
	 </div1>
	 <div1 id="sec-Characters"><head>Characters</head>
		<div2 id="sec-Perceptions"><head>Perceptions of Characters</head>
		  <div3 id="sec-PerceptionsIntro"><head>Introduction</head>
			 <p>The glossary entry in <bibref ref="unicode30"/> gives:</p>
			 <p><quote>Character. (1) The smallest component of written language
				that has semantic value; refers to the abstract meaning and/or shape
				...</quote></p>
			 <p>The word <qterm>character</qterm> is used in many contexts, with
				different meanings. Human cultures have radically differing writing systems,
				leading to radically differing concepts of a character. Such wide variation in
				end user experience can, and often does, result in misunderstanding. This
				variation is sometimes mistakenly seen as the consequence of imperfect
				technology. Instead, it derives from the great flexibility and creativity of
				the human mind and the long tradition of writing as an important part of the
				human cultural heritage. The alphabetic approach used by scripts such as Latin,
				Cyrillic and Greek is only one of several possibilities.</p><example><p>Japanese
			 hiragana and katakana are syllabaries. A character in these scripts corresponds
			 to a syllable (usually a combination of consonant plus vowel).</p></example> <example><p>Korean Hangul is a featural syllabary that combines symbols for
			 individual sounds of the language into square syllabic blocks. Depending on the
			 user and the application, either the individual symbols or the syllabic
			 clusters can be considered to be characters.</p></example> <example><p>Indic scripts
			 are abugidas. Each consonant letter carries an inherent vowel that is
			 eliminated or replaced using semi-regular or irregular ways to combine
			 consonants and vowels into clusters. Depending on the user and the application,
			 either individual consonants or vowels, or the consonant or consonant-vowel
			 clusters can be perceived as characters.</p></example><example><p>Arabic script is
			 an example of an abjad. Short vowel sounds are typically not written at all.
			 When they are written they are indicated by the use of combining marks placed
			 above and below the consonantal letters.</p></example>
			 <p>The developers of W3C specifications, and the developers of
				software based on those specifications, are likely to be more familiar with
				usages they have experienced and less familiar with the wide variety of usages
				in an international context. Furthermore, within a computing context,
				characters are often confused with related concepts, resulting in incomplete or
				inappropriate specifications and software.</p>
			 <p>This section examines some of these contexts, meanings and
				confusions.</p>
		  </div3>
		  <div3 id="sec-WritingSystem"><head>Units of aural rendering</head>
			 <p>In some scripts, characters have a close relationship to phonemes
				(a <term>phoneme</term> is a minimally distinct sound in the context of a
				particular spoken language), while in others they are closely related to
				meanings. Even when characters (loosely) correspond to phonemes, this
				relationship may not be simple, and there is rarely a one-to-one correspondence
				between character and phoneme.</p><example><p>In the English sentence,
			 <quote>They were too close to the door to close it.</quote> the same character
			 <qchar>s</qchar> is used to represent both /s/ and /z/ phonemes.</p></example> <example><p>In many scripts a single character may represent a sequence of
			 phonemes, such as the syllabic characters of Japanese hiragana.</p></example> <example><p>In many writing systems a sequence of characters may represent a
			 single phoneme, for example <qchar>wr</qchar> and <qchar>ng</qchar> in
			 <quote>writing</quote>.</p></example>
			 <p><req><req-type>S</req-type><req-type>I</req-type><req-text>Specifications
				and software <rfc2119>MUST NOT</rfc2119> assume that there is a one-to-one
				correspondence between characters and the sounds of a
				language.</req-text></req></p>
		  </div3>
		  <div3 id="sec-VisualRenderingUnits"><head>Units of visual
				rendering</head>
			 <p id="def-glyph">Visual rendering introduces the notion of a <emph>glyph</emph>.
				A <term>glyph</term> is defined by ISO/IEC 9541-1 <bibref ref="iso9541"/> as
				<quote>a recognizable abstract graphic symbol which is independent of a
				specific design</quote>. There is <emph>not</emph> a one-to-one correspondence
				between characters and glyphs:
				<ulist>
				  <item>
					 <p>A single character can be represented by multiple glyphs
						(each glyph is then part of the representation of that character). These glyphs
						may be physically separated from one another. </p>
				  </item>
				  <item>
					 <p>A single glyph may represent a sequence of characters (this
						is the case with ligatures, among others).</p>
				  </item>
				  <item>
					 <p>A character may be rendered with very different glyphs
						depending on the context.</p>
				  </item>
				  <item>
					 <p>A single glyph may represent different characters (e.g.
						capital Latin A, capital Greek A and capital Cyrillic A).</p>
				  </item>
				</ulist></p>
			 <p>Each glyph can be represented by a number of different glyph
				images; a set of glyph images makes up a <term>font</term>. Glyphs can be
				construed as the basic units of organization of the visual rendering of text,
				just as characters are the basic unit of organization of encoded text.</p>
			 <p><req><req-type>S</req-type><req-type>I</req-type><req-text>Specifications
				and software <rfc2119>MUST NOT</rfc2119> assume a one-to-one mapping between
				character codes and units of displayed text.</req-text></req></p>
			 <p>See the appendix <specref ref="sec-CharExamples"/> for examples of the
				complexities of character to glyph mapping.</p>
			 <p>Some scripts, in particular Arabic and Hebrew, are written from
				right to left. Text including characters from these scripts can run in both
				directions and is therefore called bidirectional text. The Unicode
				Standard <bibref ref="unicode"/> requires that characters be stored and
				interchanged in logical order. <req><req-type>S</req-type><req-text>Protocols,
				data formats and APIs <rfc2119>MUST</rfc2119> store, interchange or process
				text data in logical order.</req-text></req></p>
			 <p>In the presence of bidirectional text, two possible
				selection modes must be considered. The first is <term>logical selection
				mode</term>, which selects all the characters <emph>logically</emph> located
				between the end-points of the user's mouse gesture. Here the user selects from
				between the first and second letters of the second word to the middle of the
				number. Logical selection looks like this:</p>

<figure>
<table border="1" cellspacing="0" cellpadding="5" summary="Two images contrasting a single logical selection in memory and the resulting two selections on screen, in a bidi context">
<tbody>
 <tr>
  <th>In memory</th>
  <td align="center"><image><graphic source="images/logSelMemory.gif" width="323" height="27"/><alt>In the example used, logical selection is depicted as one highlighted range of characters in memory.</alt></image></td>
 </tr>
 <tr>
  <th>On screen</th>
  <td align="center"><image><graphic source="images/logSelScreen.gif" width="144" height="32"/><alt>The same example as it appears on screen, showing two highlighted ranges of characters.</alt></image></td>
 </tr>
</tbody>
</table>
</figure>

<p>It is a consequence of the bidirectionality of the text that a
				single, continuous logical selection in memory results in a <emph>discontinuous
				selection appearing on the screen</emph>. This discontinuity, as well as the
				somewhat unintuitive behavior of the cursor, makes some users prefer a
				<term>visual selection mode</term>, which selects all the characters
				<emph>visually</emph> located between the end-points of the user's mouse
				gesture. With the same mouse gesture as before, we now obtain:</p>

<figure>
<table border="1" cellpadding="5" cellspacing="0" summary="Two images contrasting a single visual selection on screen and the resulting two selections in memory, in a bidi context">
<tbody>
 <tr>
  <th>In memory</th>
  <td align="center"><image><graphic source="images/visSelMemory.gif" width="343" height="27"/><alt>In the example used, visual selection is depicted as two highlighted ranges of characters in memory.</alt></image></td>
 </tr>
 <tr>
  <th>On screen</th>
  <td align="center"><image><graphic source="images/visSelScreen.gif" width="141" height="33"/><alt>The same example again, now showing visual selection on screen as a single highlighted range of characters.</alt></image></td>
 </tr>
</tbody>
</table>
</figure>
			 <p>In this mode, a single visual selection range results in
				<emph>two</emph> logical ranges, which have to be accommodated by protocols,
				APIs and implementations.</p>
			 

			 <p><req><req-type>S</req-type><req-text>Specifications of protocols
				and APIs that involve selection of ranges <rfc2119>SHOULD</rfc2119> provide for
				discontiguous selections, at least to the extent necessary to support
				implementation of visual selection on screen on top of those protocols and
				APIs.</req-text></req></p>
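			 <p>The relationship between one visual range and several logical ranges
				can be sketched in Python as follows. This is a minimal sketch under
				stated assumptions: the function name logical_ranges is hypothetical, and
				the mapping from visual positions to logical indices is assumed to be
				supplied by an existing bidirectional layout implementation.</p>
<eg xml:space="preserve"><![CDATA[
```python
def logical_ranges(visual_to_logical, start, end):
    """Map the visual selection [start, end) to merged logical ranges.

    visual_to_logical[i] is the logical index of the character displayed at
    visual position i (assumed to come from a bidi layout implementation).
    Each returned range is a half-open (first, last_plus_one) pair.
    """
    selected = sorted(visual_to_logical[start:end])
    ranges = []
    for index in selected:
        if ranges and ranges[-1][1] == index:
            # Extend the current range by one character.
            ranges[-1] = (ranges[-1][0], index + 1)
        else:
            # Start a new, discontiguous range.
            ranges.append((index, index + 1))
    return ranges


# Logical text "abc DEF", where DEF stands for right-to-left characters;
# on screen it is displayed as "abc FED".
mapping = [0, 1, 2, 3, 6, 5, 4]
print(logical_ranges(mapping, 2, 5))
```
]]></eg>
			 <p>In this hypothetical example, a single visual selection covering the
				last left-to-right letter, the space and the first displayed right-to-left
				letter yields two logical ranges, which an API would need to be able to
				report.</p>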
		  </div3>
		  <div3 id="sec-InputUnits"><head>Units of input</head>
			 <p>In keyboard input, it is <emph>not</emph> always the case that
				keystrokes and input characters correspond one-to-one. A limited number of keys
				can fit on a keyboard. Some keyboards will generate multiple characters from a
				single keypress. In other cases (<qterm>dead keys</qterm>) a key will generate
				no characters, but affect the results of subsequent keypresses. Many writing
				systems have far too many characters to fit on a keyboard and must rely on more
				complex <term>input methods</term>, which transform keystroke sequences into
				character sequences. Other languages may make it necessary to input some
				characters with special modifier keys. See <specref ref="sec-CharExamples"/>
				for examples of non-trivial input.</p>
			 <p><req><req-type>S</req-type><req-type>I</req-type><req-text>Specifications
				and software <rfc2119>MUST NOT</rfc2119> assume that a single keystroke results
				in a single character, nor that a single character can be input with a single
				keystroke (even with modifiers), nor that keyboards are the same all over the
				world.</req-text></req></p>
		  </div3>
		  <div3 id="sec-CollationUnits"><head>Units of collation</head>
			 <p>String comparison as used in sorting and searching is based on
				units which do not in general have a one-to-one relationship to encoded
				characters. Such string comparison can aggregate a character sequence into a
				single <term>collation unit</term> with its own position in the sorting order,
				can separate a single character into multiple collation units, and can
				distinguish various aspects of a character (case, presence of diacritics, etc.)
				to be sorted separately (multi-level sorting).</p>
			 <p>In addition, a certain amount of pre-processing may also be
				required, and in some languages (such as Japanese and Arabic) sort order may be
				governed by higher order factors such as phonetics or word roots. Collation
				methods may also vary by application.</p><example><p>In traditional Spanish sorting, the letter sequences
			 <qchar>ch</qchar> and <qchar>ll</qchar> are treated as atomic collation units.
			 Although Spanish sorting, and to some extent Spanish everyday use, treat
			 <qchar>ch</qchar> as a single unit, current digital encodings treat it as two
			 letters, and keyboards do the same (the user types <qchar>c</qchar>, then
			 <qchar>h</qchar>).</p></example><example><p>In most languages, the letter
			 <qchar>æ</qchar> is sorted as two consecutive collation units: <qchar>a</qchar>
			 and <qchar>e</qchar>.</p></example><example><p>The sorting of text written in a
			 bicameral script (i.e. a script which has distinct upper and lower case
			 letters) is usually required to ignore case differences in a first pass; case
			 is then used to break ties in a later pass.</p></example><example><p>Treatment of
			 accented letters in sorting is dependent on the script or language in question.
			 The letter <qchar>ö</qchar> is treated as a modified <qchar>o</qchar> in
			 French, but as a letter completely independent from <qchar>o</qchar> (and
			 sorting after <qchar>z</qchar>) in Swedish. In German certain applications
			 treat the letter <qchar>o</qchar> as if it were the sequence
			 <qchar>oe</qchar>.</p></example><example><p>In Thai the sequence U+0E44 U+0E01 must
			 be sorted as if it was written U+0E01 U+0E44. Reordering is typically done
			 during an initial pre-processing stage.</p></example>
			 <example><p>German dictionaries typically sort <qchar>ä</qchar>, <qchar>ö</qchar> and <qchar>ü</qchar> together with <qchar>a</qchar>, <qchar>o</qchar> and <qchar>u</qchar> respectively. On the other hand, German telephone books typically sort <qchar>ä</qchar>, <qchar>ö</qchar> and <qchar>ü</qchar> as if they were spelled <qchar>ae</qchar>, <qchar>oe</qchar> and <qchar>ue</qchar>. Here the application is affecting the collation algorithm used.</p></example><p><req><req-type>S</req-type><req-type>I</req-type><req-text>Software
				that sorts or searches text for users <rfc2119>MUST</rfc2119> do so on the
				basis of appropriate collation units and ordering rules for the relevant
				language and/or application.</req-text></req></p>
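		  <p>The traditional Spanish example above can be sketched in Python as
			 follows. This is a deliberately simplified, single-level illustration and
			 not a substitute for a real collation library: the names ALPHABET,
			 collation_units and sort_key are hypothetical, case is ignored, and any
			 unit outside the listed alphabet is simply sorted last.</p>
<eg xml:space="preserve"><![CDATA[
```python
# Traditional Spanish alphabet, with "ch" and "ll" as units of their own.
ALPHABET = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "\u00F1", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
    "y", "z",
]
WEIGHT = {unit: rank for rank, unit in enumerate(ALPHABET)}


def collation_units(word):
    """Split a word into collation units, keeping "ch" and "ll" together."""
    word = word.lower()
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ("ch", "ll"):
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def sort_key(word):
    """Sort key: one weight per collation unit; unknown units sort last."""
    return [WEIGHT.get(unit, len(ALPHABET)) for unit in collation_units(word)]


words = ["chico", "cuna", "dama", "cosa"]
print(sorted(words))                 # code point order
print(sorted(words, key=sort_key))   # traditional Spanish order
```
]]></eg>
		  <p>With this key, words beginning with <qchar>ch</qchar> sort after all
			 other words beginning with <qchar>c</qchar>, whereas sorting by code point
			 places them among those words.</p>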
		  <p>Note that, where searching or sorting is done dynamically, particularly in
a multilingual environment, the 'relevant language' should be determined to be that of
the current user, and may thus differ from user to user.  <req><req-type>S</req-type><req-type>I</req-type><req-text>Software
that allows users to sort or search text <rfc2119>SHOULD</rfc2119> allow the user to select
alternative rules for collation units and ordering.</req-text></req></p><p><req><req-type>S</req-type><req-type>I</req-type><req-text>When sorting and searching in the context of a particular language, it <rfc2119>MUST</rfc2119> be possible to deal gracefully with strings
being compared that contain Unicode characters  not normally associated with that language.</req-text></req> A default collation order for all Unicode characters can be obtained
from ISO/IEC 14651 <bibref ref="iso14651"/> or from Unicode Technical Report #10, the Unicode Collation Algorithm <bibref ref="UTR10"/>. This default ordering can be used in conjunction with rules tailored for a particular locale
to ensure a predictable ordering and comparison of strings, whatever
characters they include.</p></div3>
		  <div3 id="sec-Storage"><head>Units of storage</head>
			 <p>Computer storage and communication rely on units of physical
				storage and information interchange, such as bits and bytes (bytes are also
				known as octets, since nowadays the word byte is generally taken to mean an
				8-bit byte). A frequent error in specifications and implementations is to
				equate characters with units of physical storage. The mapping between
				characters and such units of storage is actually quite complex, and is
				discussed in the next section, <specref ref="sec-Digital"/>.</p>
			 <p><req><req-type>S</req-type><req-type>I</req-type><req-text>Specifications
				and software <rfc2119>MUST NOT</rfc2119> assume a one-to-one relationship
				between characters and units of physical storage.</req-text></req></p>
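<example><p>A short sketch of why the one-to-one assumption fails. The helper name <code>storage_units</code> and the sample text are hypothetical; the sketch assumes a Python 3 runtime, where the length of a string is counted in Unicode code points.</p>
<eg><![CDATA[
```python
def storage_units(text):
    """Count the same text in characters and in several units of storage."""
    return {
        "characters": len(text),                        # Unicode code points
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-be")) // 2,
        "utf32_bytes": len(text.encode("utf-32-be")),
    }


# Sample text: Latin letters, one accented letter, the euro sign and one
# character outside the Basic Multilingual Plane (U+233B4).
sample = "na\u00efve \u20ac5 \U000233B4"
units = storage_units(sample)
```
]]></eg>
<p>For this sample the four counts are all different, so a length limit or an index expressed in one unit cannot be reinterpreted in another.</p></example>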
		  </div3>
		  <div3 id="sec-PerceptionsOutro"><head>Summary</head>
			 <p>The term <term>character</term> is used differently in a variety
				of contexts and often leads to confusion when used outside of these contexts.
				In the context of the digital representations of text, a character can be
				defined informally as a small logical unit of text. <term>Text</term> is then
				defined as sequences of characters. While such an informal definition is
				sufficient to create or capture a common understanding in many cases, it is
				also sufficiently open to create misunderstandings as soon as details start to
				matter. In order to write effective specifications, protocol implementations,
				and software for end users, it is very important to understand that these
				misunderstandings can occur.</p>
			 <p><req><req-type>S</req-type><req-text>When specifications use the
				term <qterm>character</qterm> it <rfc2119>MUST</rfc2119> be clear which of the
				possible meanings they intend.</req-text></req>
				<req><req-type>S</req-type><req-text>Specifications <rfc2119>SHOULD</rfc2119>
				avoid the use of the term <qterm>character</qterm> if a more specific term is
				available.</req-text></req></p>
		  </div3>
		</div2>
		<div2 id="sec-Digital"><head>Digital Encoding of Characters</head>
		  <p>To be of any use in computers, in computer communications and in
			 particular on the World Wide Web, characters must be encoded. In fact, much of
			 the information processed by computers over the last few decades has been
			 encoded text, exceptions being images, audio, video and numeric data. To
			 achieve text encoding, a large variety of encoding schemes have been devised,
			 which can loosely be defined as mappings between the character sequences that
			 users manipulate and the sequences of bits that computers manipulate.</p>
		  <p>Given the complexity of text encoding and the large variety of
			 schemes for character encoding invented throughout the computer age, a more
			 formal description of the encoding process is useful. The process of defining a
			 text encoding can be described as follows (see Unicode Technical Report #17:
Character Encoding Model <bibref ref="UTR17"/> for a more
			 detailed description):
			 <olist>
				<item id="def-repertoire">
				  <p>A set of characters to be encoded is identified. The
					 characters are pragmatically chosen to express text and to efficiently allow
					 various text processes in one or more target languages. They may not correspond
					 precisely to what users perceive as letters and other characters. The set of
					 characters is called a <term>repertoire</term>.</p>
				</item>
				<item id="def-CCS">
				  <p>Each character in the repertoire is then associated with a
					 (mathematical, abstract) non-negative integer, the <term>code point</term>
					 (also known as a <term>character number</term> or <term>code position</term>).
					 The result, a mapping from the repertoire to the set of non-negative integers,
					 is called a <term>coded character set (CCS)</term>.</p>
				</item>
				<item id="def-CEF">
				  <p>To enable use in computers, a suitable base datatype is
					 identified (such as a byte, a 16-bit unit of storage or other) and a
					 <term>character encoding form (CEF)</term> is used, which encodes the abstract
					 integers of a <acronym title="Coded Character Set">CCS</acronym> into sequences
					 of the <term>code units</term> of the base datatype. The encoding form can be
					 extremely simple (for instance, one which encodes the integers of the
					 <acronym title="Coded Character Set">CCS</acronym> into the natural
					 representation of integers of the chosen datatype of the computing platform) or
					 arbitrarily complex (a variable number of code units, where the value of each
					 unit is a non-trivial function of the encoded integer). </p>
				</item>
				<item id="def-CES">
				  <p>To enable transmission or storage using byte-oriented devices,
					 a <term>serialization scheme</term> or <term>character encoding scheme
					 (CES)</term> is next used. A <acronym title="Character Encoding Scheme">CES</acronym> is a mapping of the code units
					 of a <acronym title="Character Encoding Form">CEF</acronym> into well-defined
					 sequences of bytes, taking into account the necessary specification of
					 byte-order for multi-byte base datatypes and including in some cases switching
					 schemes between the code units of multiple
					 <acronym title="Character Encoding Scheme">CES</acronym>es (an example is ISO
					 2022). A <acronym title="Character Encoding Scheme">CES</acronym>, together
					 with the <acronym title="Coded Character Set">CCS</acronym>es it is used with,
					 is identified by an <acronym title="Internet Assigned Numbers Authority">IANA</acronym> charset identifier.
					 Given a sequence of bytes representing text and a <kw>charset</kw> identifier,
					 one can in principle unambiguously recover the sequence of characters of the
					 text.</p>
				</item>
			 </olist></p>
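<example><p>The last three steps of the process above can be sketched for UTF-16. The function names are hypothetical, the repertoire and coded character set are taken to be those of Unicode, and the encoding form assumes its input is a valid code point that is not itself a surrogate.</p>
<eg><![CDATA[
```python
import struct


def code_point(character):
    """CCS: map a character of the repertoire to a non-negative integer."""
    return ord(character)


def utf16_code_units(cp):
    """CEF: map a code point to one or two 16-bit code units."""
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def serialize(code_units, big_endian=True):
    """CES: map 16-bit code units to bytes in an explicit byte order."""
    byte_order = ">" if big_endian else "<"
    return struct.pack(byte_order + "H" * len(code_units), *code_units)


euro = code_point("\u20ac")                       # 0x20AC
euro_units = utf16_code_units(euro)               # one code unit
euro_be = serialize(euro_units, big_endian=True)
euro_le = serialize(euro_units, big_endian=False)
```
]]></eg>
<p>The same code unit yields two different byte sequences depending on byte order, which is why a byte stream needs a character encoding scheme and not only an encoding form.</p></example>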
      <note><p>See <specref ref="sec-EncodingIdent"/> for a discussion of the term <qterm>charset</qterm>.</p></note><note>
			 <p>The term <qterm>character encoding</qterm> is somewhat ambiguous,
				as it is sometimes used to describe the actual process of encoding characters
				and sometimes to denote a particular way to perform that process (as in
				<quote>this file is in the X character encoding</quote>). Context normally
				allows the distinction of those uses, once one is aware of the ambiguity.</p>
		  </note>

		  <p>In very simple cases, the whole encoding process can be collapsed to
			 a single step, a trivial one-to-one mapping from characters to bytes; this is
			 the case, for instance, for US-ASCII <bibref ref="MIME"/> and ISO-8859-1.</p>
		  <p id="Unicode_Encoding_Form">Text is said to be in a
			 <term>Unicode encoding form</term> if it is encoded in UTF-8, UTF-16 or
			 UTF-32.</p>
		</div2>
		<div2 id="sec-Transcoding"><head>Transcoding</head>
		  <p id="def-transcoding"><term>Transcoding</term> is the process of converting text from
			 one
			 <termref def="def-CEF">Character Encoding Form</termref> to another.
			 Transcoders work only at the level of character encoding and do not parse the
			 text; consequently, they do not deal with <termref def="sec-Escaping">character escapes</termref> such as numeric
			 character references (see <specref ref="sec-Escaping"/>) and do not adjust
			 embedded character encoding information (for instance in an XML declaration or
			 in an HTML <el>meta</el> element).</p>
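<example><p>A minimal sketch of a transcoder in the sense defined above, using the codecs built into Python. The function name <code>transcode</code> and the sample document are hypothetical. The sketch illustrates that neither the numeric character reference nor the embedded encoding declaration is touched.</p>
<eg><![CDATA[
```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one character encoding to another.

    The text is never parsed, so escapes and embedded labels pass through
    unchanged.
    """
    return data.decode(source_encoding).encode(target_encoding)


legacy = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<p>caf\u00e9 &#233;</p>"
).encode("iso-8859-1")

converted = transcode(legacy, "iso-8859-1", "utf-8")
```
]]></eg>
<p>After conversion the bytes are UTF-8, yet the declaration still reads ISO-8859-1. Correcting that label is the job of software that understands the format, not of the transcoder.</p></example>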
		  <note>
			 <p>Transcoding may involve one-to-one, many-to-one, one-to-many or
				many-to-many mappings. In addition, the storage order of characters varies
				between encodings: some, such as Unicode, prescribe logical ordering while
				others use visual ordering; among encodings that have separate diacritics, some
				prescribe that they be placed before the base character, some after. Because of
				these differences in sequencing characters, transcoding may involve reordering:
				thus XYZ may map to yxz.</p>
		  </note>
		  <p id="def-normalizing-transcoder">A <term>normalizing
			 transcoder</term> is a transcoder that converts from a legacy encoding to a
			 <termref def="Unicode_Encoding_Form">Unicode encoding form</termref> <emph>and</emph> ensures that the result is in Unicode
       Normalization Form C (see <specref ref="sec-UnicodeNormalized"/>). For most
			 legacy encodings, it is possible to construct a normalizing transcoder; it is
			 not possible to do so if the encoding's <termref def="def-repertoire">repertoire</termref> contains characters not in
			 Unicode.</p>
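<example><p>A normalizing transcoder can be sketched by adding a normalization step to an ordinary transcoder. The function name is hypothetical. windows-1258 is chosen as the legacy encoding only because it stores some accents as separate combining characters; the sketch assumes the runtime provides a <code>cp1258</code> codec, as standard Python distributions do.</p>
<eg><![CDATA[
```python
import unicodedata


def normalizing_transcode(data, legacy_encoding, target_encoding="utf-8"):
    """Transcode legacy bytes to a Unicode encoding form in NFC."""
    text = data.decode(legacy_encoding)
    return unicodedata.normalize("NFC", text).encode(target_encoding)


# "a" followed by COMBINING ACUTE ACCENT, as stored in windows-1258.
legacy_bytes = "a\u0301".encode("cp1258")

plain = legacy_bytes.decode("cp1258").encode("utf-8")     # not normalized
normalized = normalizing_transcode(legacy_bytes, "cp1258")
```
]]></eg>
<p>The plain transcoder preserves the two-character sequence, while the normalizing transcoder emits the single precomposed character U+00E1.</p></example>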
		</div2>
		<div2 id="sec-Strings"><head>Strings</head>
		  <p>Various specifications use the notion of a <qterm>string</qterm>,
			 sometimes without defining precisely what is meant and sometimes defining it
			 differently from other specifications. The reason for this variability is that
			 there are in fact multiple reasonable definitions for a string, depending on
			 one's intended use of the notion; the term <qterm>string</qterm> is used for
			 all these different notions because these are actually just different views of
			 the same reality: a piece of text stored inside a computer. This section
			 provides specific definitions for different notions of <qterm>string</qterm>
			 which may be reused elsewhere.</p>
		  <p id="def-byte-string"><term>Byte string</term>: A string viewed as a
			 sequence of bytes representing characters in a particular encoding. This
			 corresponds to a <termref def="def-CES">CES</termref>. As a definition for a
			 string, this definition is most often useless, except when the textual nature
			 is unimportant and the string is considered only as a piece of opaque data with
			 a length in bytes. <req><req-type>S</req-type><req-text>Specifications in
			 general <rfc2119>SHOULD NOT</rfc2119> define a string as a <qterm>byte
			 string</qterm>.</req-text></req> </p>
		  <p id="def-physical-string"><term>Code unit string</term>: A string
			 viewed as a sequence of <termref def="def-CEF">code units</termref> representing characters in a particular
			 encoding. This corresponds to a <termref def="def-CEF">CEF</termref>. This
			 definition is useful in APIs that expose a physical representation of string
			 data. Example: For the DOM <bibref ref="dom1"/>, UTF-16 was chosen based on
			 widespread implementation practice.</p>
		  <p id="def-character-string"><term>Character string</term>: A string
			 viewed as a sequence of characters, each represented by a <termref def="def-CCS">code point</termref> in Unicode
			 <bibref ref="unicode"/>. This is usually what programmers consider to be a
			 string, although it may not match exactly what most users perceive as
			 characters. This is the highest layer of abstraction that ensures
			 interoperability with very low implementation effort.
			 <req><req-type>S</req-type><req-text>The <qterm>character string</qterm>
			 definition of a string is generally the most useful and
			 <rfc2119>SHOULD</rfc2119> be used by most specifications, following the
			 examples of Production [2] of XML 1.0 <bibref ref="xml10"/>, the SGML
			 declaration of HTML 4.0 <bibref ref="html401"/>, and the character model of RFC
			 2070 <bibref ref="rfc2070"/>. </req-text></req></p>

<example><p>Consider the string <image><graphic source="images/surrogateDiffQcaron.gif" width="66" height="26"/><alt>Chinese character for 'stump of tree' does not equal Latin small letter q with combining caron</alt></image> comprising the characters U+233B4 (a Chinese character meaning 'stump
of tree'), U+2260 <uname>NOT EQUAL TO</uname>, U+0071
<uname>LATIN SMALL LETTER Q</uname> and U+030C <uname>COMBINING CARON</uname>,
encoded in UTF-16 in big-endian byte order. The rows of the following table show the
string viewed as a character string, code unit string and byte string, respectively:</p>

<figure>
 <table border="1" cellpadding="5" cellspacing="0" summary="table displaying a string viewed as characters, code units and bytes">
  <tbody>
   <tr align="center">
    <th align="right">Character string</th>
    <td colspan="4">U+233B4 <image><graphic source="images/chineseSurrogate.gif" width="24" height="25"/><alt>Archaic Chinese character meaning "the stump of a tree" (still in current use in Cantonese)</alt></image></td>
    <td colspan="2">U+2260 <image><graphic source="images/not_equal.gif" width="25" height="26"/><alt>NOT EQUAL TO</alt></image></td>
    <td colspan="2">U+0071 <image><graphic source="images/Q.gif" width="14" height="21"/><alt>LATIN SMALL LETTER Q</alt></image></td>
    <td colspan="2">U+030C <image><graphic source="images/caron.gif" width="14" height="21"/><alt>COMBINING CARON</alt></image></td>
   </tr>
   <tr align="center">
    <th align="right">Code unit string</th>
    <td colspan="2">D84C</td>
    <td colspan="2">DFB4</td>
    <td colspan="2">2260</td>
    <td colspan="2">0071</td>
    <td colspan="2">030C</td>
   </tr>
   <tr align="center">
    <th align="right">Byte string</th>
    <td>D8</td>
    <td>4C</td>
    <td>DF</td>
    <td>B4</td>
    <td>22</td>
    <td>60</td>
    <td>00</td>
    <td>71</td>
    <td>03</td>
    <td>0C</td>
   </tr>
  </tbody>
 </table>
</figure>
</example>
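<example><p>The three rows of the table can be recomputed with a few lines of code. This is a sketch assuming Python 3, whose strings are sequences of code points; the variable names are hypothetical.</p>
<eg><![CDATA[
```python
# U+233B4, NOT EQUAL TO, LATIN SMALL LETTER Q, COMBINING CARON
text = "\U000233B4\u2260q\u030C"

# Character string: one entry per code point.
character_string = [f"U+{ord(ch):04X}" for ch in text]

# Byte string: UTF-16 serialized in big-endian byte order.
data = text.encode("utf-16-be")
byte_string = [f"{b:02X}" for b in data]

# Code unit string: the bytes regrouped into 16-bit units.
code_unit_string = [
    f"{int.from_bytes(data[i:i + 2], 'big'):04X}"
    for i in range(0, len(data), 2)
]
```
]]></eg>
<p>The string has four characters, five code units and ten bytes, so the answer to <quote>how long is this string?</quote> depends on which definition of string is in use.</p></example>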

<note id="def-grapheme-string">
<p>It is also possible to view a string as a sequence of <term>graphemes</term>. In this case the string
is divided into text units that correspond to the user's perception of where character boundaries occur in
a visually rendered text. However, there is no standard rule for the segmentation of text in this way, and
the segmentation will vary from language to language and even from user to user. Examples of possible
approaches can be found in sections 5.12 and 5.15 of the Unicode Standard, Version 3 <bibref ref="unicode30"/>. </p>
</note>

		</div2>
		<div2 id="sec-RefProcModel"><head>Reference Processing Model</head>
		  <p id="def-char-data">Many Internet protocols and data formats, most notably the very
			 important Web formats HTML, CSS and XML, are based on text. In those formats,
			 everything is text but the relevant specifications impose a structure on the
			 text, giving meaning to certain constructs so as to obtain functionality in
			 addition to that provided by <term>plain text</term> (text where no markup or programming language applies). HTML and XML are <term>markup
			 languages</term>, defining entities entirely composed of text but with
			 conventions allowing the separation of this text into <term>markup</term> and
			 <term>character data</term>. Citing from the XML 1.0 specification
			 <bibref ref="xml10"/>,
			 <xspecref href="http://www.w3.org/TR/2000/REC-xml-20001006#syntax">section
			 2.4</xspecref>:</p>
		  <p><quote>Text consists of intermingled character data and markup.
			 [...] All text that is not markup constitutes the character data
			 of the document.</quote></p>
		  <p>For the purposes of this section, the important aspect is that
			 everything is text, that is, a sequence of characters.</p>
		  <p id="def-ref-proc-model">Since its early days, the Web has seen the development of a
			 <term>Reference Processing Model</term>, first described for HTML in RFC 2070
			 <bibref ref="rfc2070"/>. This model was later embraced by XML and CSS. It is
			 applicable to any data format or protocol that is text-based as described
			 above. The essence of the Reference Processing Model is the use of Unicode as a
			 common reference. Use of the Reference Processing Model by a specification does
			 not, however, require that implementations actually use Unicode. The
			 requirement is only that the implementations behave as if the processing took
			 place as described by the Model.</p>
		  <p>A specification conforms to the Reference Processing Model if all of
			 the following apply:</p>
		  <ulist>
			 <item>
				<p><req><req-type>S</req-type><req-text>Specifications
				  <rfc2119>MUST</rfc2119> be defined in terms of Unicode characters, not bytes or
				  <termref def="def-glyph">glyphs</termref>.</req-text></req></p>
			 </item>
			 <item>
				<p><req><req-type>S</req-type><req-text>Specifications
				  <rfc2119>SHOULD</rfc2119> allow the use of the full range of Unicode <termref def="def-CCS">code
				  points</termref> from U+0000 to U+10FFFF inclusive; code points above U+10FFFF <rfc2119>MUST NOT</rfc2119> be used.</req-text></req></p>
			 </item>
			 <item>
				<p><req><req-type>S</req-type><req-text>Specifications
				  <rfc2119>MAY</rfc2119> allow use of any character encoding which can be
				  transcoded to Unicode for its text entities.</req-text></req></p>
			 </item>
			 <item>
				<p><req><req-type>S</req-type><req-text>Specifications
				  <rfc2119>MAY</rfc2119> choose to disallow or deprecate some encodings and to
				  make others mandatory. Independent of the actual encoding, the specified
				  behavior <rfc2119>MUST</rfc2119> be the same <emph>as if</emph> the processing
				  happened as follows:</req-text>
				  <ulist>
					 <item>
						<p>The encoding of any text entity received by the
						  application implementing the specification <rfc2119>MUST</rfc2119> be
						  determined and the text entity <rfc2119>MUST</rfc2119> be interpreted as a
						  sequence of Unicode characters; this <rfc2119>MUST</rfc2119> be equivalent to
						  <termref def="def-transcoding">transcoding</termref> the entity to some <termref def="Unicode_Encoding_Form">Unicode encoding form</termref>, adjusting any character
						  encoding label if necessary, and receiving it in that Unicode encoding
						  form.</p>
					 </item>
					 <item>
						<p>All processing <rfc2119>MUST</rfc2119> take place on this
						  sequence of Unicode characters.</p>
					 </item>
					 <item>
						<p>If text is output by the application, the sequence of
						  Unicode characters <rfc2119>MUST</rfc2119> be encoded using an encoding chosen
						  among those allowed by the specification.</p>
					 </item>
				  </ulist></req></p>
			 </item>
			 <item>
				<p><req><req-type>S</req-type><req-text>If a specification is such
				  that multiple text entities are involved (such as an XML document referring to
				  external parsed entities), it <rfc2119>MAY</rfc2119> choose to allow these
				  entities to be in different character encodings. In all cases, the
				  <termref def="sec-RefProcModel">Reference Processing Model</termref>
				  <rfc2119>MUST</rfc2119> be applied to all entities.</req-text></req></p>
			 </item>
		  </ulist>
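<example><p>The three processing steps listed above can be sketched as follows. The function name is hypothetical, upper-casing stands in for whatever processing a specification actually defines, and UTF-8 is assumed to be an output encoding the specification allows.</p>
<eg><![CDATA[
```python
def process_entity(data, declared_encoding, output_encoding="utf-8"):
    """Process a text entity as if it were a sequence of Unicode characters."""
    # Step 1: interpret the received bytes as Unicode characters.
    text = data.decode(declared_encoding)
    # Step 2: all processing takes place on the Unicode characters.
    processed = text.upper()
    # Step 3: encode the output in an encoding the specification allows.
    return processed.encode(output_encoding)


from_latin1 = process_entity("caf\u00e9".encode("iso-8859-1"), "iso-8859-1")
from_utf16 = process_entity("caf\u00e9".encode("utf-16"), "utf-16")
```
]]></eg>
<p>Both calls produce identical output, which is the point of the model: the observable behavior does not depend on the encoding in which the entity arrived.</p></example>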
		  <p><req><req-type>S</req-type><req-text>All specifications that involve
			 text <rfc2119>MUST</rfc2119> specify processing according to the
			 <termref def="sec-RefProcModel">Reference Processing
			 Model</termref>.</req-text></req></p>
		  <note>
			 <p>All specifications that derive from the XML 1.0 specification
				<bibref ref="xml10"/> automatically inherit this Reference Processing Model.
				XML is entirely defined in terms of Unicode characters and mandates the UTF-8
				and UTF-16 encodings while allowing any other encoding for parsed entities.</p>

		  </note>
		  <note>
			 <p>When specifications choose to allow encodings other than Unicode
				encodings, implementers should be aware that the correspondence between the
				characters of a legacy encoding and Unicode characters may in practice depend
				on the software used for <termref def="def-transcoding">transcoding</termref>. See the Japanese XML Profile
				<bibref ref="XML_Japanese_profile"/> for examples of such inconsistencies.</p>
		  </note>
		</div2>
		<div2 id="sec-Encodings"><head>Choice and Identification of Character
			 Encodings</head>
		  <p>Because encoded text <emph>cannot</emph> be interpreted and
			 processed without knowing the encoding, it is vitally important that the
			 character encoding   (see <specref ref="sec-Digital"/>) is known at all
			 times and places where text is exchanged or processed.
			   In what follows we use <qterm>character encoding</qterm> to mean either <termref def="def-CEF">CEF</termref> or <termref def="def-CES">CES</termref> depending on the context. When text transmitted as a byte stream is
involved, for instance in a protocol, specification of a CES is required
to ensure proper interpretation; in contexts such as an API, where the
environment (typically the processor architecture) specifies the byte
order of multibyte quantities, specification of a CEF suffices. <req><req-type>S</req-type><req-text>Specifications <rfc2119>MUST</rfc2119>
			 either specify a unique encoding, or provide character encoding identification
			 mechanisms such that the encoding of text can always be reliably
			 identified.</req-text></req> <req><req-type>S</req-type><req-text>When
			 designing a new protocol, format or API, specifications
			 <rfc2119>SHOULD</rfc2119> mandate a unique character
			 encoding.</req-text></req></p>
      <div3 id="sec-UniqueEncoding"><head>Mandating a unique character
				encoding</head>
			 <p>Mandating a unique character encoding is simple, efficient, and
				robust. There is no need for specifying, producing, transmitting, and
				interpreting encoding tags. At the receiver, the encoding will always be
				understood. There is also no ambiguity  as to which encoding to use if data is transferred
				non-electronically and later has to be converted back to a digital
				representation. Even when there is a need for compatibility with existing data,
				systems, protocols and applications, multiple encodings can often be dealt with
				at the boundaries or outside a protocol, format, or API. The
				<acronym title="Document Object Model">DOM</acronym> <bibref ref="dom1"/> is an
				example of where this was done. The advantages of choosing a unique encoding
				become more important the smaller the pieces of text used are and the closer to
				actual processing the specification is.</p>
			 <p><req><req-type>S</req-type><req-text>When a unique encoding is
				mandated, the encoding <rfc2119>MUST</rfc2119> be UTF-8, UTF-16 or
				UTF-32.</req-text></req> <req><req-type>S</req-type><req-text>If a unique
				encoding is mandated and compatibility with US-ASCII is desired, UTF-8 (see
				<bibref ref="rfc2279"/>) is <rfc2119>RECOMMENDED</rfc2119>.</req-text></req> In
				other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate.
				Possible reasons for choosing one of these include efficiency of internal
				processing and interoperability with other processes.</p>
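<example><p>The US-ASCII compatibility of UTF-8 can be checked directly. This is a small sketch with a hypothetical sample string, using the codecs built into Python.</p>
<eg><![CDATA[
```python
ascii_text = "Content-Type: text/plain"

# Text limited to the US-ASCII repertoire has the same bytes in UTF-8 ...
same_bytes_in_utf8 = ascii_text.encode("utf-8") == ascii_text.encode("us-ascii")

# ... but not in UTF-16, where every code unit occupies two bytes.
same_bytes_in_utf16 = ascii_text.encode("utf-16-be") == ascii_text.encode("us-ascii")
```
]]></eg>
</example>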
			 <note>
				<p>The IETF Charset Policy <bibref ref="rfc2277"/> specifies that
				  on the Internet <quote>Protocols MUST be able to use the UTF-8
				  charset</quote>.</p>
			 </note>
			 <note>
				<p>The XML 1.0 specification <bibref ref="xml10"/> requires all
				  conforming XML processors to accept both UTF-16 and UTF-8.</p>
			 </note>
		  </div3>
      <div3 id="sec-EncodingIdent"><head>Character encoding
				identification</head>
			 <p>The MIME Internet specification <bibref ref="MIME"/> provides a
				good example of a mechanism for character encoding identification. The MIME
				<kw>charset</kw> parameter definition is intended to supply sufficient
				information to uniquely decode the sequence of bytes of the received data into
				a sequence of characters. The values are drawn from the IANA charset registry
				<bibref ref="iana"/>.</p>
			 <note>
				<p>In practice there is wide variation among implementations, so
				  uniqueness cannot be depended upon. See the end of
				  <specref ref="sec-RefProcModel"/> for more information.</p>
			 </note>
			 <note id="def-charset">
				<p>The term <term>charset</term> derives from <qterm>character
				  set</qterm>, an expression with a long and tortured history (see
				  <bibref ref="connolly"/> for a discussion).</p>
			 </note>
			 <p><req><req-type>S</req-type><req-text>Specifications
				<rfc2119>SHOULD</rfc2119> avoid using the terms <qterm>character
				set</qterm> and <qterm>charset</qterm> to refer to a character
				encoding, except when the latter is used to refer to the MIME
				<kw>charset</kw> parameter or its IANA-registered values. The terms
				<qterm>character encoding</qterm>, <qterm>character encoding form</qterm> or <qterm>character encoding scheme</qterm>
				are <rfc2119>RECOMMENDED</rfc2119>.</req-text></req></p>
			 <note>
				<p>In XML, the XML declaration or the text declaration contains a
				  pseudo-attribute called <kw>encoding</kw> which identifies the character
				  encoding using an IANA charset name.</p>
			 </note>
			 <note>
				<p>Unfortunately, some charset identifiers do not represent a single,
unique encoding scheme.
Instead, these identifiers denote a number of slight variations of an
encoding scheme.
Even though slight, the differences may be crucial and may vary over
time.
For these identifiers, recovery of the character sequence from a byte
sequence is ambiguous.
For example, the character encoded as 0x5C in the Shift-JIS encoding scheme is ambiguous.
The character sometimes represents a <uname>YEN SIGN</uname> and sometimes represents a
<uname>REVERSE SOLIDUS</uname>.
See the Japanese XML Profile <bibref ref="XML_Japanese_profile"/> for more detail on this example and for
additional examples of such ambiguous charset identifiers.
</p>
			 </note><p>The IANA charset registry is the official list of names and
				aliases for character encodings on the Internet.</p>
			 <p><req><req-type>S</req-type><req-text>If the unique encoding
				approach is not taken, specifications <rfc2119>SHOULD</rfc2119> mandate the use
				of the IANA charset registry names, and in particular the names identified in
				the registry as <qterm>MIME preferred names</qterm>, to designate character
				encodings in protocols, data formats and APIs.</req-text></req>
				<req><req-type>S</req-type><req-text>The <qterm>x-</qterm> convention for
				unregistered character encoding names <rfc2119>SHOULD NOT</rfc2119> be used,
				having led to abuse in the past.</req-text></req> (Names beginning with <qterm>x-</qterm>
				remained in wide use for some character encodings long after an
				official registration existed.)
				<req><req-type>I</req-type><req-type>C</req-type><req-text>Content and software
				that label text data <rfc2119>MUST</rfc2119> use one of the names mandated
				by the appropriate specification (e.g. the XML specification when editing XML
				text) and <rfc2119>SHOULD</rfc2119> use the MIME preferred name of an encoding
				to label data in that encoding.</req-text></req>
				<req><req-type>I</req-type><req-type>C</req-type><req-text>An IANA-registered
				<kw>charset</kw> name <rfc2119>MUST NOT</rfc2119> be used to label text data
				in an encoding other than the one identified in the IANA registration of that
				name.</req-text></req></p>
			 <p><req><req-type>S</req-type><req-text>If the unique encoding
				approach is not chosen, specifications <rfc2119>MUST</rfc2119> designate at
				least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible
				encodings and <rfc2119>SHOULD</rfc2119> choose at least one of UTF-8 or UTF-16
				as mandated encoding forms (encoding forms that <rfc2119>MUST</rfc2119> be
				supported by implementations of the specification).</req-text></req>
				<req><req-type>S</req-type><req-text>Specifications <rfc2119>MAY</rfc2119>
				define either UTF-8 or UTF-16 as a default encoding form (or both if they
				define suitable means of distinguishing them), but they <rfc2119>MUST
				NOT</rfc2119> use any other character encoding as a default.</req-text></req>
				<req><req-type>S</req-type><req-text>Specifications <rfc2119>MUST NOT</rfc2119>
				propose the use of heuristics to determine the encoding of data.</req-text></req></p>
			 <p><req><req-type>I</req-type><req-text><emph>Receiving</emph>
				software <rfc2119>MUST</rfc2119> determine the encoding of data from available
				information according to appropriate specifications.</req-text></req>
				<req><req-type>I</req-type><req-text>When an IANA-registered <kw>charset</kw>
				name is recognized, receiving software <rfc2119>MUST</rfc2119> interpret the
				received data according to the encoding associated with the name in the IANA
				registry.</req-text></req> <req><req-type>I</req-type><req-text>When no charset
				is provided, receiving software <rfc2119>MUST</rfc2119> adhere to the default
				encoding(s) specified in the specification.</req-text></req></p>
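<example><p>A sketch of how receiving software might apply these rules to a MIME-style label. The function name and the default encoding are assumptions standing in for what a real specification would define, and Python's codec registry is used as an approximation of the IANA charset registry, not as a substitute for it.</p>
<eg><![CDATA[
```python
import codecs
from email.message import Message

# Assumption: the default encoding defined by a hypothetical specification.
DEFAULT_ENCODING = "utf-8"


def decode_received(data, content_type=None):
    """Decode bytes using the charset label if present, else the default."""
    charset = None
    if content_type is not None:
        header = Message()
        header["Content-Type"] = content_type
        charset = header.get_content_charset()   # None if no charset parameter
    if charset is None:
        charset = DEFAULT_ENCODING
    # Resolve aliases; raises LookupError for names the runtime does not know.
    codec = codecs.lookup(charset)
    return data.decode(codec.name)


labelled = decode_received(b"caf\xe9", "text/plain; charset=ISO-8859-1")
unlabelled = decode_received("caf\u00e9".encode("utf-8"), "text/plain")
```
]]></eg>
<p>No heuristics are involved: the label decides when it is present, and the specified default decides when it is not.</p></example>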
			 <p><req><req-type>I</req-type><req-text>Receiving software
				<rfc2119>MAY</rfc2119> recognize as many encodings (names and aliases) as
				appropriate.</req-text></req> A field-upgradeable mechanism may be appropriate
				for this purpose. Certain encodings are more or less associated with certain
				languages (e.g. Shift-JIS with Japanese); trying to support a given language or
				set of customers may mean that certain encodings have to be supported. The
				encodings that need to be supported may change over time. This document does
				not give any advice on which encoding may be appropriate or necessary for the
				support of any given language.</p>
			 <p><req><req-type>I</req-type><req-text>Software
				<rfc2119>MUST</rfc2119> completely implement the mechanisms for character
				encoding identification and <rfc2119>SHOULD</rfc2119> implement them in such a
				way that they are easy to use (for instance in HTTP servers).</req-text></req>
				<req><req-type>I</req-type><req-text>On interfaces to other protocols, software
				<rfc2119>SHOULD</rfc2119> support conversion between <termref def="Unicode_Encoding_Form">Unicode encoding forms</termref> as
				well as any other necessary conversions.</req-text></req></p>
			 <p><req><req-type>C</req-type><req-text>Content
				<rfc2119>MUST</rfc2119> make use of available facilities for character encoding
				identification by always indicating character encoding; where the facilities
				offered for character encoding identification include defaults (e.g. in XML 1.0
				<bibref ref="xml10"/>), relying on such defaults is sufficient to satisfy this
				identification requirement.</req-text></req></p>
			 <p>Because of the layered Web architecture (e.g. formats used over
				protocols), there may be multiple, and at times conflicting, pieces of information about
				character encoding. <req><req-type>S</req-type><req-text>Specifications
				<rfc2119>MUST</rfc2119> define conflict-resolution mechanisms (e.g. priorities)
				for cases where there are multiple or conflicting pieces of information about character
				encoding.</req-text></req>
				<req><req-type>I</req-type><req-type>C</req-type><req-text>Software and content
				<rfc2119>MUST</rfc2119> carefully follow conflict-resolution mechanisms where
				there are multiple or conflicting pieces of information about character
				encoding.</req-text></req></p>
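<example><p>A conflict-resolution mechanism based on priorities might be sketched as follows. The priority order, the source names and the function name are all hypothetical; each specification defines its own rules, and this sketch does not reproduce those of any particular one.</p>
<eg><![CDATA[
```python
# Hypothetical priority order, highest priority first.
PRIORITY = ("protocol", "document", "default")


def resolve_encoding(labels, default="utf-8"):
    """Pick the encoding label from the highest-priority source that has one.

    `labels` maps a source name to the encoding label it supplied.
    Returns a (source, encoding) pair.
    """
    candidates = dict(labels, default=default)
    for source in PRIORITY:
        value = candidates.get(source)
        if value:
            return source, value
    raise ValueError("no encoding information available")


conflict = resolve_encoding({"protocol": "iso-8859-1", "document": "utf-8"})
document_only = resolve_encoding({"document": "shift_jis"})
nothing = resolve_encoding({})
```
]]></eg>
<p>Because the order is fixed in advance, two implementations given the same conflicting labels reach the same decision.</p></example>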
		  </div3>
      <div3 id="sec-PrivateUse"><head>Private use code points</head>
			 <p>Unicode designates certain ranges of <termref def="def-CCS">code points</termref> for private use:
				the Private Use Area (U+E000-F8FF) and planes 15 and 16 (U+F0000-FFFFD and
				U+100000-10FFFD). These code points are guaranteed to never be allocated to
				standard characters, and are available for use by private agreement between a
				<termref def="def-recipient-producer">producer</termref> and a <termref def="def-recipient-producer">recipient</termref>. However, their use is strongly discouraged, since
				private agreements do not scale on the Web. Code points from different private
				agreements may collide, and a private agreement and therefore the meaning of
				the code points can quickly get lost.</p>
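<example><p>The private use ranges given above translate directly into a membership test. The names in this sketch are hypothetical; such a check could be used, for instance, to warn a producer before text that depends on a private agreement is published.</p>
<eg><![CDATA[
```python
# The Private Use Area and the private use ranges of planes 15 and 16.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(cp):
    """Return True if the code point lies in a private use range."""
    return any(low <= cp <= high for low, high in PRIVATE_USE_RANGES)


def private_use_code_points(text):
    """List the private use code points that occur in the text."""
    return [f"U+{ord(ch):04X}" for ch in text if is_private_use(ord(ch))]


found = private_use_code_points("a\uE000b\U0010FFFD")
```
]]></eg>
</example>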
			 <p><req><req-type>S</req-type><req-text>Specifications <rfc2119>MUST
				NOT</rfc2119> define any assignments of private use code
				points.</req-text></req> <req><req-type>S</req-type><req-text>Conformance to a
				specification <rfc2119>MUST NOT</rfc2119> require the use of private use area
				characters.</req-text></req>
				<req><req-type>S</req-type><req-text>Specifications <rfc2119>SHOULD
				NOT</rfc2119> provide mechanisms for agreement on private use code points
				between parties and <rfc2119>MUST NOT</rfc2119> require the use of such
				mechanisms.</req-text></req>
				<req><req-type>S</req-type><req-type>I</req-type><req-text>Specifications and
				implementations <rfc2119>SHOULD</rfc2119> be designed in such a way as to not
				disallow the use of private use code points by private
				arrangement.</req-text></req> As an example, XML does not disallow the use of
				private use code points.</p>
			 <p><req><req-type>S</req-type><req-text>Specifications
				<rfc2119>MAY</rfc2119> define <termref def="def-char-data">markup</termref> to allow the transmission of symbols not
				in Unicode or to identify specific variants of Unicode
				characters.</req-text></req></p><example><p>MathML (see <bibref ref="mathml2"/>
			 <xspecref href="http://www.w3.org/TR/2001/PR-MathML2-20010108/chapter3.html#presm_mglyph">section
			 3.2.9</xspecref>) defines an element <el>mglyph</el> for mathematical symbols
			 not in Unicode.</p></example><example><p>SVG (see <bibref ref="svg"/>
			 <xspecref href="http://www.w3.org/TR/2000/CR-SVG-20001102/text.html#AlternateGlyphs">section
			 10.14</xspecref>) defines an element <el>altglyph</el> which allows the
			 identification of specific display variants of Unicode characters.</p></example>
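			 <p>A minimal Python sketch of how software might detect private use code points, using only the three ranges listed at the start of this section. The function names are illustrative and are not part of any specification.</p>
<eg><![CDATA[
```python
from typing import List, Tuple

# Private use ranges as listed in this section (inclusive bounds).
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
)


def is_private_use(ch: str) -> bool:
    """True if the single character ch is a private use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def find_private_use(text: str) -> List[Tuple[int, str]]:
    """Return (index, 'U+XXXX') for every private use code point in text."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text) if is_private_use(ch)]
```
]]></eg>
			 <p>A check of this kind could be used to warn content authors, but in line with the requirements above it should not be used to reject content that uses private use code points by private arrangement.</p>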
		  </div3>
		</div2>
		<div2 id="sec-Escaping"><head>Character Escaping</head>
		  

<p id="def-syntax-significant">Markup or programming languages often designate certain characters as
<term>syntax-significant</term>, giving them specific functions within the language (e.g.
<qchar>&lt;</qchar> and <qchar>&amp;</qchar> serve as markup delimiters in HTML
and XML). As a consequence, these syntax-significant characters cannot be used to
represent themselves in text in the same way as all other characters do, creating the 
need for a mechanism to <quote>escape</quote> their syntax-significance. There is also a need,
often satisfied by the same or similar mechanisms, to express characters not directly
representable in the character encoding of instances of the language. The commonality
among escaping mechanisms is that they express characters at the level
of a language's syntax, which is itself expressed as characters represented at the
character encoding level.</p>

<p id="def-char-escape">Formally, a <term id="def-char-esc">character escape</term> is a syntactic device defined in a
markup or programming language that allows one or more of:

<olist>
 <item><p>expressing syntax-significant characters while disregarding their significance
  in the syntax of the language, or</p></item>
 <item><p>expressing characters not representable in the character encoding of
   an instance of the language, or</p></item>
 <item><p>expressing characters in general, without use of the corresponding
   character codes.</p></item>
</olist>
</p>

<p>Escaping a character means expressing it using such a construct, appropriate to the
format or protocol in which the character appears; <term>expanding a character escape</term> (or
<term>unescaping</term>) means replacing it with the character that it represents.</p>

<example><p>HTML and XML define <qterm>Numeric Character References</qterm> which allow
both the escaping of syntax-significance and the expression of arbitrary characters. Expressed
as &amp;#x3C; or &amp;#60; the character <qchar>&lt;</qchar> will not be parsed as
a markup delimiter.</p></example>

<example><p>The programming language Java uses <qchar>"</qchar> to delimit strings.
To express <qchar>"</qchar> within a string, one may escape it as <qchar>\"</qchar>.</p></example>

<example><p>XML defines <qterm>CDATA sections</qterm> which allow escaping the
syntax-significance of all characters between the CDATA section delimiters. CDATA sections
do not allow the expression of unrepresentable characters and in fact prevent their
expression using numeric character references.</p></example>
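<p>The two operations defined above, escaping and expanding a character escape, can be tried out with the <code>html</code> module of the Python standard library. This is only a sketch: the module follows HTML rules, which are assumed here to be close enough to illustrate the general idea.</p>
<eg><![CDATA[
```python
import html

# Escaping: syntax-significant characters are replaced by character escapes.
escaped = html.escape("a < b & c", quote=False)

# Expanding (unescaping): hexadecimal and decimal numeric character
# references are replaced by the character they represent.
expanded = html.unescape("&#x3C; and &#60;")

# Expanding the escaped text gives back the original characters.
round_trip = html.unescape(escaped)
```
]]></eg>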

          <p>Certain guidelines apply to the way specifications define character
			 escapes; they are addressed in this section. In addition, character
			 escapes have an impact on character normalization, to be addressed in
			 <specref ref="sec-IncludeNormalized"/>.</p>
			 
			 <p><req><req-type>S</req-type><req-text>The guidelines in this document
			 relating to the
			 <loc href="#sec-Escaping">definition of character escapes</loc>
			 <rfc2119>MUST</rfc2119> be followed when designing new W3C protocols and
			 formats and <rfc2119>SHOULD</rfc2119> be followed as much as possible when
			 revising existing protocols and formats.</req-text></req></p>
		  <ulist>
			 <item>
				<p><req><req-type>S</req-type><req-text>Specifications
				  <rfc2119>MUST NOT</rfc2119> invent a new escaping mechanism if an appropriate
				  one already exists. </req-text></req></p>
			 </item>
			 <item>
				<p><req><req-type>S</req-type><req-text>The number of different
				  ways to escape a character <rfc2119>SHOULD</rfc2119> be minimized (ideally to
				  one).</req-text></req> [A well-known counter-example is that for historical
				  reasons, both HTML and XML have redundant decimal (&amp;#ddddd;) and
				  hexadecimal (&amp;#xhhhh;) character escapes.]</p>
			 </item>
			 <item>
				<p><req><req-type>S</req-type><req-text>Escape syntax <rfc2119>SHOULD</rfc2119> either 
				  require explicit end delimiters or mandate a fixed number of characters in each character escape. 
				  Escape syntaxes where the end is determined by a character outside the set of characters
				  admissible in the character escape itself 
				  <rfc2119>SHOULD</rfc2119> be avoided.</req-text></req> These character escapes are not
				  clear visually, and can cause an editor to insert spurious line-breaks when
				  word-wrapping on spaces. Forms like SPREAD's &amp;UABCD; <bibref ref="spread"/>
				  or XML's &amp;#xhhhh;, where the character escape is explicitly terminated by a
				  semicolon, are much better. </p>
			 </item>
			 <item>
				<p><req><req-type>S</req-type><req-text>Whenever specifications
				  define character escapes that allow the representation of characters using a number, the
				  number <rfc2119>SHOULD</rfc2119> be in hexadecimal
				  notation.</req-text></req></p>
			 </item>
			 <item>
				<p><req><req-type>S</req-type><req-text>Escaped characters
				  <rfc2119>SHOULD</rfc2119> be acceptable wherever unescaped characters are; this
				  does not preclude that a <termref def="def-syntax-significant">syntax-significant</termref> character, when escaped, loses its
				  significance in the syntax. In particular, escaped characters
				  <rfc2119>SHOULD</rfc2119> be acceptable in identifiers and
				  comments.</req-text></req></p>
			 </item>
		  </ulist>
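		  <p>The advantage of an explicit end delimiter can be seen in a small Python sketch. The <code>%u</code> syntax below is hypothetical and deliberately flawed (a variable number of hexadecimal digits with no terminator); it is compared with the semicolon-terminated form used by XML.</p>
<eg><![CDATA[
```python
import re

# Hypothetical, deliberately flawed syntax: "%u" followed by 1 to 6
# hexadecimal digits, with no end delimiter.
UNTERMINATED = re.compile(r"%u([0-9A-Fa-f]{1,6})")

# XML-style hexadecimal escape, explicitly terminated by a semicolon.
TERMINATED = re.compile(r"&#x([0-9A-Fa-f]{1,6});")


def expand(pattern, text: str) -> str:
    """Replace each escape matched by pattern with the character it denotes.

    Values above 0x10FFFF are not handled by this sketch and raise ValueError.
    """
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)


# The author means U+00E9 followed by "cole" in both cases.
good = expand(TERMINATED, "&#xE9;cole")
bad = expand(UNTERMINATED, "%uE9cole")
```
]]></eg>
		  <p>In the unterminated form the letter <qchar>c</qchar> is itself a hexadecimal digit, so the escape is read as U+0E9C and the intended text is lost. The terminated form has no such ambiguity.</p>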

		  <p>Certain guidelines apply to content developers, as well as to
			 software that generates content:</p>
		  <ulist>
			 <item>
				<p><req><req-type>I</req-type><req-type>C</req-type><req-text>Escapes
				  <rfc2119>SHOULD</rfc2119> be avoided when the characters to be expressed are
				  representable in the character encoding of the document.</req-text></req></p>
			 </item>
			 <item>
				<p><req><req-type>I</req-type><req-type>C</req-type><req-text>Since
				  character set standards usually list character numbers as hexadecimal, content
				  <rfc2119>SHOULD</rfc2119> use the hexadecimal form of character escapes when there is
				  one.</req-text></req></p>
			 </item>
       <item>
        <p><req><req-type>I</req-type><req-type>C</req-type><req-text>Choose an encoding
         for the document that maximizes the opportunity to directly represent characters
         and minimizes the need to represent characters by <termref def="def-char-data">markup</termref> means such as character escapes.
         In general, if the first encoding choice is not satisfactory, Unicode is the next
         best choice, for its large character repertoire and its wide base of
         support.</req-text></req></p>
       </item>
		  </ulist>
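		  <p>The content guidelines above can be combined in a short Python sketch: characters are written directly whenever the document encoding can represent them, and only the remaining characters are escaped, in hexadecimal form. The error handler name <code>hexcharrefreplace</code> and the function name are made up for this illustration, and XML-style numeric character references are assumed as the escape syntax.</p>
<eg><![CDATA[
```python
import codecs


def _hex_ncr(error):
    """Codec error handler: replace unencodable characters by hex NCRs."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    chars = error.object[error.start:error.end]
    replacement = "".join(f"&#x{ord(c):X};" for c in chars)
    return replacement, error.end


# "hexcharrefreplace" is a name chosen for this sketch. Python's built-in
# "xmlcharrefreplace" handler is similar but emits decimal references.
codecs.register_error("hexcharrefreplace", _hex_ncr)


def escape_unrepresentable(text: str, encoding: str) -> bytes:
    """Encode text, escaping only what the encoding cannot represent."""
    return text.encode(encoding, errors="hexcharrefreplace")
```
]]></eg>
		  <p>With UTF-8 as the document encoding nothing needs to be escaped at all, which illustrates why a Unicode encoding minimizes the need for character escapes.</p>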
		</div2>
	 </div1>
	 <div1 id="sec-Normalization"><head>Early Uniform Normalization</head>
		<p>This chapter discusses text normalization for the Web.
		  <specref ref="sec-NormalizationMotivation"/> discusses the need for
		  normalization, and in particular early uniform normalization.
		  <specref ref="sec-TextNormalization"/> defines the various types of normalization and <specref ref="sec-NormalizationExamples"/> gives
		  supporting examples. <specref ref="sec-NormalizationApplication"/> assigns responsibilities
		  to various components and situations. The requirements for early uniform normalization are discussed in <titleref>Requirements for String Identity Matching</titleref> <bibref ref="CharReq"/>, <loc href="http://www.w3.org/TR/WD-charreq#3">section 3</loc>.</p>

<div2 id="sec-NormalizationMotivation"><head>Motivation</head>

<div3 id="sec-WhyNormalization"><head>Why do we need character normalization?</head><p>Text in computers can be encoded in one of many encodings. In addition, some encodings allow multiple representations for the <qterm>same</qterm> string and Web languages have escape mechanisms that introduce even more equivalent representations. For instance, in ISO 8859-1 the letter <qchar>ç</qchar>
can only be represented as the single character E7 <qchar>ç</qchar>, in a Unicode encoding it can be represented as the single character U+00E7 <qchar>ç</qchar> <emph>or</emph> the sequence U+0063 <qchar>c</qchar> U+0327
<qchar>¸</qchar>, and in HTML it could be additionally represented as <code>&amp;ccedil;</code> or <code>&amp;#xE7;</code> or <code>&amp;#231;</code>.</p><p>There are a number of fundamental operations that are sensitive to these multiple representations: string  matching, indexing, searching, sorting, regular
expression matching, selection, etc. In particular, the proper functioning of the Web (and of much other software) depends to a large extent on
 string  matching.  Examples of string
 matching abound: parsing element and attribute names in Web documents, matching CSS selectors to
the nodes in a document, matching font names in a style sheet to the names known to the operating system, matching
URI pieces to the resources in a server, matching strings embedded in an ECMAScript program to strings typed in by
a Web form user, matching the parts of an XPath expression (element names, attribute names and values, content, etc.)
to what is found in an instance, etc.</p>

<p>String
matching is usually taken for granted and performed by comparing two strings byte for byte, but the
existence on the Web of multiple character  representations means that it is actually non-trivial. Binary comparison
<emph>does not work</emph> if the
strings are not in the same encoding (e.g. an EBCDIC style sheet being directly applied to an ASCII document, or a font
specification in a Shift-JIS style sheet directly used on a system that maintains font names in UTF-16) or if
they are in the same encoding but show variations allowed for the <qterm>same</qterm> string by the use of combining characters or by the constructs of the Web language.</p>
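<p>Both failure cases of binary comparison can be reproduced in a few lines of Python. The sketch below uses ISO 8859-1 and UTF-8 as example encodings; the variable names are illustrative.</p>
<eg><![CDATA[
```python
composed = "\u00E7"      # c with cedilla as one precomposed character
decomposed = "c\u0327"   # c followed by COMBINING CEDILLA

# Case 1: the same character in two different encodings.
latin1_bytes = composed.encode("iso-8859-1")
utf8_bytes = composed.encode("utf-8")
differs_across_encodings = latin1_bytes != utf8_bytes

# Case 2: the same encoding, but two representations of the "same" string.
utf8_decomposed = decomposed.encode("utf-8")
differs_within_encoding = utf8_bytes != utf8_decomposed
```
]]></eg>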

<p>Incorrect string matching can have far-reaching
			 consequences, including the creation of security holes. Consider a contract,
			 encoded in XML, for buying goods: each item sold is described in an
			 <el>artículo</el> element; unfortunately, <quote>artículo</quote> is subject to
			 different representations in the character encoding of the contract. Suppose
			 that the contract is viewed and signed by means of a user agent that looks for
			 <el>artículo</el> elements, extracts them (matching on the element name),
			 presents them to the user and adds up their prices. If different instances of
			 the <el>artículo</el> element happen to be represented differently in a
			 particular contract, then the buyer and seller may see (and sign) different
			 contracts if their respective user agents perform string identity matching
			 differently, which is fairly likely in the absence of a well-defined
			 specification for string matching. The absence of a well-defined specification would also mean that
			 there would be no way to resolve the ensuing contractual dispute.</p>

<p>Solving the string matching problem involves normalization, which in a nutshell means bringing the two strings
to be compared to a common, canonical encoding prior to performing binary matching. (For additional steps involved in string matching see <specref ref="sec-IdentityMatching"/>.)</p>
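<p>The idea of bringing two strings to a common form before comparing them can be sketched in Python as follows. Normalization Form C is assumed as the common form, and the function name <code>matches</code> is illustrative only.</p>
<eg><![CDATA[
```python
import unicodedata


def matches(a: bytes, a_encoding: str, b: bytes, b_encoding: str) -> bool:
    """Decode both byte strings, bring them to NFC, then compare."""
    a_text = unicodedata.normalize("NFC", a.decode(a_encoding))
    b_text = unicodedata.normalize("NFC", b.decode(b_encoding))
    return a_text == b_text
```
]]></eg>
<p>This sketch normalizes at comparison time purely for demonstration. It shows what has to happen somewhere, not where in the architecture it should happen.</p>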
</div3>

<div3 id="sec-EarlyUniformNormalization"><head>The choice of early uniform normalization</head>

<p>There are options in the exact way normalization can be used to achieve correct behaviour of
normalization-sensitive operations such as string matching. These options lie along two axes:</p><p>The
first axis is a choice of <emph>when</emph> normalization occurs: early (when strings are created) or late
(when strings are compared).  The former amounts to establishing a canonical encoding for all data that is
transmitted or stored, so that it doesn't need any normalization later, before being used. The latter is
the equivalent of mandating <qterm>smart</qterm> compare functions, which will take care of any encoding
differences.</p>

<p>This document specifies <emph>early</emph> normalization. The reasons for that choice are manifold:
 <ulist>
  <item>
    <p>Almost all legacy data as well as data created by current
     software is normalized (if using <termref def="sec-ChoiceNFC">NFC</termref>).</p>
  </item>
  <item>
    <p>The number of Web components that generate or transform text
     is considerably smaller than the number of components that receive text and
     need to perform matching or other processes requiring normalized text.</p>
  </item>
  <item>
    <p>Current receiving components (browsers, XML parsers, etc.)
     implicitly assume early normalization by not performing or verifying normalization
     themselves. This is a vast legacy.</p>
  </item>
  <item>
    <p>Web components that generate and process text are in a much
     better position to do normalization than other components; in particular, they
     may be aware that they deal with a restricted repertoire only, which simplifies the process of normalization.</p>
  </item>
  <item>
    <p>Not all components of the Web that implement functions such as
     string matching can reasonably be expected to do normalization. This, in
     particular, applies to very small components and components in the lower layers
     of the architecture.</p>
  </item>
  <item>
    <p>Forward-compatibility issues can be dealt with more easily:
     less software needs to be updated, namely only the software that generates
     newly introduced characters.</p>
  </item>
  <item>
    <p>It improves matching in cases where the character encoding is
     partly undefined, such as URIs <bibref ref="rfc2396"/> in which non-ASCII bytes
     have no defined meaning.</p>
  </item>
  <item>
    <p>It is a prerequisite for comparison of encrypted strings (see
     <bibref ref="CharReq"/>,
     <loc href="http://www.w3.org/TR/WD-charreq#2.7">section
      2.7</loc>).</p>
  </item>
 </ulist></p>

<p>The second axis is a choice of canonical encoding. This choice need only be made if early normalization
is chosen. With late normalization, the canonical encoding would be an internal matter of the smart compare function, which doesn't need any wide agreement or standardization.</p>

<p>By choosing a single canonical encoding, it is ensured that
normalization is uniform throughout the Web. Hence the two axes lead us
to the name <qterm>early uniform normalization</qterm>.</p>
		  </div3>

<div3 id="sec-ChoiceNFC"><head>The choice of Normalization Form C</head>

<p>The Unicode Consortium provides four standard normalization forms (see <titleref>Unicode Normalization
Forms</titleref> <bibref ref="UTR15"/>).  These forms differ in 1) whether they normalize towards decomposed characters (NFD, NFKD) or precomposed characters (NFC, NFKC) and 2) whether they normalize away compatibility distinctions (NFKD, NFKC) or not (NFD, NFC).</p>

<p>For use on the Web, it is important not to lose the so-called compatibility distinctions, which may be important (see <bibref ref="UXML"/> for a discussion).  The 'K' normalization forms are therefore excluded.  Among the remaining two forms, NFC has the advantage that almost all legacy data (if transcoded trivially, one-to-one) as well as data created by current software is already in this form; NFC also has a slight compactness advantage and a better match to user expectations with respect to the character vs <termref def="def-grapheme-string">grapheme</termref> issue.  This document therefore chooses NFC as the base for Web-related text normalization.</p>

<note>
<p>Roughly speaking, <term>NFC</term> is defined such that each combining character sequence (a base character
followed by one or more combining characters) is replaced, as far as possible, by a canonically
equivalent precomposed character. Text in a <termref def="Unicode_Encoding_Form">Unicode encoding form</termref> is said to be in NFC if it doesn't
contain any combining sequence that could be replaced and if any remaining combining sequence is in
canonical order.</p>
</note>
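<p>The differences between the four Unicode normalization forms can be observed with the <code>unicodedata</code> module of the Python standard library. The sample string is an arbitrary choice for illustration: a compatibility character (the ligature U+FB01) followed by a space and a combining sequence.</p>
<eg><![CDATA[
```python
import unicodedata

# LATIN SMALL LIGATURE FI, space, c + COMBINING CEDILLA
sample = "\uFB01 c\u0327"

forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

for name, value in forms.items():
    print(name, " ".join(f"U+{ord(ch):04X}" for ch in value))
```
]]></eg>
<p>The two 'K' forms replace the ligature by the letters <qchar>f</qchar> and <qchar>i</qchar>, losing the compatibility distinction, whereas NFC keeps the ligature and only composes the combining sequence.</p>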
<p>For a list of programming resources related to normalization, see <specref ref="sec-n11n-resources"/>.</p></div3>
</div2>

<div2 id="sec-TextNormalization"><head>Definitions for W3C Text Normalization</head>

<p>For use on the Web, this document defines Web-related text
normalization forms by starting with Unicode Normalization Form C (<termref def="sec-ChoiceNFC">NFC</termref>), and
additionally addressing the issues of legacy encodings, character escapes, includes, and character and
markup boundaries. Examples illustrating these definitions can be found in <specref ref="sec-NormalizationExamples"/>.</p>

<div3 id="sec-UnicodeNormalized"><head>Unicode-normalized text</head>

<p>Text is, for the purposes of this specification, <term>Unicode-normalized</term> if it is in a
<termref def="Unicode_Encoding_Form">Unicode encoding form</termref> <emph>and</emph> is in Unicode
Normalization Form C, according to a version of
Unicode Standard Annex #15: Unicode Normalization Forms <bibref ref="UTR15"/> at least as recent as the oldest version of Unicode that contains 
all the characters actually present in the text, but no earlier than version 3.2 <bibref ref="unicode32"/>.</p>
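<p>A minimal Python sketch of a check for Unicode-normalized text. It assumes the text has already been decoded from a Unicode encoding form into a Python string, and it relies on whatever version of the Unicode data the Python installation ships with (reported by <code>unicodedata.unidata_version</code>).</p>
<eg><![CDATA[
```python
import unicodedata

# Version of the Unicode data used by this Python installation,
# e.g. (15, 0, 0). The definition above requires at least 3.2.
UNICODE_VERSION = tuple(int(p) for p in unicodedata.unidata_version.split("."))


def is_unicode_normalized(text: str) -> bool:
    """True if the decoded text is in Normalization Form C.

    unicodedata.is_normalized requires Python 3.8 or later.
    """
    return unicodedata.is_normalized("NFC", text)
```
]]></eg>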


</div3>

<div3 id="sec-IncludeNormalized"><head>Include-normalized text</head>

<p id="def-include">Markup languages, style languages and programming languages often offer
facilities for including a piece of text inside another. An
<term>include</term> is an instance of a syntactic device specified in a
language to include an <term>entity</term> at the position of the include,
replacing the include itself. Examples of includes are entity references in
XML, @import rules in CSS and the 
#include preprocessor statement in C/C++.
<termref def="sec-Escaping">Character escapes</termref> are a special case of includes where the included entity
is predetermined by the language.</p>

<p>Text is <term>include-normalized</term> if:

<olist>
 <item><p>the text is <termref def="sec-UnicodeNormalized">Unicode-normalized</termref> <emph>and</emph> does not contain any <termref def="sec-Escaping">character escapes</termref> or
  <termref def="def-include">includes</termref> whose expansion would cause the text to become no longer
  Unicode-normalized; or</p></item>
 <item><p>the text is in a legacy encoding <emph>and</emph>, if it were transcoded to a <termref def="Unicode_Encoding_Form">Unicode
  encoding form</termref> by a <termref def="def-normalizing-transcoder">normalizing transcoder</termref>, the
  resulting text would satisfy clause 1 above.</p></item>
</olist>
</p>

<note>
<p>A consequence of this definition is that legacy text (i.e. text in a legacy encoding) is always
include-normalized unless i) a normalizing transcoder cannot exist for that encoding (e.g. because the
repertoire contains characters not in Unicode) or ii) the text contains character escapes or includes which, once
expanded, result in un-normalized text.</p>
</note>

<note>
<p>Include-normalization is specified against
the context of a (computer) language (or the absence thereof), which specifies the form of character escapes and
includes. For plain text (no character escapes or includes) in a Unicode encoding form, include-normalization and
Unicode-normalization are equivalent.</p>
</note>
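<p>Clause 1 of the definition of include-normalized text can be approximated in Python as follows. The sketch assumes an XML or HTML context in which the only character escapes are character references that <code>html.unescape</code> can expand, and text already decoded from a Unicode encoding form. Other kinds of includes, such as references to external entities, are not handled.</p>
<eg><![CDATA[
```python
import html
import unicodedata


def is_include_normalized(text: str) -> bool:
    """Check NFC before and after expanding character references."""
    # The text as written must itself be Unicode-normalized.
    if not unicodedata.is_normalized("NFC", text):
        return False
    # Expanding the escapes must not denormalize it.
    expanded = html.unescape(text)
    return unicodedata.is_normalized("NFC", expanded)
```
]]></eg>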
</div3>

<div3 id="sec-FullyNormalized"><head>Fully-normalized text</head>

<p id="def-construct">Formal languages define <term>constructs</term>, which are identifiable pieces, occurring in
instances of the language, such as comments, identifiers, element tags, processing instructions, runs of
<termref def="def-char-data">character data</termref>, etc. During the normal processing of  <termref def="sec-IncludeNormalized">include-normalized</termref> text, these
various constructs may be moved, removed (e.g. removing comments)
or merged (e.g. merging all the
<termref def="def-char-data">character data</termref> within an element as done by the <code>string()</code> function of XPath), creating
opportunities for text to become denormalized. The software performing those operations then has to
re-normalize the result, which is a burden. One way to avoid such denormalization is to make sure
that the various important constructs never begin with a character such that appending that character to
a normalized string can cause the string to become denormalized.
A <term>composing character</term> is a character that is one or both of
the following:

<olist>
 <item><p>the second character in the canonical decomposition mapping of some primary 
  composite (as defined in <loc href="http://www.unicode.org/unicode/reports/tr15/#D3">D3</loc> of
  <bibref ref="UTR15"/>), or</p></item>
 <item><p>of non-zero canonical combining class (as defined in <bibref ref="unicode"/>).</p></item>
</olist>
</p>

<p>Please
consult Appendix <specref ref="sec-ComposingChars"/> for a discussion of composing characters, which are
not exactly the same as Unicode combining characters.</p>
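<p>The two clauses of the definition of composing character can be approximated with the Python <code>unicodedata</code> module. This is a sketch under stated assumptions: a character with a two-character canonical decomposition is treated as a primary composite when NFC leaves it unchanged, and Hangul syllables, whose decomposition is algorithmic rather than listed in the character database, are not covered. The results depend on the Unicode version shipped with the Python installation.</p>
<eg><![CDATA[
```python
import sys
import unicodedata


def _second_chars_of_primary_composites() -> frozenset:
    """Collect second characters of canonical decomposition mappings."""
    found = set()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        # Skip characters without a mapping and compatibility mappings,
        # which start with a "<tag>".
        if not mapping or mapping.startswith("<"):
            continue
        parts = mapping.split()
        # Assumption: a composite that NFC leaves unchanged is not
        # excluded from composition, i.e. it is a primary composite.
        if len(parts) == 2 and unicodedata.normalize("NFC", ch) == ch:
            found.add(chr(int(parts[1], 16)))
    return frozenset(found)


_SECOND_CHARS = _second_chars_of_primary_composites()


def is_composing(ch: str) -> bool:
    """True if ch satisfies either clause of the definition (sketch)."""
    return unicodedata.combining(ch) != 0 or ch in _SECOND_CHARS
```
]]></eg>
<p>U+0327 <uname>COMBINING CEDILLA</uname> qualifies under the second clause (non-zero combining class), while U+09BE <uname>BENGALI VOWEL SIGN AA</uname> has combining class zero and qualifies only under the first clause.</p>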





<p>Text is <term>fully-normalized</term> if:

<olist>
 <item><p>the text is in a <termref def="Unicode_Encoding_Form">Unicode encoding form</termref>, is <termref def="sec-IncludeNormalized">include-normalized</termref>
  and none of the constructs comprising the text begin with a <termref def="def-construct">composing character</termref> or a <termref def="sec-Escaping">character
  escape</termref> representing a composing character; or</p></item>
 <item><p>the text is in a legacy encoding and, if it were transcoded to a <termref def="Unicode_Encoding_Form">Unicode encoding form</termref>
  by a <termref def="def-normalizing-transcoder">normalizing transcoder</termref>, the resulting text would satisfy clause 1 above.</p></item>
</olist>
</p>



<note>
<p>Full-normalization is specified against the context of a (computer) language (or the absence thereof),
which specifies the form of character escapes and <termref def="def-include">includes</termref> and the separation into constructs. For plain text (no
includes, no constructs, no character escapes) in a Unicode encoding form, full-normalization and Unicode-normalization are equivalent.</p>
</note>
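<p>Clause 1 of the definition of fully-normalized text can be sketched in Python as follows. Several assumptions are made: the caller has already split the text into its relevant constructs, the context is XML or HTML with character references as the only escapes, the text is already decoded from a Unicode encoding form, and composing characters are approximated from the Python Unicode database without covering Hangul. All names are illustrative.</p>
<eg><![CDATA[
```python
import html
import re
import sys
import unicodedata
from typing import Iterable

# Second characters of two-character canonical decompositions of characters
# that NFC leaves unchanged. Hangul is not covered by this sketch.
_SECOND_CHARS = set()
for _cp in range(sys.maxunicode + 1):
    _parts = unicodedata.decomposition(chr(_cp)).split()
    if (len(_parts) == 2 and not _parts[0].startswith("<")
            and unicodedata.normalize("NFC", chr(_cp)) == chr(_cp)):
        _SECOND_CHARS.add(chr(int(_parts[1], 16)))

# A numeric character reference at the very start of a construct.
_LEADING_NCR = re.compile(r"&#(?:[xX][0-9A-Fa-f]+|[0-9]+);")


def is_composing(ch: str) -> bool:
    return unicodedata.combining(ch) != 0 or ch in _SECOND_CHARS


def is_fully_normalized(constructs: Iterable[str]) -> bool:
    """constructs: the relevant constructs of a text, in document order."""
    constructs = list(constructs)
    text = "".join(constructs)
    # Include-normalization: NFC before and after expanding escapes.
    if not (unicodedata.is_normalized("NFC", text)
            and unicodedata.is_normalized("NFC", html.unescape(text))):
        return False
    for construct in constructs:
        if not construct:
            continue
        first = construct[0]
        leading = _LEADING_NCR.match(construct)
        if leading:
            # The construct begins with an escape: look at what it represents.
            first = html.unescape(leading.group(0))
        if len(first) == 1 and is_composing(first):
            return False
    return True
```
]]></eg>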

<p>Identification of the constructs that should be prohibited from 
beginning with a <termref def="def-construct">composing character</termref> (the <term>relevant 
constructs</term>) is language-dependent. As specified in
<specref ref="sec-NormalizationApplication"/>, it is the responsibility of the
specification for a language to specify exactly what constitutes a relevant construct. This may be 
done by specifying important boundaries, 
taking into account which operations would benefit the most from being protected against 
denormalization. The relevant constructs are  then defined
as the spans of text between the boundaries. At a minimum, for those languages
which have these notions, the important boundaries are entity (include)
boundaries as well as the boundaries between most <termref def="def-char-data">markup</termref> and <termref def="def-char-data">character data</termref>. Many
languages will benefit from defining more boundaries and therefore
finer-grained full-normalization constructs.</p>

<note><p>In general, it will be advisable
<emph>not</emph> to include character escapes designed to express arbitrary characters
among the relevant constructs; the reason is that
including them would prevent the expression of combining sequences using character escapes
(e.g. <qchar>q&amp;#x30C;</qchar> for q-caron), which is especially important in legacy
encodings that lack the desired combining marks.</p></note>

<note>
<p>Full-normalization is closed under concatenation: the concatenation of two fully-normalized strings is also fully-normalized. As a result, a side benefit of
including entity boundaries in the set of boundaries important for full-normalization is that the state of normalization of a document that includes
entities can be assessed <emph>without</emph> expanding the <termref def="def-include">includes</termref>, if the
included entities are known to be fully-normalized. If all the entities are
known to be include-normalized <emph>and</emph> not to start with a <termref def="def-construct">composing
character</termref>, then it can be concluded that including the entities would not
denormalize the document.</p>
</note>
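<p>The difference between Unicode-normalized and fully-normalized strings under concatenation can be checked with a short Python sketch. The strings are arbitrary examples, and decoded Python strings stand in for text in a Unicode encoding form.</p>
<eg><![CDATA[
```python
import unicodedata


def nfc(text: str) -> bool:
    return unicodedata.is_normalized("NFC", text)


# Both pieces are Unicode-normalized on their own, but the second one
# begins with a composing character (COMBINING CEDILLA).
left, right = "suc", "\u0327on"
left_ok, right_ok = nfc(left), nfc(right)
joined_ok = nfc(left + right)

# Neither of these pieces begins with a composing character.
safe_left, safe_right = "su", "\u00E7on"
safe_joined_ok = nfc(safe_left + safe_right)
```
]]></eg>
<p>The first concatenation produces the replaceable sequence c + combining cedilla and is therefore no longer normalized; the second one stays normalized.</p>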


		</div3>


		</div2>
		<div2 id="sec-NormalizationExamples"><head>Examples</head><p>In some of the following examples, <qchar>¸</qchar> is used to depict the
character U+0327 <uname>COMBINING CEDILLA</uname>, for the purposes of illustration.
Had a real U+0327 been used instead of this spacing (non-combining) variant, some browsers might combine it with a preceding <qchar>c</qchar>, resulting
in a display indistinguishable from a U+00E7 <qchar>ç</qchar> and a loss of
understandability of the examples. In addition, if the sequence c + combining cedilla
were present, this document would not be include-normalized and would therefore not conform to itself.</p> 

<p>It is also assumed that the example strings
are relevant constructs for the purposes of full-normalization.</p><div3 id="sec-GeneralExamples"><head>General examples</head>



<p>The string
<code>suçon</code> (U+0073
U+0075 U+00E7 U+006F U+006E) encoded in a Unicode encoding form, is
Unicode-normalized, include-normalized and fully-normalized. The same string encoded in a legacy encoding for which there exists
a normalizing transcoder would be both include-normalized and fully-normalized but not Unicode-normalized (since not in a Unicode encoding
form).</p>

<p>In an XML or HTML context, the string <code>su&amp;#xE7;on</code>
is also include-normalized, fully-normalized and, if
encoded in a Unicode encoding form, Unicode-normalized. Expanding &amp;#xE7;
yields <code>suçon</code> as above, which contains no replaceable combining
sequence.</p>

<p>The string <code>suc¸on</code> (U+0073 U+0075 U+0063 <emph>U+0327</emph> U+006F
U+006E), where U+0327 is the <uname>COMBINING CEDILLA</uname>, encoded in a
Unicode encoding form, is not Unicode-normalized, since the combining
sequence <qchar>c¸</qchar> (U+0063 U+0327) should be replaced by the precomposed
<qchar>ç</qchar> (U+00E7).  As a consequence this string is neither include-normalized (since in
a Unicode encoding form but not Unicode-normalized) nor fully-normalized
(since not include-normalized). Note however that the string <code>sub¸on</code> (U+0073 U+0075 <emph>U+0062</emph> U+0327 U+006F
U+006E) in a Unicode encoding form <emph>is</emph> Unicode-normalized since there is no precomposed form of <qchar>b</qchar> plus cedilla. It is also include-normalized and fully-normalized.</p>

<p>In plain text the string <code>suc&amp;#x0327;on</code> is Unicode-normalized, since
plain text doesn't recognise that &amp;#x0327; represents a character in XML or HTML and
considers it just a sequence of non-replaceable characters.</p>

<p>In an XML or
HTML context, however, expanding &amp;#x0327; yields the string  <code>suc¸on</code> (U+0073
U+0075 U+0063 <emph>U+0327</emph> U+006F U+006E) which is not Unicode-normalized
(<qchar>c¸</qchar> is replaceable by <qchar>ç</qchar>). As a consequence the string is neither
include-normalized nor fully-normalized. As another example, if the entity reference <code>&amp;word-end;</code>
refers to an entity containing <code>¸on</code> (U+0327 U+006F U+006E), then the string
<code>suc&amp;word-end;</code> is not include-normalized for the same reasons.</p>

<p>In an XML or HTML context, expanding &amp;#x0327; in the string <code>sub&amp;#x0327;on</code>
yields the string <code>sub¸on</code> which <emph>is</emph> Unicode-normalized since there is no
precomposed character for <qterm>b cedilla</qterm> in NFC.  This string is therefore also
include-normalized.  Similarly, the string <code>sub&amp;word-end;</code> (with <code>&amp;word-end;</code> as
above) is include-normalized, for the same reasons.</p>



<p>In an XML or HTML context, the strings <code>¸on</code>  (U+0327 U+006F
U+006E) and
<code>&amp;#x0327;on</code> are not fully-normalized, as they begin with a
composing character (after expansion of the character escape for the second). However,
both are Unicode-normalized (if expressed in a Unicode encoding) and
include-normalized.</p>

<p>The following table consolidates the above examples.</p>

<figure><table border="1" cellpadding="5" cellspacing="0" summary="Consolidated table of normalization examples">
 <thead>
  <tr><th>String</th><th>Encoding</th><th>Context</th><th>Unicode-normalized</th><th>Include-normalized</th><th>Fully-normalized</th></tr>
 </thead>
 <tbody>
  <tr align="center">
   <td rowspan="4">suçon</td><td rowspan="2">Unicode</td><td>Plain text</td><td>Y</td><td>Y</td><td>Y</td>
  </tr>
  <tr align="center">
   <td>XML/HTML</td><td>Y</td><td>Y</td><td>Y</td>
  </tr>
  <tr align="center">
   <td rowspan="2">Legacy</td><td>Plain text</td><td>N</td><td>Y</td><td>Y</td>
  </tr>
  <tr align="center">
   <td>XML/HTML</td><td>N</td><td>Y</td><td>Y</td>
  </tr>

  <tr align="center">
   <td rowspan="4">su&amp;#xE7;on</td><td rowspan="2">Unicode</td><td>Plain text</td><td>Y</td><td>Y</td><td>Y</td>
  </tr>
  <tr align="center">
   <td>XML/HTML</td><td>Y</td><td>Y</td><td>Y</td>
  </tr>
  <tr align="center">
   <td rowspan="2">Legacy</td><td>Plain text</td><td>N</td><td>Y</td><td>Y</td>
  </tr>
  <tr align="center">
   <td>XML/HTML</td><td>N</td><td>Y</td><td>Y</td>
  </tr>

  <tr align="center">
   <td rowspan="2">suc¸on</td><td rowspan="2">Unicode</td><td>Plain text</td><td>N</td><td>N</td><td>N</td>
  </tr>
  <tr align="center">
   <td>XML/HTML</td><td>N</td><td>N</td><td>N</td>
  </tr>

  <tr align="center">
   <td rowspan="4">suc&amp;#x327;on</td><td rowspan="2">Unicode</td><td>Plain text</td><td>Y</td><td>Y</td><td>Y</td>
  </tr>
  <tr align="center">
   <td>XML/HTML</td><td>Y</td><td>N</td><td>N</td>
  </tr>
  <tr align="center">
   <td rowspan="2">Legacy</td><td>Plain text</td><td>N</td><td>Y</td><td>Y</td>
  </tr>
  <tr align="center">
   <td>XML/HTML</td><td>N</td><td>N</td><td>N</td>
  </tr>

  <tr align="center">
   <td rowspan="2">¸on</td><td rowspan="2">Unicode</td><td>Plain text</td><td>Y</td><td>Y</td><td>N</td>
  </tr>
  <tr align="center">
   <td>XML/HTML</td><td>Y</td><td>Y</td><td>N</td>
  </tr>

  <tr align="center">
   <td rowspan="4">&amp;#x327;on</td><td rowspan="2">Unicode</td><td>Plain text</td><td>Y</td><td>Y</td><td>Y</td>
  </tr>
  <tr align="center">
   <td>XML/HTML</td><td>Y</td><td>Y</td><td>N</td>
  </tr>
  <tr align="center">
   <td rowspan="2">Legacy</td><td>Plain text</td><td>N</td><td>Y</td><td>Y</td>
  </tr>
  <tr align="center">
   <td>XML/HTML</td><td>N</td><td>Y</td><td>N</td>
  </tr>
 </tbody>
</table></figure>
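<example><p>The Unicode-normalized and include-normalized columns can be checked mechanically for text in a Unicode encoding. The following is a minimal sketch in Python and is not part of this specification. It rests on several assumptions: the function names are hypothetical, NFC is taken as the Unicode normalization form, the standard library function <code>html.unescape</code> stands in for the expansion of character escapes (includes are not handled), and a non-zero canonical combining class is used as an approximation of <qterm>composing character</qterm>.</p>
<eg><![CDATA[
```python
import html
import unicodedata


def is_unicode_normalized(text):
    """True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


def is_include_normalized(text):
    """True if the text is NFC both before and after escapes are expanded.

    html.unescape is an assumed stand-in for a parser's expansion of
    character escapes; includes (e.g. external entities) are not handled.
    """
    return is_unicode_normalized(text) and is_unicode_normalized(html.unescape(text))


def starts_with_combining_mark(text):
    """Approximation: the first character has a non-zero combining class."""
    return bool(text) and unicodedata.combining(text[0]) != 0


CEDILLA = "\u0327"  # COMBINING CEDILLA

for sample in ["su\u00e7on", "suc" + CEDILLA + "on", "suc&#x327;on", "&#x327;on"]:
    print(
        ascii(sample),
        is_unicode_normalized(sample),
        is_include_normalized(sample),
        starts_with_combining_mark(html.unescape(sample)),
    )
```
]]></eg>
<p>A string whose expansion starts with a combining mark cannot be fully-normalized, which is why the last column of the sketch is useful as a quick first test before a complete check against construct boundaries.</p></example>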






           
</div3><div3 id="sec-XMLExamples"><head>Examples of XML in a Unicode encoding</head><p>Here is another
summary table, with more examples but limited to XML in a Unicode
encoding. The following list describes what the entities contain and how special characters are used. Normalized forms are indicated using <qchar>Y</qchar>, other forms using <qchar>-</qchar>. Unicode has no precomposed <qterm>b with cedilla</qterm>, so <qchar>b</qchar> followed by a combining cedilla is already in NFC.<ulist>
<item><p><quote>&amp;ccedil;</quote> <uname>LATIN SMALL LETTER C WITH CEDILLA</uname></p></item>
<item><p><quote>&amp;cedilla;</quote> <uname>COMBINING CEDILLA</uname></p></item>
<item><p><quote>&amp;c;</quote> <uname>LATIN SMALL LETTER C</uname></p></item>
<item><p><quote>&amp;b;</quote> <uname>LATIN SMALL LETTER B</uname></p></item>
<item><p><quote>¸</quote> <uname>COMBINING CEDILLA</uname></p></item>
<item><p><quote>/</quote> (immediately before <qterm>on</qterm> in last example) <uname>COMBINING LONG SOLIDUS OVERLAY</uname></p></item>
</ulist></p>
<figure><table border="1" cellpadding="5" cellspacing="0" summary="A table summarizing which combinations of characters, character escapes, includes and constructs correspond to which type of normalization.">
 <thead>
  <tr><th>String</th><th align="center">Unicode-normalized</th><th align="center">Include-normalized</th><th align="center">Fully-normalized</th></tr>
 </thead>
 <tbody>
  <tr><td>suçon</td><td align="center">Y</td><td align="center">Y</td><td align="center">Y</td></tr>
  <tr><td>sub¸on</td><td align="center">Y</td><td align="center">Y</td><td align="center">Y</td></tr>
  <tr><td>su&amp;#xE7;on</td><td align="center">Y</td><td align="center">Y</td><td align="center">Y</td></tr>
  <tr><td>sub&amp;#x327;on</td><td align="center">Y</td><td align="center">Y</td><td align="center">Y</td></tr>
  <tr><td>su&amp;#x62;¸on</td><td align="center">Y</td><td align="center">Y</td><td align="center">Y</td></tr>
  <tr><td>su&amp;ccedil;on</td><td align="center">Y</td><td align="center">Y</td><td align="center">Y</td></tr>
  <tr><td>su&lt;![CDATA[çon]]&gt;</td><td align="center">Y</td><td align="center">Y</td><td align="center">Y</td></tr>
  <tr><td>su&amp;b;¸on</td><td align="center">Y</td><td align="center">Y</td><td align="center">-</td></tr>
  <tr><td>sub&amp;cedilla;on</td><td align="center">Y</td><td align="center">Y</td><td align="center">-</td></tr>
  <tr><td>suc&lt;!--comment--&gt;¸on</td><td align="center">Y</td><td align="center">Y</td><td align="center">-</td></tr>
  <tr><td>sub&lt;!--comment--&gt;¸on</td><td align="center">Y</td><td align="center">Y</td><td align="center">-</td></tr>
  <tr><td>suc&lt;em&gt;¸&lt;/em&gt;on</td><td align="center">Y</td><td align="center">Y</td><td align="center">-</td></tr>
  <tr><td>sub&lt;em&gt;¸&lt;/em&gt;on</td><td align="center">Y</td><td align="center">Y</td><td align="center">-</td></tr>
  <tr><td>suc&lt;?proc-instr?&gt;¸on</td><td align="center">Y</td><td align="center">Y</td><td align="center">-</td></tr>
  <tr><td>sub&lt;?proc-instr?&gt;¸on</td><td align="center">Y</td><td align="center">Y</td><td align="center">-</td></tr>
  <tr><td>sub&lt;![CDATA[¸on]]&gt;</td><td align="center">Y</td><td align="center">Y</td><td align="center">-</td></tr>
  <tr><td>su&amp;c;¸on</td><td align="center">Y</td><td align="center">-</td><td align="center">-</td></tr>
  <tr><td>suc&amp;#x327;on</td><td align="center">Y</td><td align="center">-</td><td align="center">-</td></tr>
  <tr><td>su&amp;#x63;¸on</td><td align="center">Y</td><td align="center">-</td><td align="center">-</td></tr>
  <tr><td>suc&amp;cedilla;on</td><td align="center">Y</td><td align="center">-</td><td align="center">-</td></tr>
  <tr><td>suc&lt;![CDATA[¸on]]&gt;</td><td align="center">Y</td><td align="center">-</td><td align="center">-</td></tr>
  <tr><td>suc¸on</td><td align="center">-</td><td align="center">-</td><td align="center">-</td></tr>
  <tr><td>suç&lt;em&gt;/on&lt;/em&gt;</td><td align="center">-</td><td align="center">-</td><td align="center">-</td></tr>
 </tbody>
</table></figure>



<note>
  <p>
  From the last example in the table above, it follows that it is impossible to produce a
    normalized XML or HTML document containing the character U+0338
    <uname>COMBINING LONG SOLIDUS OVERLAY</uname> immediately following an element
    tag, comment, CDATA section or processing instruction, since the U+0338
<qchar>/</qchar> combines with the <qchar>&gt;</qchar> (yielding U+226F
<uname>NOT GREATER-THAN</uname>). It is noteworthy that
    U+0338 <uname>COMBINING LONG SOLIDUS OVERLAY</uname> also combines with
    <qchar>&lt;</qchar>, yielding U+226E <uname>NOT LESS-THAN</uname>.
    Consequently, U+0338 <uname>COMBINING LONG SOLIDUS OVERLAY</uname> should
    remain excluded from the initial character of XML identifiers.</p>
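  <p>The two compositions mentioned in this note can be observed with any NFC implementation. The sketch below uses the <code>unicodedata</code> module of the Python standard library; the helper name <code>nfc</code> and the sample string are illustrative assumptions, not part of this specification.</p>
<eg><![CDATA[
```python
import unicodedata

OVERLAY = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY


def nfc(text):
    """Return the NFC form of the text."""
    return unicodedata.normalize("NFC", text)


# The last row of the table: the overlay directly follows the '>' of a tag.
sample = "su\u00e7<em>" + OVERLAY + "on</em>"
normalized = nfc(sample)

# After normalization the '>' that closed the start tag has been replaced.
print(ascii(sample))
print(ascii(normalized))
```
]]></eg>
  <p>Because normalization replaces the closing <qchar>&gt;</qchar> of the tag, the normalized text is no longer the same markup, which is the reason the sequence cannot appear in a normalized document.</p>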
</note></div3><div3 id="sec-Restrictions"><head>Examples of restrictions on the use of combining characters</head><p>Include-normalization and full-normalization create restrictions on the use
            of combining characters. The following examples discuss various such
            potential restrictions and how they can be addressed.</p> 
          <p>Full-normalization prevents the markup of an isolated combining mark,
            for example for styling it differently from its base character 
            (<code>Benoi&lt;span style='color: blue'&gt;^&lt;/span&gt;t</code>, where <qchar>^</qchar> represents a combining circumflex). However, the equivalent effect can be achieved by assigning a class
            to the accents in an SVG font or using equivalent technology. 
            <loc href="benoit.svg">View an example using SVG</loc> (SVG-enabled
            browsers only).</p> 
          <p>Full-normalization prevents the use of entities for expressing composing
            characters. This limitation can be circumvented by using character escapes or by
            using entities representing complete combining character sequences.
            With appropriate entity definitions, instead of <code>A&amp;acute;</code>,
            write <code>&amp;Aacute;</code> (or better, use <qchar>Á</qchar> directly).</p></div3></div2><div2 id="sec-NormalizationApplication"><head>Responsibility for Normalization</head>

		  <p>This section defines the responsibility for normalization, based on the goal of early uniform normalization.</p>
		  <p>Unless otherwise specified, the word <qterm>normalization</qterm> in this section may refer to <qterm>include-normalization</qterm> or <qterm>full-normalization</qterm>, whichever is more appropriate for the specification or implementation under consideration.</p>
		  <p id="def-normalization-sensitive">An operation is <term>normalization-sensitive</term> if its output(s) are different depending on the state of normalization of the input(s); if the output(s) are textual, they are deemed different only if they would remain different were they to be normalized. Such operations include any that involve comparison of characters or character counting, as well as some other operations such as ‘delete first character’ or ‘delete last character’.</p>
		  <p id="def-TPC">A <term>text-processing component</term> is a component that recognizes data as text. This specification does not specify the boundaries of a text-processing component, which may be as small as one line of code or as large as a complete application. A text-processing component may receive text, produce text, or both.</p>
		  <p id="def-suspect-text"><term>Certified text</term> is text which satisfies at least one of the following conditions: <olist><item><p>it has been confirmed through inspection that the text is in normalized form;</p></item><item><p>the source <termref def="def-TPC">text-processing component</termref> is identified and is known to produce only normalized text.</p></item></olist></p>
		  <p><term>Suspect text</term> is text which is not certified.</p>

		  <p><req><req-type>C</req-type><req-text>In order to conform to this specification, 
		  all text content on the Web <rfc2119>MUST</rfc2119> be
      in <termref def="sec-IncludeNormalized">include-normalized</termref> form and <rfc2119>SHOULD</rfc2119> be in fully-normalized
      form.</req-text></req></p>

      <p><req><req-type>S</req-type><req-text>Specifications of text-based formats and protocols
      <rfc2119>MUST</rfc2119>, as part of their syntax definition, require that the text be in
      normalized form.</req-text></req></p>

      <p><req><req-type>S</req-type><req-type>I</req-type><req-text>A <termref def="def-TPC">text-processing component</termref> that
      receives <termref def="def-suspect-text">suspect text</termref> <rfc2119>MUST NOT</rfc2119> perform any <termref def="def-normalization-sensitive">normalization-sensitive</termref> operations
      unless it has first confirmed through inspection that the text is in normalized form, and
      <rfc2119>MUST NOT</rfc2119> normalize the <termref def="def-suspect-text">suspect text</termref>.  Private agreements
      <rfc2119>MAY</rfc2119>, however, be created within private systems which are not subject to
      these  rules, but any externally observable results <rfc2119>MUST</rfc2119> be the same as
      if the rules had been obeyed.</req-text></req></p>

      <p><req><req-type>I</req-type><req-text>A <termref def="def-TPC">text-processing component</termref> which modifies text and
      performs <termref def="def-normalization-sensitive">normalization-sensitive</termref> operations <rfc2119>MUST</rfc2119> behave <emph>as
      if</emph> normalization took place after each modification, so that any subsequent
      <termref def="def-normalization-sensitive">normalization-sensitive</termref> operations always behave <emph>as if</emph> they were dealing with
      normalized text.</req-text></req></p>

      <example><p>If the <qchar>z</qchar> is deleted
      from the (normalized) string <code>cz¸</code> (where <qchar>¸</qchar> represents a combining
      cedilla, U+0327), normalization is necessary to turn the denormalized result <code>c¸</code>
      into the properly normalized <code>ç</code>. If the software that deletes the <qchar>z</qchar> later uses the string in a
      <termref def="def-normalization-sensitive">normalization-sensitive</termref> operation, it needs to normalize the string before this operation to
      ensure correctness; otherwise, normalization may be deferred until the data is
      exposed. Analogous cases exist for insertion and
      concatenation (e.g.
<code>xf:concat(xf:substring('cz¸', 1, 1), xf:substring('cz¸', 3, 1))</code> in
XQuery <bibref ref="xquery-operators"/>).</p></example>
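<example><p>The deletion case above can be sketched in Python as follows. This is an illustration only: the function name <code>delete_char</code> is hypothetical, NFC is assumed as the normalization form, and for simplicity the whole string is re-normalized after the edit.</p>
<eg><![CDATA[
```python
import unicodedata

CEDILLA = "\u0327"  # COMBINING CEDILLA


def delete_char(text, index):
    """Delete the character at a 0-based index, then restore NFC."""
    edited = text[:index] + text[index + 1:]
    return unicodedata.normalize("NFC", edited)


original = "cz" + CEDILLA          # normalized: no precomposed z with cedilla
denormalized = original[:1] + original[2:]   # 'c' directly followed by the cedilla
result = delete_char(original, 1)  # the single precomposed character U+00E7

print(ascii(original), ascii(denormalized), ascii(result))
```
]]></eg>
<p>The intermediate value <code>denormalized</code> corresponds to the output of the concatenation expression above when no normalization is applied.</p></example>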

      <note><p>Software that denormalizes a string such as in the deletion example above does not
      need to perform a potentially expensive re-normalization of the whole string to ensure
      that the string is normalized.  It is sufficient to go back to the last non-<termref def="def-construct">composing character</termref>
      and re-normalize forward to the next non-composing character; if the string was normalized before
      the denormalizing operation, it will now be re-normalized.</p></note>

      <p><req><req-type>S</req-type><req-text>Specifications of  text-based languages and protocols
      <rfc2119>SHOULD</rfc2119> define precisely the <termref def="def-construct">construct</termref> boundaries necessary to obtain a
      complete definition of <termref def="sec-FullyNormalized">full-normalization</termref>. These definitions <rfc2119>MUST</rfc2119> include
      at least the boundaries between <termref def="def-char-data">markup</termref> and <termref def="def-char-data">character data</termref> as well as entity boundaries (if
      the language has any include mechanism) and <rfc2119>SHOULD</rfc2119> include any other
      boundary that may create denormalization when instances of the language are
      processed.</req-text></req></p>

<p><req><req-type>C</req-type><req-text>Even when authoring in a (formal) language that does not mandate
<termref def="sec-FullyNormalized">full-normalization</termref>, content developers <rfc2119>SHOULD</rfc2119> avoid <termref def="def-construct">composing characters</termref> at the beginning
of <termref def="def-construct">constructs</termref> that may be significant, such as at the beginning of an entity that will be included, immediately
after a <termref def="def-construct">construct</termref> that causes inclusion or immediately after <termref def="def-char-data">markup</termref>.</req-text></req>
<req><req-type>I</req-type><req-text>Authoring tool implementations for a (formal) language that does not mandate
<termref def="sec-FullyNormalized">full-normalization</termref> <rfc2119>SHOULD</rfc2119> prevent users from creating content with <termref def="def-construct">composing characters</termref> at the beginning
of <termref def="def-construct">constructs</termref> that may be significant, such as at the beginning of an entity that will be included, immediately
after a <termref def="def-construct">construct</termref> that causes inclusion or immediately after <termref def="def-char-data">markup</termref>, or <rfc2119>SHOULD</rfc2119> warn users
when they do so.</req-text></req></p>

      <p><req><req-type>S</req-type><req-text>Specifications <rfc2119>MUST</rfc2119> document any
      security issues related to normalization.</req-text></req></p>

      <p><req><req-type>I</req-type><req-text>Implementations which transcode text from a
      legacy encoding to a <termref def="Unicode_Encoding_Form">Unicode encoding form</termref> <rfc2119>MUST</rfc2119> use a <termref def="def-normalizing-transcoder">normalizing
      transcoder</termref>.</req-text></req></p>
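<example><p>A normalizing transcoder can be approximated by decoding the legacy bytes and then applying NFC. The sketch below is written under that assumption; the function name <code>transcode_normalized</code> is hypothetical, and the codec names are those of the Python standard library.</p>
<eg><![CDATA[
```python
import unicodedata


def transcode_normalized(data, legacy_encoding):
    """Decode bytes from a legacy encoding and return NFC text."""
    return unicodedata.normalize("NFC", data.decode(legacy_encoding))


# ISO-8859-1 encodes c-cedilla as the single byte E7; the result is already NFC.
latin1_text = transcode_normalized(b"su\xe7on", "iso-8859-1")

# Some legacy encodings, such as windows-1258, represent certain accents as
# separate combining characters, so a plain decode may yield non-NFC text.
vietnamese_text = transcode_normalized(b"a\xec", "cp1258")

print(ascii(latin1_text), ascii(vietnamese_text))
```
]]></eg>
<p>Applying normalization inside the transcoder means that downstream components receive text that is normalized regardless of how the legacy encoding represents accents.</p></example>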

      <p><req><req-type>S</req-type><req-text>Specifications of API components (functions/methods)
      that perform operations that may produce unnormalized text output from normalized text input
      <rfc2119>MUST</rfc2119> define whether normalization is the responsibility of the caller or
      the callee. Specifications <rfc2119>MAY</rfc2119> make performing normalization optional for
      some API components; in this case the default <rfc2119>SHOULD</rfc2119> be that normalization
      is performed, and an explicit option <rfc2119>SHOULD</rfc2119> be used to switch normalization
      off. Specifications <rfc2119>MUST NOT</rfc2119> make the implementation of normalization
      optional.</req-text></req></p>

      <p><req><req-type>S</req-type><req-text>Specifications that define a mechanism (for example
      an API or a defining language) for producing a document <rfc2119>SHOULD</rfc2119> require that the final output
      of this mechanism be normalized.</req-text></req></p>

<example><p>XSL Transformations <bibref ref="xslt"/> and the DOM Load &amp; Save specification <bibref ref="dom3ls"/> are examples of specifications that define text output and that should
specify that this output be in normalized form.</p></example>

		<note>
     <p>As an optimization, it is perfectly acceptable for a
				<emph>system</emph> to define the <termref def="def-recipient-producer">producer</termref> to be the actual producer (e.g. a
				small device) together with a remote component (e.g. a server serving as a kind
				of proxy) to which normalization is delegated. In such a case, the
				communications channel between the device and proxy server is considered to be
				<emph>internal</emph> to the system, not part of the Web. Only data normalized
				by the proxy server is to be exposed to the Web at large, as shown in the
				illustration below:</p> <figure><image><graphic source="images/producer_proxy.png" width="500" height="450"/><alt>Illustration of a text producer defined as including a proxy.</alt></image>
			 <caption>Illustration of a text producer defined as including a
				proxy.</caption></figure>
      <p>A similar case would be that of a Web repository receiving content from a user and
       noticing that the content is not properly normalized. If the user so requests, it would
       certainly be proper for the repository to normalize the content on behalf of the user,
       the repository becoming effectively part of the <termref def="def-recipient-producer">producer</termref> for the duration of that operation.</p>
		  </note></div2>


	 </div1>
	 <div1 id="sec-Compatibility"><head>Compatibility and Formatting
		  Characters</head>
		<p>This specification does not address the suitability of particular
		  characters for use in <termref def="def-char-data">markup languages</termref>, in particular formatting characters and
		  compatibility equivalents. For detailed recommendations about the use of
		  compatibility and formatting characters, see <titleref>Unicode in XML and other
		  Markup Languages</titleref> <bibref ref="UXML"/>.</p>
		<p><req><req-type>S</req-type><req-text>Specifications
		  <rfc2119>SHOULD</rfc2119> exclude compatibility characters from the syntactic
		  elements (markup, delimiters, identifiers) of the formats they
		  define.</req-text></req></p>
	 </div1>
	 <div1 id="sec-IdentityMatching"><head>String Identity Matching</head>
		<p>One important operation that depends on early normalization is
		  <term>string identity matching</term> <bibref ref="CharReq"/>, which is a
		  subset of the more general problem of string matching. There are various
		  degrees of specificity for string matching, from approximate matching such as
		  regular expressions or phonetic matching, to more specific matches such as
		  case-insensitive or accent-insensitive matching and finally to identity
		  matching. In the Web environment, where multiple encodings are used to
		  represent strings, including some encodings which allow multiple
		  representations for the same thing, <term>identity</term> is defined to occur
		  if and only if the compared strings contain no user-identifiable distinctions.
		  This definition is such that strings do not match when they differ in case or
		  accentuation, but do match when they differ only in ways that are not semantically
		  significant, such as encoding, use of <termref def="sec-Escaping">character escapes</termref> (of potentially different
		  kinds), or use of precomposed vs. decomposed character sequences.</p>
		<p id="sid-steps">To avoid unnecessary conversions and, more importantly,
		  to ensure predictability and correctness, it is necessary for all components of
		  the Web to use the same identity testing mechanism. Conformance to the rule
		  that follows meets this requirement and supports the above definition of
		  identity. <req><req-type>S</req-type><req-type>I</req-type><req-text>String
		  identity matching <rfc2119>MUST</rfc2119> be performed as if the following
		  steps were followed:</req-text>
		  <olist>
			 <item>
				<p>Early uniform normalization to fully-normalized form, as defined
          in <specref ref="sec-FullyNormalized"/>. In accordance with section
				  <specref ref="sec-Normalization"/>, this step <rfc2119>MUST</rfc2119> be
				  performed by the <emph>producers</emph> of the strings to be compared.</p>
			 </item>
			 <item>
				<p>Conversion to a common encoding of UCS, if necessary.</p>
			 </item>
			 <item>
				<p>Expansion of all recognized <termref def="sec-Escaping">character escapes</termref> and <termref def="def-include">includes</termref>.</p>
			 </item>
			 <item>
				<p>Testing for bit-by-bit identity.</p>
			 </item>
		  </olist></req></p>
		<p>Step 1 ensures 1) that the identity matching process can produce
		  correct results using the next three steps and 2) that a minimum of effort is
		  spent on solving the problem.</p>
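<example><p>Steps 2 to 4 can be sketched in Python as shown below. This is an illustration under stated assumptions rather than a reference implementation: the function name <code>identity_match</code> is hypothetical, step 1 is assumed to have been carried out by the producers, the context is assumed to be XML or HTML, and <code>html.unescape</code> from the standard library stands in for the expansion of character escapes (it also recognizes the HTML named character references).</p>
<eg><![CDATA[
```python
import html


def identity_match(data_a, encoding_a, data_b, encoding_b):
    """Compare two encoded strings for identity (steps 2 to 4).

    Step 1, normalization, is the producers' responsibility and is
    deliberately not repeated here.
    """
    # Step 2: conversion to a common representation of UCS.
    text_a = data_a.decode(encoding_a)
    text_b = data_b.decode(encoding_b)
    # Step 3: expansion of recognized character escapes.
    text_a = html.unescape(text_a)
    text_b = html.unescape(text_b)
    # Step 4: code point by code point comparison, which is equivalent
    # to bit-by-bit identity once both strings share one encoding.
    return text_a == text_b


utf8_literal = b"su\xc3\xa7on"
utf16_escaped = "su&#xE7;on".encode("utf-16-be")
print(identity_match(utf8_literal, "utf-8", utf16_escaped, "utf-16-be"))
```
]]></eg>
<p>Because the sketch does not normalize, a string such as <code>suc&amp;#x327;on</code> does not match <code>suçon</code>; detecting and rejecting such input is left to the step that certifies the text.</p></example>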
		<note>
		  <p>The expansion of character escapes and includes (step 3 above) is dependent on context,
			 i.e. on which markup or programming language is considered to apply when the
			 string matching operation is performed. Consider a search for the string
			 <qterm>suçon</qterm> in an XML document containing <code>su&amp;#xE7;on</code> but not <code>suçon</code>. If the search is performed in a plain text editor, the context is
			 <term>plain text</term> (no markup or programming language applies), the
			 <code>&amp;#xE7;</code> character escape is not recognized, hence not expanded and the search fails.
			 If the search is performed in an XML browser, the context is <term>XML</term>,
			 the character escape (defined by XML) is expanded and the search succeeds. </p>
		  <p>An intermediate case would be an XML editor that
			 <emph>purposefully</emph> provides a view of an XML document with entity
			 references left unexpanded. In that case, a search over that pseudo-XML view
			 will deliberately <emph>not</emph> expand entities: in that particular context,
			 entity references are not considered includes and need not be expanded.</p>
		</note>
		<p><req><req-type>S</req-type><req-type>I</req-type><req-text>Forms of
		  string matching other than identity matching <rfc2119>SHOULD</rfc2119> be performed as if the following steps were followed:</req-text><olist>
			 <item>
				<p>Steps 1 to 3 for <loc href="#sid-steps">string identity matching</loc>.</p>
			 </item>
			 
			 
			 <item>
				<p>Matching the strings in a way that is appropriate to the application.</p>
			 </item>
		  </olist></req></p><p>Appropriate methods of matching text outside of string identity matching can include such things as case-insensitive matching, accent-insensitive matching,  matching characters against Unicode compatibility forms,  expansion of abbreviations, matching of stemmed words, phonetic matching, etc.</p>
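<example><p>Two of the matching methods listed above can be sketched as comparison keys computed after steps 1 to 3. The function names are hypothetical, and the accent-insensitive key assumes that removing nonspacing marks from the canonical decomposition is an acceptable definition of <qterm>accent-insensitive</qterm> for the application; other definitions are possible.</p>
<eg><![CDATA[
```python
import unicodedata


def case_insensitive_key(text):
    """Key for case-insensitive matching, based on Unicode case folding."""
    return unicodedata.normalize("NFC", text.casefold())


def accent_insensitive_key(text):
    """Key for accent-insensitive matching: drop nonspacing marks (Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


target = "su\u00e7on"
print(case_insensitive_key("SU\u00c7ON") == case_insensitive_key(target))
print(accent_insensitive_key("sucon") == accent_insensitive_key(target))
```
]]></eg>
<p>Neither key is suitable for identity matching, since both deliberately discard distinctions that a user can identify.</p></example>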
	 <example><p>A user who specifies a search for the string <code>suçon</code> against a Unicode-encoded XML document would expect to find string identity matches against the strings <code>su&amp;#xE7;on</code>, <code>su&amp;#231;on</code> and <code>su&amp;ccedil;on</code> (where the entity <code>&amp;ccedil;</code> represents the precomposed character <qchar>ç</qchar>). Identity matches should also be found whether the string was encoded as <code>73 75 C3 A7 6F 6E</code> (in UTF-8) or <code>0073 0075 00E7 006F 006E</code> (in UTF-16), or in any other encoding that can be transcoded into normalized Unicode.</p>
	 <p>A match should never be attempted against strings such as <code>suc&amp;#x327;on</code> or <code>suc¸on</code>, since these are not fully-normalized and should cause the text to be rejected. If, however, matching is done against such strings, they should also match, since they are canonically equivalent.</p>
	 <p>Forms of matching other than identity, if supported by the application, would have to be used to produce a match against the following strings: <code>SUÇON</code> (case-insensitive matching), <code>sucon</code> (accent-insensitive matching), <code>suçons</code> (matched stems), <code>suçant</code> (phonetic matching), etc.</p></example></div1>
	 <div1 id="sec-Indexing"><head>String Indexing</head>
		<p>There are many situations where a software process needs to access a
		  substring or to point within a string and does so by the use of
		  <term>indices</term>, i.e. numeric <quote>positions</quote> within a string.
		  Where such indices are exchanged between components of the Web, there is a need
		  for an agreed-upon definition of string indexing in order to ensure consistent
		  behavior. The requirements for string indexing are discussed in
		  <titleref>Requirements for String Identity Matching</titleref>
		  <bibref ref="CharReq"/>,
		  <loc href="http://www.w3.org/TR/WD-charreq#4">section 4</loc>. The two
		  main questions that arise are: <quote>What is the unit of counting?</quote> and
		  <quote>Do we start counting at 0 or 1?</quote>.</p>
		<example><p>Consider the string <image><graphic source="images/surrogateDiffQcaron.gif" width="66" height="26"/><alt>Chinese character for 'stump of tree' does not equal Latin small letter q with combining caron</alt></image> encoded in UTF-16 in big-endian byte order. The rows of the following table show the
string viewed as a <termref def="def-character-string">character string</termref>, <termref def="def-physical-string">code unit string</termref> and <termref def="def-byte-string">byte string</termref>, respectively, each of which involves different units for indexing.</p>

<figure>
 <table border="1" cellpadding="5" cellspacing="0" summary="table displaying a string viewed as characters, code units and bytes">
  <tbody>
   <tr align="center">
    <th align="right">Character string</th>
    <td colspan="4">U+233B4 <image><graphic source="images/chineseSurrogate.gif" width="24" height="25"/><alt>Archaic Chinese character meaning "the stump of a tree" (still in current use in Cantonese)</alt></image></td>
    <td colspan="2">U+2260 <image><graphic source="images/not_equal.gif" width="25" height="26"/><alt>NOT EQUAL TO</alt></image></td>
    <td colspan="2">U+0071 <image><graphic source="images/Q.gif" width="14" height="21"/><alt>LATIN SMALL LETTER Q</alt></image></td>
    <td colspan="2">U+030C <image><graphic source="images/caron.gif" width="14" height="21"/><alt>COMBINING CARON</alt></image></td>
   </tr>
   <tr align="center">
    <th align="right">Code unit string</th>
    <td colspan="2">D84C</td>
    <td colspan="2">DFB4</td>
    <td colspan="2">2260</td>
    <td colspan="2">0071</td>
    <td colspan="2">030C</td>
   </tr>
   <tr align="center">
    <th align="right">Byte string</th>
    <td>D8</td>
    <td>4C</td>
    <td>DF</td>
    <td>B4</td>
    <td>22</td>
    <td>60</td>
    <td>00</td>
    <td>71</td>
    <td>03</td>
    <td>0C</td>
   </tr>
  </tbody>
 </table>
</figure>
</example><p>Depending on the particular requirements of a process, the unit of
		  counting may correspond to definitions of a string provided in
		  section <specref ref="sec-Strings"/>. In particular:
		  <ulist>
			 <item>
				<p><req><req-type>S</req-type><req-type>I</req-type><req-text>The
				  <termref def="def-character-string">character string</termref> is
				  <rfc2119>RECOMMENDED</rfc2119> as a basis for string indexing.</req-text></req>
				  (Example: the XML Path Language <bibref ref="xpath"/>).</p>
			 </item>
			 <item>
				<p><req><req-type>S</req-type><req-type>I</req-type><req-text>A
				  <termref def="def-physical-string">code unit string</termref>
				  <rfc2119>MAY</rfc2119> be used as a basis for string indexing if this results
				  in a significant improvement in the efficiency of internal operations when
				  compared to the use of <termref def="def-character-string">character string</termref>.</req-text></req> (Example: the use of
				  UTF-16 in <bibref ref="dom1"/>).</p>
			 </item>
		  </ulist></p>
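<example><p>The three views of the example string above yield three different lengths. The following Python sketch reproduces the counts; it relies on the fact that Python strings are indexed by code point, which corresponds to the character string view, and the variable names are illustrative.</p>
<eg><![CDATA[
```python
# U+233B4, U+2260 NOT EQUAL TO, U+0071 q, U+030C COMBINING CARON
text = "\U000233B4\u2260q\u030C"

utf16 = text.encode("utf-16-be")   # big-endian, no byte order mark

characters = len(text)             # character string view
code_units = len(utf16) // 2       # code unit string view (16-bit units)
byte_count = len(utf16)            # byte string view

print(characters, code_units, byte_count)
print(" ".join(f"{b:02X}" for b in utf16))
```
]]></eg>
<p>An index exchanged between two components therefore identifies the same position only if both agree on the unit of counting.</p></example>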
		<p>Counting <termref def="def-grapheme-string">graphemes</termref> will
		  become a good option where user interaction is the primary concern, once a
		  suitable definition is widely accepted. The use of <termref def="def-byte-string">byte strings</termref> for indexing is discouraged.</p>
		<p>It is noteworthy that there exist other, non-numeric ways of
		  identifying substrings which have favorable properties. For instance,
		  substrings based on string matching are quite robust against small edits;
		  substrings based on document structure (in structured formats such as XML) are
		  even more robust against edits and even against translation of a document from
		  one human language to another.
		  <req><req-type>S</req-type><req-text>Specifications that need a way to identify
		  substrings or point within a string <rfc2119>SHOULD</rfc2119> provide ways
		  other than string indexing to perform this operation.</req-text></req>
		  <req><req-type>I</req-type><req-type>C</req-type><req-text>Users of
		  specifications (software developers, content developers)
		  <rfc2119>SHOULD</rfc2119> whenever possible prefer ways other than string
		  indexing to identify substrings or point within a string.</req-text></req></p>
		<p>Experience shows that more general, flexible and robust specifications
		  result when individual characters are understood and processed as substrings,
		  identified by a position before and a position after the substring.
		  Understanding indices as boundary positions <emph>between</emph> the counting
		  units also makes it easier to relate the indices resulting from the different
		  string definitions. <req><req-type>S</req-type><req-text>Specifications
		  <rfc2119>SHOULD</rfc2119> understand and process single characters as
		  substrings, and treat indices as boundary positions <emph>between</emph>
		  counting units, regardless of the choice of counting
		  units.</req-text></req></p>
		<p><req><req-type>S</req-type><req-text>Specifications of APIs
		  <rfc2119>SHOULD NOT</rfc2119> specify single character or single encoding-unit
		  arguments.</req-text></req></p><example><p><code>uppercase('ß')</code> cannot return the proper result (the two-character string
		<qchar>SS</qchar>) if the return type of the <function>uppercase</function>
		function is defined to be a single character.</p></example>
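<example><p>A string-valued signature avoids this problem. In the sketch below, the wrapper name <code>uppercase</code> mirrors the example above and simply delegates to Python's built-in <code>str.upper</code>, which takes and returns strings of any length.</p>
<eg><![CDATA[
```python
def uppercase(text):
    """Uppercase a string; the result may be longer than the input."""
    return text.upper()


sharp_s = "\u00df"            # LATIN SMALL LETTER SHARP S
converted = uppercase(sharp_s)
print(len(sharp_s), len(converted), converted)
```
]]></eg>
</example>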
		<p>The issue of index origin, i.e. whether we count from 0 or 1, actually
		  arises only after a decision has been made on whether it is the units
		  themselves that are counted or the positions between the units.
		  <req><req-type>S</req-type><req-text>When the positions between the units are
		  counted for string indexing, starting with an index of 0 for the position at
		  the start of the string is the <rfc2119>RECOMMENDED</rfc2119> solution, with
		  the last index then being equal to the number of counting units in the
		  string.</req-text></req></p>
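<example><p>When indices denote boundary positions counted from 0, positions in one string view can be mapped directly to positions in another. The sketch below relates character boundaries to UTF-16 code unit boundaries for the example string used earlier in this section; the function name <code>utf16_boundaries</code> is hypothetical.</p>
<eg><![CDATA[
```python
def utf16_boundaries(text):
    """Return the UTF-16 code unit position of every character boundary."""
    boundaries = [0]
    for ch in text:
        # Characters outside the BMP take two 16-bit code units.
        width = 2 if ord(ch) > 0xFFFF else 1
        boundaries.append(boundaries[-1] + width)
    return boundaries


text = "\U000233B4\u2260q\u030C"

char_boundaries = list(range(len(text) + 1))
unit_boundaries = utf16_boundaries(text)

# A single character is the substring between two adjacent boundaries.
second_character = text[1:2]

print(char_boundaries, unit_boundaries, ascii(second_character))
```
]]></eg>
<p>In both views the first boundary is 0 and the last boundary equals the number of counting units, as recommended above.</p></example>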
	 </div1>
	 <div1 id="sec-URIs"><head>Character Encoding in URI References</head>
		<p>According to the definition in RFC 2396 <bibref ref="rfc2396"/>, URI
		  references are restricted to a subset of US-ASCII, with an escaping mechanism
		  to encode arbitrary byte values, using the %HH convention. However, the %HH
		  convention by itself is of limited use because there is no definitive mapping
		  from characters to bytes. Also, non-ASCII characters cannot be used directly.
		  <titleref>Internationalized Resource Identifiers (IRI)</titleref>
		  <bibref ref="uri-i18n"/> solves both problems with a uniform approach that
		  conforms to the
		  <termref def="sec-RefProcModel">Reference Processing Model</termref>. </p>
		<p><req><req-type>S</req-type><req-text>W3C specifications that define
		  protocol or format elements (e.g. HTTP headers, XML attributes, etc.) which are
		  to be interpreted as URI references (or specific subsets of URI references,
		  such as absolute URI references, URIs, etc.) <rfc2119>SHOULD</rfc2119> use
		  <titleref>Internationalized Resource Identifiers (IRI)</titleref>
		  <bibref ref="uri-i18n"/> (or an appropriate subset thereof).</req-text></req>
		  <req><req-type>S</req-type><req-text>W3C specifications <rfc2119>MUST</rfc2119>
		  define when the conversion from IRI references to URI references (or subsets
		  thereof) takes place, in accordance with <titleref>Internationalized Resource
		  Identifiers (IRI)</titleref> <bibref ref="uri-i18n"/>.</req-text></req></p>
		<note>
		  <p>Many current W3C specifications already contain provisions in
			 accordance with <titleref>Internationalized Resource Identifiers
			 (IRI)</titleref> <bibref ref="uri-i18n"/>. For XML 1.0 <bibref ref="xml10"/>,
			 see <xspecref href="http://www.w3.org/TR/REC-xml#sec-external-ent">Section
			 4.2.2, External Entities</xspecref>, and
			 <xspecref href="http://www.w3.org/XML/xml-V10-2e-errata#E26">Erratum
			 E26</xspecref>. XML Schema Part 2: Datatypes <bibref ref="xmlschema-2"/>
			 provides the <kw>anyURI</kw> datatype (see
			 <xspecref href="http://www.w3.org/TR/xmlschema-2/#anyURI">Section
			 3.2.17</xspecref>). The XML Linking Language (XLink) <bibref ref="xlink"/>
			 provides the href attribute (see
			 <xspecref href="http://www.w3.org/TR/xlink/#link-locators">Section 5.4, Locator
			 Attribute</xspecref>). Further information and links can be found at
			 <titleref>Internationalization: URIs and other identifiers</titleref>
			 <bibref ref="i18nuri"/>.</p>
		</note>
		<p><req><req-type>S</req-type><req-text>W3C specifications that define
		  new syntax for URIs, such as a new URI scheme or a new kind of fragment
		  identifier, <rfc2119>MUST</rfc2119> specify that characters outside the
		  US-ASCII repertoire are encoded using UTF-8 and %HH-escaping, in accordance
		  with <titleref>Guidelines for new URL Schemes</titleref> <bibref ref="rfc2718"/>,
        Section 2.2.5.</req-text></req>
        <req><req-type>S</req-type><req-text>Such specifications <rfc2119>SHOULD</rfc2119> also define the
        normalization requirements for the syntax they introduce.</req-text></req></p>
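		<example><p>The following sketch, in Python, suggests why normalization requirements matter for new URI syntax. The helper <code>escape_fragment</code> is hypothetical, and the choice of Normalization Form C is an assumption of the sketch; each specification has to state its own requirement.</p>
<eg xml:space="preserve"><![CDATA[
```python
import unicodedata
from urllib.parse import quote


def escape_fragment(fragment: str, form: str = "NFC") -> str:
    """Hypothetical helper: normalize, encode as UTF-8, then %HH-escape."""
    normalized = unicodedata.normalize(form, fragment)
    return quote(normalized, safe="", encoding="utf-8")


precomposed = "r\u00e9sum\u00e9"    # e-acute as a single character
decomposed = "re\u0301sume\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# Without normalization the two spellings escape to different strings.
raw_precomposed = quote(precomposed, safe="")
raw_decomposed = quote(decomposed, safe="")

print(raw_precomposed, raw_decomposed)
print(escape_fragment(precomposed), escape_fragment(decomposed))
```
]]></eg>
		<p>Two strings that look identical to a user produce different escaped forms unless the syntax says when and how normalization is applied.</p></example>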
	 </div1>

	 <div1 id="sec-RefUnicode"><head>Referencing the Unicode Standard and
		  ISO/IEC 10646</head>
		<p>Specifications often need to make references to the Unicode Standard
		  or International Standard ISO/IEC 10646. Such references must be made with
		  care, especially when normative. The questions to be considered are:
		  <ulist>
			 <item>
				<p>Which standard should be referenced?</p>
			 </item>
			 <item>
				<p>How to reference a particular version?</p>
			 </item>
			 <item>
				<p>When to use versioned vs unversioned references?</p>
			 </item>
		  </ulist></p>
		<p>ISO/IEC 10646 is developed and published jointly by
		  <loc href="http://www.iso.ch/">ISO</loc> (the International
		  Organization for Standardization) and
		  <loc href="http://www.iec.ch/">IEC</loc> (the International
		  Electrotechnical Commission). The Unicode Standard is developed and published
		  by the
		  <loc href="http://www.unicode.org/">Unicode Consortium</loc>, an
		  organization of major computer corporations, software producers, database
		  vendors, national governments, research institutions, international agencies,
		  various user groups, and interested individuals. The Unicode Standard is
		  comparable in standing to W3C Recommendations.</p>
		<p>ISO/IEC 10646 and Unicode define exactly the same <termref def="def-CCS">CCS</termref> (same
		  <termref def="def-repertoire">repertoire</termref>, same <termref def="def-CCS">code points</termref>) and encoding forms. They are actively maintained
		  in synchrony by liaisons and overlapping membership between the respective
		  technical committees. In addition to the jointly defined CCS and encoding
		  forms, the Unicode Standard adds normative and informative lists of character
		  properties, normative character equivalence and normalization specifications, a
		  normative algorithm for bidirectional text and a large amount of useful
		  implementation information. In short, Unicode adds semantics to the characters
		  that ISO/IEC 10646 merely enumerates. Conformance to Unicode implies
		  conformance to ISO/IEC 10646 (see <bibref ref="unicode30"/>, Appendix C).</p>
		<p><req><req-type>S</req-type><req-text>Since specifications in general
		  need both a definition for their characters and the semantics associated with
		  these characters, specifications <rfc2119>SHOULD</rfc2119> include a reference
		  to the Unicode Standard, whether or not they include a reference to ISO/IEC
		  10646.</req-text></req> A reference to the Unicode Standard lets
		  implementers benefit from the wealth of information provided in the
		  standard and on the Unicode Consortium Web site.</p>
		<p>The fact that both ISO/IEC 10646 and Unicode are evolving (in
		  synchrony) raises the issue of versioning: should a specification refer to a
		  specific version of the standard, or should it make a generic reference, so
		  that the normative reference is to the version current at the time of
		  <emph>reading</emph> the specification? In general the answer is
		  <emph>both</emph>. <req><req-type>S</req-type><req-text>A generic reference to
		  the Unicode Standard <rfc2119>MUST</rfc2119> be made if it is desired that
		  characters allocated after a specification is published are usable with that
		  specification. A specific reference to the Unicode Standard
		  <rfc2119>MAY</rfc2119> be included to ensure that functionality depending on a
		  particular version is available and will not change over time (an example would
		  be the set of characters acceptable as Name characters in XML 1.0
		  <bibref ref="xml10"/>, which is an enumerated list that parsers must implement
		  to validate names).</req-text></req></p>
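		<example><p>The difference between a specific and a generic reference can be sketched with Python's standard <code>unicodedata</code> module, which exposes both a database frozen at Unicode 3.2.0 (<code>ucd_3_2_0</code>) and the database of whatever version the interpreter ships. The sample character, U+20B9 INDIAN RUPEE SIGN, was allocated in Unicode 6.0; the sketch assumes an interpreter whose database is at least that recent.</p>
<eg xml:space="preserve"><![CDATA[
```python
import unicodedata

RUPEE = "\u20b9"  # allocated after Unicode 3.2

# Specific reference: behaviour pinned to one version, stable over time.
frozen_version = unicodedata.ucd_3_2_0.unidata_version
frozen_category = unicodedata.ucd_3_2_0.category(RUPEE)

# Generic reference: behaviour follows the version current at run time.
current_version = unicodedata.unidata_version
current_category = unicodedata.category(RUPEE)

print(frozen_version, frozen_category)
print(current_version, current_category)
```
]]></eg>
		<p>Under the frozen database the code point is unassigned (category <code>Cn</code>), whereas under the current database it is a currency symbol (category <code>Sc</code>). A specification that wants such later characters to be usable needs the generic reference.</p></example>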
		<note>
		  <p>See <loc href="http://www.unicode.org/unicode/standard/versions/#Citations">
				http://www.unicode.org/unicode/standard/versions/#Citations</loc> for guidance
			 on referring to specific versions of Unicode.</p>
		</note>
		<p>A generic reference can be formulated in two ways:
		  <olist>
			 <item>
				<p>By explicitly including a <emph>generic</emph> entry in the
				  bibliography section of a specification and simply referring to that entry in
				  the body of the specification. Such a generic entry contains text such as
				  <quote>... as it may from time to time be revised or amended</quote>.</p>
			 </item>
			 <item>
				<p>By including a <emph>specific</emph> entry in the bibliography
				  and adding text such as <quote>... as it may from time to time be revised or
				  amended</quote> at the point of reference in the body of the specification.</p>

			 </item>
		  </olist></p>
		<p>It is an editorial matter, best left to each specification, which of
		  these two formulations is used. Examples of the first formulation can be found
		  in the bibliography of this specification (see the entries for
		  <bibref ref="iso10646"/> and <bibref ref="unicode"/>). Examples of the latter,
		  as well as a discussion of the versioning issue with respect to MIME
		  <kw>charset</kw> parameters for UCS encodings, can be found in
		  <bibref ref="rfc2279"/> and <bibref ref="rfc2781"/>.</p>
		<p><req><req-type>S</req-type><req-text>All <emph>generic</emph>
		  references to Unicode <bibref ref="unicode"/> <rfc2119>MUST</rfc2119> refer to Unicode 3.0 <bibref ref="unicode30"/> or later.</req-text></req>
		  <req><req-type>S</req-type><req-text>Generic references to ISO/IEC 10646 <bibref ref="iso10646"/>
		  <rfc2119>MUST</rfc2119> be written such that they make allowance for the future
		  publication of additional <emph>parts</emph> of the standard. When referring to Part 1, they
		  <rfc2119>MUST</rfc2119> refer to ISO/IEC 10646-1:2000
		  <bibref ref="iso10646-2000"/> or later, including any
		  amendments.</req-text></req></p>
	 </div1>
  </body>
  <back>
     <div1 id="sec-References"><head>References</head>
		<div2 id="sec-NormativeReferences"><head>Normative
			 References</head><blist>
			 <bibl id="iana" key="IANA">Internet Assigned Numbers Authority,
				<titleref href="http://www.iana.org/assignments/character-sets">Official
				Names for Character Sets</titleref>. (See
				 <loc href="http://www.iana.org/assignments/character-sets">http://www.iana.org/assignments/character-sets</loc>.)
				</bibl>
			 <bibl id="iso10646" key="ISO/IEC 10646">ISO/IEC 10646-1:2000,
				<titleref href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819">Information
				technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
        Architecture and Basic Multilingual Plane</titleref> and ISO/IEC 10646-2:2001,
        <titleref href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=33208">Information
        technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 2:
        Supplementary Planes</titleref>, as, from time to time,
				amended, replaced by a new edition or expanded by the addition of new parts.
				(See
				<loc href="http://www.iso.ch/">http://www.iso.ch</loc> for the
				latest version.)</bibl>
       <bibl key="ISO/IEC 10646-1:2000" id="iso10646-2000">ISO/IEC
        10646-1:2000,
        <titleref href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819">Information
        technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
        Architecture and Basic Multilingual Plane</titleref>. (See
        <loc href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819">http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819</loc>.)
        </bibl>
       <bibl key="ISO/IEC 10646-2:2001" id="iso10646-2001">ISO/IEC
        10646-2:2001,
        <titleref href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=33208">Information
        technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 2:
        Supplementary Planes</titleref>. (See
        <loc href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=33208">http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=33208</loc>.)
        </bibl>
			 <bibl id="MIME" key="MIME"><titleref href="http://www.ietf.org/rfc/rfc2045.txt">Multipurpose Internet Mail
				Extensions (MIME). Part One: Format of Internet Message Bodies</titleref>, N.
				Freed, N. Borenstein, RFC 2045, November 1996,
				<loc href="http://www.ietf.org/rfc/rfc2045.txt">http://www.ietf.org/rfc/rfc2045.txt</loc>.
				<titleref>Part Two: Media Types</titleref>, N. Freed, N. Borenstein, RFC 2046,
				November 1996. <titleref>Part Three: Message Header Extensions for Non-ASCII
				Text</titleref>, K. Moore, RFC 2047, November 1996. <titleref>Part Four:
				Registration Procedures</titleref>, N. Freed, J. Klensin, J. Postel, RFC 2048,
				November 1996. <titleref>Part Five: Conformance Criteria and
				Examples</titleref>, N. Freed, N. Borenstein, RFC 2049, November 1996. </bibl>
			 
			 <bibl id="rfc2119" key="RFC 2119">S. Bradner,
				<titleref href="http://www.ietf.org/rfc/rfc2119.txt">Key words for use in RFCs
				to Indicate Requirement Levels</titleref>, IETF RFC 2119. (See
				<loc href="http://www.ietf.org/rfc/rfc2119.txt">http://www.ietf.org/rfc/rfc2119.txt</loc>.)
				</bibl>
			 <bibl id="rfc2396" key="RFC 2396">T. Berners-Lee, R. Fielding, L.
				Masinter, <titleref href="http://www.ietf.org/rfc/rfc2396.txt">Uniform Resource
				Identifiers (URI): Generic Syntax</titleref>, IETF RFC 2396, August 1998. (See
				<loc href="http://www.ietf.org/rfc/rfc2396.txt">http://www.ietf.org/rfc/rfc2396.txt</loc>.)
				</bibl>
			 <bibl id="rfc2732" key="RFC 2732">R. Hinden, B. Carpenter, L.
				Masinter, <titleref href="http://www.ietf.org/rfc/rfc2732.txt">Format for
				Literal IPv6 Addresses in URL's</titleref>, IETF RFC 2732, 1999. (See
				<loc href="http://www.ietf.org/rfc/rfc2732.txt">http://www.ietf.org/rfc/rfc2732.txt</loc>.)
				</bibl>
			 <bibl id="unicode" key="Unicode">The Unicode Consortium,
				<titleref>The Unicode Standard, Version 3</titleref>, ISBN 0-201-61633-5,
				as updated from time to time by the publication of new versions. (See
				<loc href="http://www.unicode.org/unicode/standard/versions/">http://www.unicode.org/unicode/standard/versions</loc>
				for the latest version and additional information on versions of the standard
				and of the Unicode Character Database).</bibl>
			 <bibl id="unicode30" key="Unicode  3.0">The Unicode Consortium,
				<titleref>The Unicode Standard, Version 3.0</titleref>, ISBN 0-201-61633-5.
				(See
				<loc href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html">http://www.unicode.org/unicode/standard/versions/Unicode3.0.html</loc>.)
				</bibl>
			 <bibl id="unicode31" key="Unicode  3.1">The Unicode Consortium,
				<titleref>The Unicode Standard, Version 3.1.0</titleref> is defined by <titleref>The Unicode Standard, Version 3.0</titleref> (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the <titleref>Unicode Standard Annex #27: Unicode 3.1</titleref> (see <loc href="http://www.unicode.org/reports/tr27/">http://www.unicode.org/reports/tr27</loc>).</bibl><bibl id="unicode32" key="Unicode  3.2">The Unicode Consortium,
				<titleref>The Unicode Standard, Version 3.2.0</titleref> is defined by <titleref>The Unicode Standard, Version 3.0</titleref> (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the <titleref>Unicode Standard Annex #27: Unicode 3.1</titleref> (see <loc href="http://www.unicode.org/reports/tr27/">http://www.unicode.org/reports/tr27</loc>) and by the <titleref>Unicode Standard Annex #28: Unicode 3.2</titleref> (see <loc href="http://www.unicode.org/reports/tr28/">http://www.unicode.org/reports/tr28</loc>).</bibl><bibl id="UTR15" key="UTR #15">Mark Davis, Martin Dürst,
				<titleref href="http://www.unicode.org/unicode/reports/tr15/">Unicode
				Normalization Forms,</titleref> Unicode Standard Annex #15. (See
				<loc href="http://www.unicode.org/unicode/reports/tr15/">http://www.unicode.org/unicode/reports/tr15</loc>
				for the latest version).</bibl></blist>
		</div2>
		<div2 id="sec-OtherReferences"><head>Other References</head><blist>
			 <bibl id="CharReq" key="CharReq">Martin J. Dürst,
				<titleref href="http://www.w3.org/TR/WD-charreq">Requirements for String Identity Matching and String Indexing</titleref>, W3C Working
				Draft. (See
				<loc href="http://www.w3.org/TR/WD-charreq">http://www.w3.org/TR/WD-charreq</loc>.)
				</bibl>
			 <bibl id="connolly" key="Connolly">D. Connolly,
				<titleref href="http://www.w3.org/MarkUp/html-spec/charset-harmful">Character
				Set Considered Harmful</titleref>, W3C Note. (See
				<loc href="http://www.w3.org/MarkUp/html-spec/charset-harmful">http://www.w3.org/MarkUp/html-spec/charset-harmful</loc>.)</bibl>

			 <bibl id="css2" key="CSS2">Bert Bos, Håkon Wium Lie, Chris Lilley,
				Ian Jacobs, Eds., <titleref href="http://www.w3.org/TR/REC-CSS2/">Cascading
				Style Sheets, level 2</titleref> (CSS2 Specification), W3C Recommendation. (See
				<xspecref href="http://www.w3.org/TR/REC-CSS2/">http://www.w3.org/TR/REC-CSS2</xspecref>.)
				</bibl>
			 <bibl id="dom1" key="DOM Level 1">Vidur Apparao et al.,
				<titleref href="http://www.w3.org/TR/REC-DOM-Level-1/">Document Object Model
				(DOM) Level 1 Specification</titleref>, W3C Recommendation. (See
				<xspecref href="http://www.w3.org/TR/REC-DOM-Level-1/">http://www.w3.org/TR/REC-DOM-Level-1</xspecref>.)
				</bibl>

<bibl id="dom3ls" key="DOM3 LS">Ben Chang, Jeroen van Rotterdam, Johnny Stenback, Andy Heninger, Joe Kesselman, Rezaur Rahman Eds., <titleref href="http://www.w3.org/TR/DOM-Level-3-ASLS/">Document Object Model (DOM) Level 3 Abstract Schemas and Load and Save Specification</titleref>, W3C Working Draft. (See <xspecref href="http://www.w3.org/TR/DOM-Level-3-ASLS/">http://www.w3.org/TR/DOM-Level-3-ASLS</xspecref>.)</bibl>

			 <bibl id="html40" key="HTML 4.0">Dave Raggett, Arnaud Le Hors, Ian
				Jacobs, Eds., <titleref href="http://www.w3.org/TR/REC-html40-971218/">HTML 4.0
				Specification</titleref>, W3C Recommendation, 18-Dec-1997. (See
				<xspecref href="http://www.w3.org/TR/REC-html40-971218/">http://www.w3.org/TR/REC-html40-971218</xspecref>.)</bibl>

			 <bibl id="html401" key="HTML 4.01">Dave Raggett, Arnaud Le Hors, Ian
				Jacobs, Eds., <titleref href="http://www.w3.org/TR/html401/">HTML 4.01
				Specification</titleref>, W3C Recommendation. (See
				<xspecref href="http://www.w3.org/TR/html401/">http://www.w3.org/TR/html401</xspecref>.)
				</bibl>
			 <bibl id="uri-i18n" key="I-D IRI">Martin Dürst, Michel Suignard,
				<titleref href="http://www.w3.org/International/2002/draft-duerst-iri-00.txt">Internationalized
				Resource Identifiers (IRI)</titleref>, Internet-Draft, April 2002. (See
				<loc href="http://www.w3.org/International/2002/draft-duerst-iri-00.txt">http://www.w3.org/International/2002/draft-duerst-iri-00.txt</loc>.)</bibl>
			 <bibl id="i18nuri" key="Info URI-I18N"><titleref href="http://www.w3.org/International/O-URL-and-ident">Internationalization:
				URIs and other identifiers</titleref>. (See
				<loc href="http://www.w3.org/International/O-URL-and-ident">http://www.w3.org/International/O-URL-and-ident</loc>.)
				</bibl>
			 <bibl id="iso14651" key="ISO/IEC 14651">ISO/IEC 14651:2000,
				<loc href="http://www.iso.ch/">Information
				  technology -- International string
ordering and comparison -- Method for comparing character strings and
description of the common template tailorable ordering</loc> as, from time
to time, amended, replaced by a new edition or expanded by the addition
of new parts. (See <loc href="http://www.iso.ch/">http://www.iso.ch</loc> for the latest version.)</bibl><bibl id="iso9541" key="ISO/IEC 9541-1">ISO/IEC 9541-1:1991,
				<loc href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=17277">Information
				  technology -- Font information interchange -- Part 1: Architecture</loc>. (See
				<loc href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=17277">http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=17277</loc>
				for the latest version.) </bibl>
			 <bibl id="mathml2" key="MathML2">David Carlisle, Patrick Ion, Robert
				Miner, Nico Poppelier, Eds., <titleref href="http://www.w3.org/TR/MathML2/">Mathematical Markup Language (MathML)
				Version 2.0</titleref>, W3C Recommendation. (See
				<loc href="http://www.w3.org/TR/MathML2/">http://www.w3.org/TR/MathML2</loc>.)
				</bibl>
			 <bibl id="Nicol" key="Nicol">Gavin Nicol,
				<titleref href="http://www.mind-to-mind.com/i18n/articles/multilingual/multilingual-www.html">The
				Multilingual World Wide Web</titleref>, Chapter 2: The WWW As A Multilingual
				Application. (See
				<loc href="http://www.mind-to-mind.com/i18n/articles/multilingual/multilingual-www.html">http://www.mind-to-mind.com/i18n/articles/multilingual/multilingual-www.html</loc>.)
				</bibl>
			 <bibl id="rfc2070" key="RFC 2070">F. Yergeau, G. Nicol, G. Adams, M.
				Dürst, <titleref href="http://www.ietf.org/rfc/rfc2070.txt">Internationalization of the
				Hypertext Markup Language</titleref>, IETF RFC 2070, January 1997. (See
				<loc href="http://www.ietf.org/rfc/rfc2070.txt">http://www.ietf.org/rfc/rfc2070.txt</loc>.)</bibl>

			 <bibl id="rfc2277" key="RFC 2277">H. Alvestrand,
				<titleref href="http://www.ietf.org/rfc/rfc2277.txt">IETF Policy on Character
				Sets and Languages</titleref>, IETF RFC 2277, BCP 18, January 1998. (See
				<loc href="http://www.ietf.org/rfc/rfc2277.txt">http://www.ietf.org/rfc/rfc2277.txt</loc>.)
				</bibl>
			 <bibl id="rfc2279" key="RFC 2279">F. Yergeau,
				<titleref href="http://www.ietf.org/rfc/rfc2279.txt">UTF-8, a transformation
				format of ISO 10646</titleref>, IETF RFC 2279, January 1998. (See
				<loc href="http://www.ietf.org/rfc/rfc2279.txt">http://www.ietf.org/rfc/rfc2279.txt</loc>.)
				</bibl>
			 <bibl key="RFC 2718" id="rfc2718">L. Masinter, H. Alvestrand, D.
				Zigmond, R. Petke, <titleref href="http://www.ietf.org/rfc/rfc2718.txt">Guidelines for new URL
				Schemes</titleref>, IETF RFC 2718, November 1999. (See
				<loc href="http://www.ietf.org/rfc/rfc2718.txt">http://www.ietf.org/rfc/rfc2718.txt</loc>.)</bibl>

			 <bibl id="rfc2781" key="RFC 2781">P. Hoffman, F. Yergeau,
				<titleref href="http://www.ietf.org/rfc/rfc2781.txt">UTF-16, an encoding of ISO
				10646</titleref>, IETF RFC 2781, February 2000. (See
				<loc href="http://www.ietf.org/rfc/rfc2781.txt">http://www.ietf.org/rfc/rfc2781.txt</loc>.)</bibl>

			 <bibl id="spread" key="SPREAD"><titleref href="http://www.ascc.net/xml/resource/entities/index.html">SPREAD -
				Standardization Project for East Asian Documents Universal Public Entity
				Set</titleref>. (See
				<loc href="http://www.ascc.net/xml/resource/entities/index.html">http://www.ascc.net/xml/resource/entities/index.html</loc>.)
				</bibl>
			 <bibl id="svg" key="SVG">Jon Ferraiolo, Ed.,
				<titleref href="http://www.w3.org/TR/SVG/">Scalable Vector Graphics (SVG) 1.0
				Specification</titleref>, W3C Recommendation. (See
				<loc href="http://www.w3.org/TR/SVG/">http://www.w3.org/TR/SVG</loc>.) </bibl>
			 <bibl id="UTR10" key="UTR #10">Mark Davis,
				Ken Whistler, <titleref href="http://www.unicode.org/unicode/reports/tr10/">Unicode Collation Algorithm</titleref>, Unicode Technical Report #10. (See
				<loc href="http://www.unicode.org/unicode/reports/tr10/">http://www.unicode.org/unicode/reports/tr10</loc>.)
				</bibl><bibl id="UTR17" key="UTR #17">Ken Whistler, Mark Davis,
				<titleref href="http://www.unicode.org/unicode/reports/tr17/">Character
				Encoding Model</titleref>, Unicode Technical Report #17. (See
				<loc href="http://www.unicode.org/unicode/reports/tr17/">http://www.unicode.org/unicode/reports/tr17</loc>.)
				</bibl>
			 <bibl id="UXML" key="UXML">Martin Dürst and Asmus Freytag,
				<titleref href="http://www.w3.org/TR/unicode-xml/">Unicode in XML and other
				Markup Languages</titleref>, Unicode Technical Report #20 and W3C Note. (See
				<xspecref href="http://www.w3.org/TR/unicode-xml/">http://www.w3.org/TR/unicode-xml</xspecref>.)</bibl>

			 <bibl id="xlink" key="XLink">Steve DeRose, Eve Maler, David Orchard,
				Eds, <titleref href="http://www.w3.org/TR/xlink/">XML Linking Language (XLink)
				Version 1.0</titleref>, W3C Recommendation. (See
				<loc href="http://www.w3.org/TR/xlink/">http://www.w3.org/TR/xlink</loc>.) </bibl>
			 <bibl id="xml10" key="XML 1.0">Tim Bray, Jean Paoli, C. M.
				Sperberg-McQueen, Eve Maler, Eds.,
				<titleref href="http://www.w3.org/TR/REC-xml">Extensible Markup Language (XML)
				1.0</titleref>, W3C Recommendation. (See
				<xspecref href="http://www.w3.org/TR/REC-xml">http://www.w3.org/TR/REC-xml</xspecref>.)
				</bibl>
			 <bibl key="XML Schema-2" id="xmlschema-2">Paul V. Biron, Ashok
				Malhotra, Eds., <titleref href="http://www.w3.org/TR/xmlschema-2/">XML Schema
				Part 2: Datatypes</titleref>, W3C Recommendation. (See
				<xspecref href="http://www.w3.org/TR/xmlschema-2/">http://www.w3.org/TR/xmlschema-2</xspecref>.)</bibl>

			 <bibl id="XML_Japanese_profile" key="XML Japanese Profile">MURATA
				Makoto Ed., <titleref href="http://www.w3.org/TR/japanese-xml/">XML Japanese
				Profile</titleref>, W3C Note. (See
				<loc href="http://www.w3.org/TR/japanese-xml/">http://www.w3.org/TR/japanese-xml</loc>.)
				</bibl>
			 <bibl id="xpath" key="XPath">James Clark, Steve DeRose, Eds,
				<titleref href="http://www.w3.org/TR/xpath">XML Path Language (XPath) Version
				1.0</titleref>, W3C Recommendation. (See
				<xspecref href="http://www.w3.org/TR/xpath">http://www.w3.org/TR/xpath</xspecref>.)</bibl>

			 <bibl id="xquery-operators" key="XQuery Operators">Ashok Malhotra, Jim Melton, Jonathan Robie, Norman Walsh, Eds, <titleref href="http://www.w3.org/TR/xquery-operators/">XQuery 1.0 and XPath 2.0 Functions and Operators</titleref>, W3C Working Draft. (See <xspecref href="http://www.w3.org/TR/xquery-operators/">http://www.w3.org/TR/xquery-operators</xspecref>.)
				</bibl>

<bibl id="xslt" key="XSLT">James Clark Ed., <titleref href="http://www.w3.org/TR/xslt">XSL Transformations (XSLT)</titleref>, W3C Recommendation. (See <xspecref href="http://www.w3.org/TR/xslt">http://www.w3.org/TR/xslt</xspecref>.)</bibl>

</blist>
		</div2>
	 </div1>

	 <inform-div1 id="sec-CharExamples"><head>Examples of Characters, Keystrokes and
		  Glyphs</head>
		<p id="exampleA6">A few examples will help make sense of all this complexity
		  of text in computers (which is mostly a reflection of the complexity of human
		  writing systems). Let us start with a very simple example: a user, equipped
		  with a US-English keyboard, types <quote>Foo</quote>, which the computer
		  encodes as 16-bit values (the UTF-16 encoding of Unicode) and displays on the
		  screen.</p>

<figure>
<table border="1" cellpadding="5" cellspacing="0" summary="Table showing keystrokes, input characters, encoded characters and display for user typing Foo on a U.S. keyboard">
<tbody>
 <tr><th align="right">Keystrokes</th><td align="center">Shift-f</td><td align="center">o</td><td align="center">o</td></tr><tr><th align="right">Input characters</th><td align="center">F</td><td align="center">o</td><td align="center">o</td></tr><tr><th align="right">Encoded characters (byte values
				  in hex)</th><td align="center">0046</td><td align="center">006F</td><td align="center">006F</td></tr><tr><th align="right">Display</th><td colspan="3" align="center">Foo</td></tr></tbody></table>
		<caption>Example: Basic Latin</caption></figure>
		<p>The only complexity here is the use of a modifier (Shift) to input the
		  capital <qchar>F</qchar>.</p>
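		<example><p>The encoded values in the table can be reproduced with a small sketch in Python. The helper name <code>utf16_code_units</code> is hypothetical; it encodes the text as big-endian UTF-16 and reads the result back as 16-bit code units.</p>
<eg xml:space="preserve"><![CDATA[
```python
import struct


def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of text as four-digit hex strings."""
    data = text.encode("utf-16-be")
    # Each code unit occupies two bytes; unpack them as unsigned 16-bit values.
    units = struct.unpack(f">{len(data) // 2}H", data)
    return [f"{unit:04X}" for unit in units]


foo = utf16_code_units("Foo")
print(foo)
```
]]></eg>
		<p>Each of the three characters occupies exactly one 16-bit code unit.</p></example>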
		<p>A slightly more complex example is a user typing <qchar>çé</qchar> on
		  a traditional French-Canadian keyboard, which the computer again encodes in
		  UTF-16 and displays. We assume that this particular computer stores
		  characters in fully composed (precomposed) form.</p>

<figure>
<table border="1" cellpadding="5" cellspacing="0" summary="Table showing keystrokes, input characters, encoded characters and display for user typing çé on a French-Canadian keyboard">
<tbody><tr><th align="right">Keystrokes</th><td align="center">
				  ¸ </td><td align="center">c</td><td align="center">é</td></tr><tr><th align="right">Input characters</th><td colspan="2" align="center">ç</td><td align="center">é</td></tr><tr><th align="right">Encoded characters (byte values
				  in hex)</th><td colspan="2" align="center">00E7</td><td align="center">00E9</td></tr><tr><th align="right">Display</th><td colspan="3" align="center">çé</td></tr></tbody></table>
		<caption>Example: Latin with diacritics</caption></figure>
		<p>A few interesting things are happening here: when the user types the
		  cedilla (<qchar>¸</qchar>), nothing happens except for a change of state of the
		  keyboard driver; the cedilla is a <term>dead key</term>. When the driver gets
		  the c keystroke, it provides a complete <qchar>ç</qchar> character to the
		  system, which represents it as a single 16-bit <termref def="def-CEF">code unit</termref> and displays a
		  <qchar>ç</qchar> <termref def="def-glyph">glyph</termref>. The user then presses the dedicated <qchar>é</qchar>
		  key, which results in, again, a character represented by two bytes. Most
		  systems will display this as one glyph, but it is also possible to combine two
		  glyphs (the base letter and the accent) to obtain the same rendering.</p>
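		<example><p>The difference between the composed representation assumed here and a decomposed one can be sketched in Python with the standard <code>unicodedata</code> module. Reading "fully composed" as Normalization Form C is an assumption of this sketch, and the helper name <code>code_points</code> is hypothetical.</p>
<eg xml:space="preserve"><![CDATA[
```python
import unicodedata


def code_points(text: str) -> list:
    """Return the code points of text as four-digit hex strings."""
    return [f"{ord(char):04X}" for char in text]


typed = "\u00e7\u00e9"  # c-cedilla followed by e-acute

composed = code_points(unicodedata.normalize("NFC", typed))
decomposed = code_points(unicodedata.normalize("NFD", typed))

print(composed)
print(decomposed)
```
]]></eg>
		<p>In composed form each letter is one character, as in the table above. In decomposed form each letter becomes a base letter followed by a combining mark (U+0327 COMBINING CEDILLA and U+0301 COMBINING ACUTE ACCENT), which parallels the option of rendering with two glyphs.</p></example>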
		<p>On to a Japanese example: our user employs a <term>romaji input
		  method</term> to type "<image><graphic source="images/nihongo.gif" width="47" height="16"/><alt>nihongo in Kanji characters</alt></image>", which the
		  computer encodes in UTF-16 and displays.</p>

<figure>
<table border="1" cellpadding="5" cellspacing="0" summary="Table showing keystrokes, input characters, encoded characters and display for user typing nihongo in a Japanese Romaji input method">
<tbody><tr><th align="right">Keystrokes</th><td align="center" colspan="4"> n i h o n g o &lt;space&gt;
				  &lt;return&gt;</td></tr><tr><th align="right">Input characters</th><td colspan="2" align="center"><image><graphic source="images/ni.gif" width="14" height="16"/> <alt>kana character ni</alt></image></td><td align="center"><image><graphic source="images/hon.gif" width="15" height="16"/>
				  <alt>kana character hon</alt></image></td><td align="center"><image><graphic source="images/go.gif" width="16" height="16"/><alt>kana character go</alt></image></td></tr><tr><th align="right">Encoded characters (byte values
				  in hex)</th><td colspan="2" align="center">65E5</td><td align="center">672C</td><td align="center">8A9E</td></tr><tr><th align="right">Display</th><td colspan="4" align="center"><image><graphic source="images/nihongo.gif" width="47" height="16"/><alt>nihongo in kanji characters</alt></image></td></tr></tbody></table>
		<caption>Example: Japanese</caption></figure>
		<p>The interesting aspect here is input: the user types Latin characters,
		  which are converted on the fly to kana (not shown here), and then to kanji when
		  the user requests conversion by pressing &lt;space&gt;; the kanji characters
		  are finally sent to the application when the user presses &lt;return&gt;. The
		  user has to type a total of nine keystrokes before the three characters are
		  produced, which are then encoded and displayed rather trivially.</p>
		<p>An Arabic example will show different phenomena:</p>

<figure>
<table border="1" cellpadding="5" cellspacing="0" summary="Table showing keystrokes, input characters, encoded characters and display for user typing on an Arabic keyboard">
<tbody><tr><th align="right">Keystrokes</th><td align="center" colspan="2"><image><graphic source="images/lam.gif" width="15" height="25"/><alt>Arabic lam</alt></image>
				  </td><td align="center"><image><graphic source="images/alif.gif" width="11" height="25"/><alt>Arabic alef</alt></image></td><td align="center" colspan="2"><image><graphic source="images/lamalif.gif" width="15" height="25"/> <alt>Arabic lam-alef</alt></image></td><td align="center"><image><graphic source="images/ghayn.gif" width="18" height="25"/><alt>Arabic ghayn</alt></image></td><td align="center"><image><graphic source="images/ghayn.gif" width="18" height="25"/> <alt>Arabic ghayn</alt></image></td></tr><tr><th align="right">Input characters</th><td colspan="2" align="center"><image><graphic source="images/lam.gif" width="15" height="25"/>
				  <alt>Arabic lam</alt></image></td><td align="center"><image><graphic source="images/alif.gif" width="11" height="25"/> <alt>Arabic alef</alt></image></td><td align="center"><image><graphic source="images/lam.gif" width="15" height="25"/> <alt>Arabic lam</alt></image></td><td align="center"><image><graphic source="images/alif.gif" width="11" height="25"/><alt>Arabic alef</alt></image></td><td align="center"><image><graphic source="images/ghayn.gif" width="18" height="25"/> <alt>Arabic ghayn</alt></image></td><td align="center"><image><graphic source="images/ghayn.gif" width="18" height="25"/> <alt>Arabic ghayn</alt></image></td></tr><tr><th align="right">Encoded characters (byte
				  values in hex)</th><td colspan="2" align="center">0644</td><td align="center">0627</td><td align="center">0644</td><td align="center">0627</td><td align="center">0639</td><td align="center">0639</td></tr><tr><th align="right">Display</th><td colspan="7" align="center"><image><graphic source="images/arabe.gif" width="42" height="26"/> <alt>A few Arabic letters.</alt></image></td></tr></tbody></table>
		<caption>Example: Arabic</caption></figure>
		<p>Here the first two keystrokes each produce an input character and an
		  encoded character, but the pair is displayed as a single glyph
		  ('<image><graphic source="images/lamalif.gif" width="15" height="25"/>
		  <alt>Arabic lam-alef</alt></image>', a lam-alef ligature). The next keystroke
		  is a lam-alef key, which some Arabic keyboards have; it produces the same two
		  characters, which are again displayed as a ligature, but this second lam-alef is
		  placed to the <emph>left</emph> of the first one when displayed. The last two
		  keystrokes produce two identical characters, which are rendered by two different
		  glyphs (a medial form followed to its left by a final form). We thus have 5 keystrokes
		  producing 6 characters and 4 glyphs laid out right-to-left.</p>
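		<p>The keystroke-to-character step of this example can be sketched in Python as
		  follows. This is only an illustration: the <code>LAYOUT</code> table and its key
		  labels are hypothetical and do not describe any real keyboard driver; the code
		  point values are those given in the table above.</p>

```python
# Hypothetical keyboard layout: each key label maps to the characters it inserts.
LAYOUT = {
    "lam": "\u0644",
    "alef": "\u0627",
    "lam-alef": "\u0644\u0627",  # one key, two characters
    "key for U+0639": "\u0639",
}

# The five keystrokes of the example, in typing order.
keystrokes = ["lam", "alef", "lam-alef", "key for U+0639", "key for U+0639"]

# Concatenate what each key inserts to get the encoded text.
text = "".join(LAYOUT[key] for key in keystrokes)

print(len(keystrokes), "keystrokes,", len(text), "characters")
print(" ".join(f"{ord(ch):04X}" for ch in text))
```

		<p>The count of glyphs is not derivable from this step; it depends on the shaping
		  performed at display time.</p>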
		<p id="sec-CharExamplesA5">A final example in Tamil, typed with an ISCII
		  keyboard, will illustrate some additional phenomena:</p>

<figure>
<table border="1" cellpadding="5" cellspacing="0" summary="Table showing keystrokes, input characters, encoded characters and display for user typing on a Tamil ISCII keyboard">
<tbody><tr><th align="right">Keystrokes</th><td align="center" colspan="2"><image><graphic source="images/ta-tm.gif" width="17" height="18"/><alt>Tamil ta</alt></image>
				  </td><td align="center"><image><graphic source="images/a-tm.gif" width="17" height="17"/> <alt>Tamil aa</alt></image></td><td align="center"><image><graphic source="images/na-tm.gif" width="18" height="18"/><alt>Tamil na</alt></image></td><td align="center"><image><graphic source="images/virama-tm.gif" width="10" height="19"/><alt>Tamil virama</alt></image></td><td align="center"><image><graphic source="images/ka-tm.gif" width="15" height="19"/><alt>Tamil ka</alt></image></td><td align="center"><image><graphic source="images/o-tm.gif" width="24" height="19"/><alt>Tamil o</alt></image></td></tr><tr><th align="right">Input characters</th><td colspan="2" align="center"><image><graphic source="images/ta-tm.gif" width="17" height="18"/> <alt>Tamil ta</alt></image></td><td align="center"><image><graphic source="images/a-tm.gif" width="17" height="17"/> <alt>Tamil aa</alt></image></td><td align="center"><image><graphic source="images/na-tm.gif" width="18" height="18"/><alt>Tamil na</alt></image></td><td align="center"><image><graphic source="images/virama-tm.gif" width="10" height="19"/><alt>Tamil virama</alt></image></td><td align="center"><image><graphic source="images/ka-tm.gif" width="15" height="19"/> <alt>Tamil ka</alt></image></td><td align="center"><image><graphic source="images/o-tm.gif" width="24" height="19"/><alt>Tamil o</alt></image></td></tr><tr><th align="right">Encoded characters (byte values
				  in hex)</th><td colspan="2" align="center">0B9F</td><td align="center">0BBE</td><td align="center">0B99</td><td align="center">0BCD</td><td align="center">0B95</td><td align="center">0BCB</td></tr><tr><th align="right">Display</th><td colspan="7" align="center"><image><graphic source="images/tango.gif" width="77" height="23"/> <alt>Tango in Tamil letters.</alt></image></td></tr></tbody></table>
		<caption>Example: Tamil</caption></figure>
		<p>Here input is straightforward, but note that, unlike in the preceding
		  accented Latin example, the diacritic '<image><graphic source="images/virama-tm.gif" width="10" height="19"/> <alt>Tamil virama</alt></image>' (<term>virama</term>, vowel killer) is entered
		  <emph>after</emph> the '<image><graphic source="images/na-tm.gif" width="18" height="18"/><alt>Tamil na</alt></image>' to which it applies. Rendering is
		  interesting for the last two characters. The last one ('<image><graphic source="images/o-tm.gif" width="24" height="19"/><alt>Tamil o</alt></image>')
		  clearly consists of two glyphs which <emph>surround</emph> the glyph of the
		  next to last character ('<image><graphic source="images/ka-tm.gif" width="15" height="19"/> <alt>Tamil ka</alt></image>').</p>
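		<p>The stored order of these characters can be inspected with the
		  <code>unicodedata</code> module of the Python standard library. The following is
		  a minimal sketch, assuming a Python 3 interpreter whose Unicode data covers these
		  characters; the variable name <code>tango</code> is chosen here for
		  illustration.</p>

```python
import unicodedata

# Encoded characters from the table above, in stored (logical) order.
tango = "\u0B9F\u0BBE\u0B99\u0BCD\u0B95\u0BCB"

for ch in tango:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        "ccc =", unicodedata.combining(ch),
        "decomposition =", unicodedata.decomposition(ch) or "none",
    )
```

		<p>The virama follows its consonant in the stored sequence, and the last character
		  has a canonical decomposition into two vowel signs, which corresponds to the
		  two-part glyph that surrounds the consonant on display.</p>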



	 </inform-div1>


   <inform-div1 id="sec-ComposingChars"><head>Composing Characters</head>

    <p>As specified in <specref ref="sec-FullyNormalized"/>, a composing character is any character
that is
<olist>
 <item><p>the second character in the canonical decomposition mapping of some primary 
  composite (as defined in <loc href="http://www.unicode.org/unicode/reports/tr15/#D3">D3</loc> of
  <bibref ref="UTR15"/>), or</p></item>
 <item><p>of non-zero canonical combining class (as defined in <bibref ref="unicode"/>).</p></item>
</olist>
These two categories overlap heavily but not completely.
The first category includes a few class-zero
     characters that <emph>do compose</emph> with a previous character in <termref def="sec-ChoiceNFC">NFC</termref>; this is the case for
     some vowel and length marks in Brahmi-derived scripts, as well as for the modern non-initial
     conjoining jamos of the Korean Hangul script.
The second category includes some combining characters that <emph>do not compose</emph> in NFC,
for the simple reason that there is no precomposed character involving them. They must nevertheless be
taken into account as composing characters because their presence may make reordering of combining 
marks necessary, to maintain normalization under concatenation or deletion.
Therefore, composing characters as defined in <specref ref="sec-FullyNormalized"/>
     include all characters of non-zero canonical combining class plus
     the following (as of Unicode 3.2):</p>

<figure>
 <table cellpadding="5" cellspacing="5" summary="Table of all composing but not combining characters">
  <thead>
   <tr>
    <th id="no">Unicode number</th><th id="char">Character</th><th id="name">Name</th>
   </tr>
  </thead>
  <tbody>
   <tr><th id="brahmi" colspan="3" align="left"><emph>Brahmi-derived scripts</emph></th></tr>
   <tr>
    <td headers="no brahmi">09BE</td><td headers="char brahmi"> া</td><td headers="name brahmi"><uname>BENGALI VOWEL SIGN AA</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">09D7</td><td headers="char brahmi"> ৗ</td><td headers="name brahmi"><uname>BENGALI AU LENGTH MARK</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0B3E</td><td headers="char brahmi"> ା</td><td headers="name brahmi"><uname>ORIYA VOWEL SIGN AA</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0B56</td><td headers="char brahmi"> ୖ</td><td headers="name brahmi"><uname>ORIYA AI LENGTH MARK</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0B57</td><td headers="char brahmi"> ୗ</td><td headers="name brahmi"><uname>ORIYA AU LENGTH MARK</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0BBE</td><td headers="char brahmi"> ா</td><td headers="name brahmi"><uname>TAMIL VOWEL SIGN AA</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0BD7</td><td headers="char brahmi"> ௗ</td><td headers="name brahmi"><uname>TAMIL AU LENGTH MARK</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0CC2</td><td headers="char brahmi"> ೂ</td><td headers="name brahmi"><uname>KANNADA VOWEL SIGN UU</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0CD5</td><td headers="char brahmi"> ೕ</td><td headers="name brahmi"><uname>KANNADA LENGTH MARK</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0CD6</td><td headers="char brahmi"> ೖ</td><td headers="name brahmi"><uname>KANNADA AI LENGTH MARK</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0D3E</td><td headers="char brahmi"> ാ</td><td headers="name brahmi"><uname>MALAYALAM VOWEL SIGN AA</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0D57</td><td headers="char brahmi"> ൗ</td><td headers="name brahmi"><uname>MALAYALAM AU LENGTH MARK</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0DCF</td><td headers="char brahmi"> ා</td><td headers="name brahmi"><uname>SINHALA VOWEL SIGN AELA-PILLA</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0DDF</td><td headers="char brahmi"> ෟ</td><td headers="name brahmi"><uname>SINHALA VOWEL SIGN GAYANUKITTA</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0FB5</td><td headers="char brahmi"> ྵ</td><td headers="name brahmi"><uname>TIBETAN SUBJOINED LETTER SSA</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">0FB7</td><td headers="char brahmi"> ྷ</td><td headers="name brahmi"><uname>TIBETAN SUBJOINED LETTER HA</uname></td>
   </tr>
   <tr>
    <td headers="no brahmi">102E</td><td headers="char brahmi"> ီ</td><td headers="name brahmi"><uname>MYANMAR VOWEL SIGN II</uname></td>
   </tr>
   <tr><th id="jung" colspan="3" align="left"><emph>Hangul vowels</emph></th></tr>
   <tr>
    <td headers="no jung">1161</td><td headers="char jung"> ᅡ</td><td headers="name jung"><uname>HANGUL JUNGSEONG A</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1162</td><td headers="char jung"> ᅢ</td><td headers="name jung"><uname>HANGUL JUNGSEONG AE</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1163</td><td headers="char jung"> ᅣ</td><td headers="name jung"><uname>HANGUL JUNGSEONG YA</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1164</td><td headers="char jung"> ᅤ</td><td headers="name jung"><uname>HANGUL JUNGSEONG YAE</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1165</td><td headers="char jung"> ᅥ</td><td headers="name jung"><uname>HANGUL JUNGSEONG EO</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1166</td><td headers="char jung"> ᅦ</td><td headers="name jung"><uname>HANGUL JUNGSEONG E</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1167</td><td headers="char jung"> ᅧ</td><td headers="name jung"><uname>HANGUL JUNGSEONG YEO</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1168</td><td headers="char jung"> ᅨ</td><td headers="name jung"><uname>HANGUL JUNGSEONG YE</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1169</td><td headers="char jung"> ᅩ</td><td headers="name jung"><uname>HANGUL JUNGSEONG O</uname></td>
   </tr>
   <tr>
    <td headers="no jung">116A</td><td headers="char jung"> ᅪ</td><td headers="name jung"><uname>HANGUL JUNGSEONG WA</uname></td>
   </tr>
   <tr>
    <td headers="no jung">116B</td><td headers="char jung"> ᅫ</td><td headers="name jung"><uname>HANGUL JUNGSEONG WAE</uname></td>
   </tr>
   <tr>
    <td headers="no jung">116C</td><td headers="char jung"> ᅬ</td><td headers="name jung"><uname>HANGUL JUNGSEONG OE</uname></td>
   </tr>
   <tr>
    <td headers="no jung">116D</td><td headers="char jung"> ᅭ</td><td headers="name jung"><uname>HANGUL JUNGSEONG YO</uname></td>
   </tr>
   <tr>
    <td headers="no jung">116E</td><td headers="char jung"> ᅮ</td><td headers="name jung"><uname>HANGUL JUNGSEONG U</uname></td>
   </tr>
   <tr>
    <td headers="no jung">116F</td><td headers="char jung"> ᅯ</td><td headers="name jung"><uname>HANGUL JUNGSEONG WEO</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1170</td><td headers="char jung"> ᅰ</td><td headers="name jung"><uname>HANGUL JUNGSEONG WE</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1171</td><td headers="char jung"> ᅱ</td><td headers="name jung"><uname>HANGUL JUNGSEONG WI</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1172</td><td headers="char jung"> ᅲ</td><td headers="name jung"><uname>HANGUL JUNGSEONG YU</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1173</td><td headers="char jung"> ᅳ</td><td headers="name jung"><uname>HANGUL JUNGSEONG EU</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1174</td><td headers="char jung"> ᅴ</td><td headers="name jung"><uname>HANGUL JUNGSEONG YI</uname></td>
   </tr>
   <tr>
    <td headers="no jung">1175</td><td headers="char jung"> ᅵ</td><td headers="name jung"><uname>HANGUL JUNGSEONG I</uname></td>
   </tr>
   <tr><th id="jong" colspan="3" align="left"><emph>Hangul trailing consonants</emph></th></tr>
   <tr>
    <td headers="no jong">11A8</td><td headers="char jong"> ᆨ</td><td headers="name jong"><uname>HANGUL JONGSEONG KIYEOK</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11A9</td><td headers="char jong"> ᆩ</td><td headers="name jong"><uname>HANGUL JONGSEONG SSANGKIYEOK</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11AA</td><td headers="char jong"> ᆪ</td><td headers="name jong"><uname>HANGUL JONGSEONG KIYEOK-SIOS</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11AB</td><td headers="char jong"> ᆫ</td><td headers="name jong"><uname>HANGUL JONGSEONG NIEUN</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11AC</td><td headers="char jong"> ᆬ</td><td headers="name jong"><uname>HANGUL JONGSEONG NIEUN-CIEUC</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11AD</td><td headers="char jong"> ᆭ</td><td headers="name jong"><uname>HANGUL JONGSEONG NIEUN-HIEUH</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11AE</td><td headers="char jong"> ᆮ</td><td headers="name jong"><uname>HANGUL JONGSEONG TIKEUT</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11AF</td><td headers="char jong"> ᆯ</td><td headers="name jong"><uname>HANGUL JONGSEONG RIEUL</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11B0</td><td headers="char jong"> ᆰ</td><td headers="name jong"><uname>HANGUL JONGSEONG RIEUL-KIYEOK</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11B1</td><td headers="char jong"> ᆱ</td><td headers="name jong"><uname>HANGUL JONGSEONG RIEUL-MIEUM</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11B2</td><td headers="char jong"> ᆲ</td><td headers="name jong"><uname>HANGUL JONGSEONG RIEUL-PIEUP</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11B3</td><td headers="char jong"> ᆳ</td><td headers="name jong"><uname>HANGUL JONGSEONG RIEUL-SIOS</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11B4</td><td headers="char jong"> ᆴ</td><td headers="name jong"><uname>HANGUL JONGSEONG RIEUL-THIEUTH</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11B5</td><td headers="char jong"> ᆵ</td><td headers="name jong"><uname>HANGUL JONGSEONG RIEUL-PHIEUPH</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11B6</td><td headers="char jong"> ᆶ</td><td headers="name jong"><uname>HANGUL JONGSEONG RIEUL-HIEUH</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11B7</td><td headers="char jong"> ᆷ</td><td headers="name jong"><uname>HANGUL JONGSEONG MIEUM</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11B8</td><td headers="char jong"> ᆸ</td><td headers="name jong"><uname>HANGUL JONGSEONG PIEUP</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11B9</td><td headers="char jong"> ᆹ</td><td headers="name jong"><uname>HANGUL JONGSEONG PIEUP-SIOS</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11BA</td><td headers="char jong"> ᆺ</td><td headers="name jong"><uname>HANGUL JONGSEONG SIOS</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11BB</td><td headers="char jong"> ᆻ</td><td headers="name jong"><uname>HANGUL JONGSEONG SSANGSIOS</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11BC</td><td headers="char jong"> ᆼ</td><td headers="name jong"><uname>HANGUL JONGSEONG IEUNG</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11BD</td><td headers="char jong"> ᆽ</td><td headers="name jong"><uname>HANGUL JONGSEONG CIEUC</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11BE</td><td headers="char jong"> ᆾ</td><td headers="name jong"><uname>HANGUL JONGSEONG CHIEUCH</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11BF</td><td headers="char jong"> ᆿ</td><td headers="name jong"><uname>HANGUL JONGSEONG KHIEUKH</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11C0</td><td headers="char jong"> ᇀ</td><td headers="name jong"><uname>HANGUL JONGSEONG THIEUTH</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11C1</td><td headers="char jong"> ᇁ</td><td headers="name jong"><uname>HANGUL JONGSEONG PHIEUPH</uname></td>
   </tr>
   <tr>
    <td headers="no jong">11C2</td><td headers="char jong"> ᇂ</td><td headers="name jong"><uname>HANGUL JONGSEONG HIEUH</uname></td>
   </tr>
  </tbody>
 </table>
</figure>

   <note>
    <p>The characters in the second column of the above table may or
    may not appear, or may appear as blank rectangles, depending on the capabilities
    of your browser and on the fonts installed in your system.</p>
   </note>
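   <p>The two-part definition given at the start of this appendix can be approximated
    with the <code>unicodedata</code> module of the Python standard library. The sketch
    below is not normative and rests on several assumptions: a composite is treated as
    primary when it has a two-character canonical decomposition mapping and is left
    unchanged by NFC; the Hangul jamo ranges are added by hand because Hangul syllables
    decompose algorithmically; and the names <code>SECOND_CHARS</code> and
    <code>is_composing</code> are introduced here for illustration. The result reflects
    the Unicode version bundled with the interpreter and need not match the table above
    entry for entry.</p>

```python
import sys
import unicodedata

# Hangul jamo that take part in algorithmic syllable composition:
# vowels U+1161..U+1175 and trailing consonants U+11A8..U+11C2.
HANGUL_COMPOSING = set(range(0x1161, 0x1176)) | set(range(0x11A8, 0x11C3))


def second_chars_of_primary_composites():
    """Collect the second character of each two-character canonical
    decomposition mapping whose composite survives NFC."""
    found = set(HANGUL_COMPOSING)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        parts = unicodedata.decomposition(ch).split()
        # Compatibility mappings begin with a tag in angle brackets, which is
        # not alphanumeric; canonical mappings consist of hex numbers only.
        if len(parts) != 2 or not parts[0].isalnum():
            continue
        # A composite that NFC changes is excluded from composition,
        # so it is not treated as a primary composite here.
        if unicodedata.normalize("NFC", ch) != ch:
            continue
        found.add(int(parts[1], 16))
    return found


SECOND_CHARS = second_chars_of_primary_composites()


def is_composing(ch):
    """True if ch has a non-zero canonical combining class or is the second
    character in the decomposition mapping of a primary composite."""
    return unicodedata.combining(ch) != 0 or ord(ch) in SECOND_CHARS


# Class-zero composing characters, i.e. the kind of character listed above.
class_zero = sorted(cp for cp in SECOND_CHARS if unicodedata.combining(chr(cp)) == 0)
print(" ".join(f"{cp:04X}" for cp in class_zero))
```

   <p>For example, <code>is_composing(chr(0x0301))</code> holds because of the
    combining class, while <code>is_composing(chr(0x0BBE))</code> holds only through
    the decomposition mappings.</p>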

   </inform-div1>


   <inform-div1 id="sec-n11n-resources"><head>Resources for Normalization</head>
    <p>The following are freely available programming resources related to normalization:</p>
    <ulist>
     <item><p>Charlint 
      (<loc href="http://www.w3.org/International/charlint/">http://www.w3.org/International/charlint/</loc>),
      written in Perl for clarity rather than efficiency; in particular, it reads in the whole Unicode
      data file before doing anything.</p></item>
     <item><p>Normalization Demo
      (<loc href="http://www.unicode.org/unicode/reports/tr15/Normalizer.html">http://www.unicode.org/unicode/reports/tr15/Normalizer.html</loc>),
      a small demo working on a subset of base and combining characters.</p></item>
     <item><p>ICU (<loc href="http://oss.software.ibm.com/icu/userguide/normalization.html">http://oss.software.ibm.com/icu/userguide/normalization.html</loc>).</p></item>
     <item><p>Unicode::Normalize (<loc href="http://homepage1.nifty.com/nomenclator/perl/Unicode-Normalize.html">http://homepage1.nifty.com/nomenclator/perl/Unicode-Normalize.html</loc>),
      a Perl module.</p></item>
    </ulist>
   </inform-div1>
   
   <inform-div1 id="sec-Acknowledgements"><head>Acknowledgements</head>
    <p>Special thanks go to Ian Jacobs for ample help with editing. Tim
      Berners-Lee and James Clark provided important details in the section on URIs.
      The W3C I18N WG and IG, as well as others, provided many comments and
      suggestions.</p>
   </inform-div1>


	 <inform-div1 id="sec-Changes"><head>Change Log</head>
		<div2 id="sec-sec-Changes"><head>Changes since
			 <loc href="http://www.w3.org/TR/2002/WD-charmod-20020220/">http://www.w3.org/TR/2002/WD-charmod-20020220</loc></head>

		  <p>Numerous additional changes listed in <loc href="http://www.w3.org/International/Group/charmod-lc/">Character Model for the World Wide Web 1.0 Last Call Comments</loc> (Members only).</p>
		  <p>Completed rework of <specref ref="sec-Normalization"/>.</p>


		</div2>

	 </inform-div1>
  </back></spec>
