<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE spec SYSTEM "xmlspec-CM.dtd">
<?xml-stylesheet type="text/xsl" href="xmlspec-CM.xsl"?>
<spec w3c-doctype="wd"><header><title id="theTitle">Character Model for the World Wide Web 1.0</title> 
	 <w3c-designation>WD</w3c-designation><w3c-doctype id="w3cDocType">W3C Working Draft</w3c-doctype> 
	 <pubdate><day>20</day><month>December</month><year>2001</year> 
	 </pubdate> 
	 <publoc><loc href="http://www.w3.org/TR/2001/WD-charmod-20011220/">http://www.w3.org/TR/2001/WD-charmod-20011220</loc>
		(available in <loc href="http://www.w3.org/TR/2001/WD-charmod-20011220/Overview.xml">XML</loc>, 
		<loc href="http://www.w3.org/TR/2001/WD-charmod-20011220/">HTML</loc>,
		and as a 
		<loc href="http://www.w3.org/TR/2001/WD-charmod-20011220/charmod.zip">Zip
		  archive</loc>) 
	 </publoc> 
	 <latestloc><loc href="http://www.w3.org/TR/charmod/">http://www.w3.org/TR/charmod</loc>
		
	 </latestloc> 
	 <prevlocs><loc href="http://www.w3.org/TR/2001/WD-charmod-20010928/">http://www.w3.org/TR/2001/WD-charmod-20010928</loc>
		
	 </prevlocs> 
	 <authlist> 
		<author><name>Martin J. Dürst</name><affiliation>W3C</affiliation><email href="mailto:duerst@w3.org">duerst@w3.org</email> 
		</author> 
		<author><name>François Yergeau</name><affiliation>Alis
			 Technologies</affiliation> 
		</author> 
		<author><name>Richard Ishida</name><affiliation>Xerox,
			 GKLS</affiliation><email href="mailto:richard.ishida@gbr.xerox.com">richard.ishida@gbr.xerox.com</email>
		  
		</author> 
		<author><name>Misha Wolf</name><affiliation>Reuters
			 Ltd.</affiliation><email href="mailto:misha.wolf@reuters.com">misha.wolf@reuters.com</email> 
		</author> 
		<author><name>Asmus Freytag</name><affiliation>ASMUS,
			 Inc.</affiliation><email href="mailto:asmus@unicode.org">asmus@unicode.org</email> 
		</author> 
		<author><name>Tex Texin</name><affiliation>Progress Software
			 Corp.</affiliation><email href="mailto:texin@progress.com">texin@progress.com</email> 
		</author> 
	 </authlist> 
	 <abstract id="abstract"> 
		<p>This Architectural Specification provides authors of specifications,
		  software developers, and content developers with a common reference for
		  interoperable text manipulation on the World Wide Web. Topics addressed include
		  encoding identification, early uniform normalization, string identity matching,
		  string indexing, and URI conventions, building on the Universal Character Set,
		  defined jointly by Unicode and ISO/IEC 10646. Some introductory material on
		  characters and character encodings is also provided.</p> 
	 </abstract> 
	 <status id="status"> 
		<p><emph>This section describes the status of this document at the time
		  of its publication. Other documents may supersede this document. The latest
		  status of this series of documents is maintained at the W3C.</emph></p> 
		<p>This is a W3C Working Draft published between the 
		  <loc href="http://www.w3.org/TR/2001/WD-charmod-20010126/">first Last
			 Call Working Draft of 26 January 2001</loc> and a planned second Last Call.
		  This interim publication is used to document the further progress made on
		  addressing the comments received during the first Last Call. A list of last
		  call comments with their status can be found in the 
		  <loc href="/International/Group/charmod-lc/">disposition of
			 comments</loc> (<loc href="http://cgi.w3.org/MemberAccess/AccessRequest">Members only</loc>).</p> 
		<p>Work is still ongoing on addressing the comments received during the
		  first Last Call. We do not encourage comments on this Working Draft; instead we
		  ask reviewers to wait for the second Last Call. We will announce the second
		  Last Call on the W3C Internationalization public mailing list (<loc href="/International/O-misc-mlists.html">www-international@w3.org</loc>; 
		  <loc href="mailto:www-international-request@w3.org?subject=subscribe">subscribe</loc>).
		  Comments from the public and from organizations outside the W3C may be sent to 
		  <loc href="mailto:www-i18n-comments@w3.org">www-i18n-comments@w3.org</loc> (<loc href="http://lists.w3.org/Archives/Public/www-i18n-comments/">archive</loc>).
		  Comments from W3C Working Groups may be sent directly to the
		  Internationalization Interest Group (w3c-i18n-ig@w3.org), with cross-posting to
		  the originating Group, to facilitate discussion and resolution.</p> 
		<p>Because of its architectural nature, this document affects not only a
		  large number of W3C Working Groups but also software developers, content
		  developers, and writers and users of specifications outside the W3C that have
		  to interface with W3C specifications.</p> 
		<p>This document is published as part of the 
		  <loc href="http://www.w3.org/International/Activity">W3C
			 Internationalization Activity</loc> by the 
		  <loc href="/International/Group/">Internationalization Working
			 Group</loc> (<loc href="http://cgi.w3.org/MemberAccess/AccessRequest">Members
			 only</loc>), with the help of the Internationalization Interest Group. The
		  Internationalization Working Group will not allow early implementation to
		  constrain its ability to make changes to this specification prior to final
		  release. Publication as a Working Draft does not imply endorsement by the W3C
		  Membership. It is inappropriate to use W3C Working Drafts as reference material
		  or to cite them as other than "work in progress". A list of current <loc href="http://www.w3.org/TR/">W3C Recommendations and other technical documents</loc> can be found at <loc href="http://www.w3.org/TR/">http://www.w3.org/TR</loc>.</p> 
		<p>For information about the requirements that informed the development
		  of important parts of this specification, see <titleref>Requirements for String
		  Identity Matching and String Indexing</titleref> <bibref ref="CharReq"/>.</p> 
	 </status><langusage><language id="en">en</language></langusage> 
	 <revisiondesc> 
		<p>lkj</p> 
	 </revisiondesc></header> 
  <body> 
	 <div1 id="sec-Intro"><head>Introduction</head> 
		<div2 id="sec-GoalsScope"> 
		  <head>Goals and Scope</head> 
		  <p>The goal of this document is to facilitate use of the Web by all
			 people, regardless of their language, script, writing system, and cultural
			 conventions, in accordance with the
			 <titleref href="http://www.w3.org/Consortium/#goals">W3C goal of universal
			 access</titleref>. One basic prerequisite for achieving this goal is the ability to
			 transmit and process the characters used around the world in a well-defined and
			 well-understood way.</p> 
		  <p>The main target audience of this document is W3C specification
			 developers. This document defines conformance requirements for other W3C
			 specifications. This document and parts of it can also be referenced from other
			 W3C specifications.</p> 
		  <p>Other audiences of this document include software developers,
			 content developers, and authors of specifications outside the W3C. Software
			 developers and content developers implement and use W3C specifications. This
			 document defines some conformance requirements for software developers and
			 content developers that implement and use W3C specifications. It also helps
			 software developers and content developers to understand the character-related
			 provisions in other W3C specifications.</p> 
		  <p>The character model described in this document provides authors of
			 specifications, software developers, and content developers with a common
			 reference for consistent, interoperable text manipulation on the World Wide
			 Web. Working together, these three groups can build a more international
			 Web.</p> 
		  <p>Topics addressed include encoding identification, early uniform
			 normalization, string identity matching, string indexing, and URI conventions.
			 Some introductory material on characters and character encodings is also
			 provided.</p> 
		  <p>Topics not addressed or barely touched include collation (sorting),
			 fuzzy matching and language tagging. Some of these topics may be addressed in a
			 future version of this specification.</p> 
		  <p>At the core of the model is the Universal Character Set (UCS),
			 defined jointly by The Unicode Standard <bibref ref="unicode"/> and ISO/IEC
			 10646 <bibref ref="iso10646"/>. In this document, <term>Unicode</term> is used
			 as a synonym for the Universal Character Set. The model will allow Web
			 documents authored in the world's scripts (and on different platforms) to be
			 exchanged, read, and searched by Web users around the world.</p> 
		  <p>All W3C specifications must conform to this document (see section
			 <specref ref="sec-Conformance"/>). Authors of other specifications (for
			 example, IETF specifications) are strongly encouraged to take guidance from
			 it.</p> 
		  <p>Since other W3C specifications will be based on some of the
			 provisions of this document, without repeating them, software developers
			 implementing W3C specifications must conform to these provisions.</p> 
		</div2> 
		<div2 id="sec-Background"> 
		  <head>Background</head> 
		  <p>This section provides some historical background on the topics
			 addressed in this document.</p> 
		  <p>Starting with <titleref>Internationalization of the Hypertext Markup
			 Language</titleref> <bibref ref="RFC2070"/>, the Web community has recognized
			 the need for a character model for the World Wide Web. The first step towards
			 building this model was the adoption of Unicode as the document character set
			 for HTML.</p> 
		  <p>The choice of Unicode was motivated by the fact that Unicode: 
			 <ulist> 
				<item> 
				  <p>is the only universal character repertoire available,</p> 
				</item> 
				<item> 
				  <p>covers the widest possible range,</p> 
				</item> 
				<item> 
				  <p>provides a way of referencing characters independent of the
					 encoding of a resource,</p> 
				</item> 
				<item> 
				  <p>is being updated/completed carefully,</p> 
				</item> 
				<item> 
				  <p>is widely accepted and implemented by industry.</p> 
				</item> 
			 </ulist></p> 
		  <p>W3C adopted Unicode as the document character set for HTML in
			 <bibref ref="html40"/>. The same approach was later used for specifications
			 such as XML 1.0 <bibref ref="xml10"/> and CSS2 <bibref ref="css2"/>. Unicode
			 now serves as a common reference for W3C specifications and applications.</p> 
		  <p>The IETF has adopted some policies on the use of character sets on
			 the Internet (see <bibref ref="rfc2277"/>).</p> 
		  <p>As long as data transfer on the Web remained mostly unidirectional
			 (from server to browser) and the main purpose was to render documents, the use
			 of Unicode without specifying additional details was sufficient. However, the
			 Web has grown: 
			 <ulist> 
				<item> 
				  <p>Data transfers among servers, proxies, and clients, in all
					 directions, have increased.</p> 
				</item> 
				<item> 
				  <p>Non-ASCII characters <bibref ref="MIME"/> are being used in
					 more and more places.</p> 
				</item> 
				<item> 
				  <p>Data transfers between different protocol/format elements
					 (such as element/attribute names, URI components, and textual content) have
					 increased.</p> 
				</item> 
				<item> 
				  <p>More and more APIs are defined, not just protocols and
					 formats.</p> 
				</item> 
			 </ulist></p> 
		  <p>In short, the Web may be seen as a single, very large application
			 (see <bibref ref="Nicol"/>), rather than as a collection of small independent
			 applications.</p> 
		  <p>While these developments strengthen the requirement that Unicode be
			 the basis of a character model for the Web, they also create the need for
			 additional specifications on the application of Unicode to the Web. Some
			 aspects of Unicode that require additional specification for the Web include: 
			 <ulist> 
				<item> 
				  <p>Choice of encoding forms (UTF-8, UTF-16, UTF-32).</p> 
				</item> 
				<item> 
				  <p>Counting characters and measuring string length (in the
					 presence of variable-length encodings and combining characters).</p> 
				</item> 
				<item> 
				  <p>Duplicate encodings (e.g. precomposed vs decomposed).</p> 
				</item> 
				<item> 
				  <p>Use of control codes for various purposes (e.g.
					 bidirectionality control, symmetric swapping, etc.).</p> 
				</item> 
			 </ulist></p> 
		  <p>It should be noted that such properties also exist in legacy
			 encodings (where <term>legacy encoding</term> is taken to mean any character
			 encoding not based on Unicode), and in many cases have been inherited by
			 Unicode in one way or another from such legacy encodings.</p> 
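		  <p>The issue of duplicate encodings listed above can be illustrated with
			 a minimal Python sketch. The helper name <code>same_after_nfc</code> is
			 hypothetical, and the choice of Normalization Form C for the comparison is an
			 assumption made for illustration only; it is not a requirement taken from this
			 section.</p> 
		  <eg><![CDATA[
```python
import unicodedata

# Two encodings of what a reader perceives as the same letter.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def same_after_nfc(a, b):
    """Compare two strings after normalizing both to NFC (an assumed choice)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```
]]></eg> 
		  <p>A plain comparison of the two strings reports them as different,
			 and their lengths in code points differ as well; only the normalized
			 comparison treats them as equal.</p> 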
		  <p>The remainder of this document presents additional specifications
			 and requirements to ensure an interoperable character model for the Web, taking
			 into account earlier work (from W3C, ISO and IETF).</p> 
		</div2> 
		<div2 id="sec-Notation"><head>Terminology and Notation</head> 
		  <p>For the purpose of this specification, the <term>producer</term> of
			 text data is the sender of the data in the case of protocols, and the tool that
			 produces the data in the case of formats. The <term>recipient</term> of text
			 data is the software module that receives the data.</p> 
		  <note> 
			 <p>A software module may be both a recipient and a producer.</p> 
		  </note> 
		  <p>Unicode code points are denoted as U+hhhh, where "hhhh" is a
			 sequence of at least four, and at most six hexadecimal digits.</p> 
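		  <p>The notation can be sketched in Python as follows. The function names
			 <code>format_code_point</code> and <code>parse_code_point</code> are
			 hypothetical, and the use of upper-case hexadecimal digits on output is an
			 assumption based on common practice rather than on the definition above.</p> 
		  <eg><![CDATA[
```python
import re

# "U+" followed by at least four and at most six hexadecimal digits.
_NOTATION = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def format_code_point(cp):
    """Return the U+hhhh notation for an integer code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode code space")
    # Pad to four digits; larger values naturally use five or six digits.
    return "U+%04X" % cp


def parse_code_point(text):
    """Return the integer code point denoted by a U+hhhh string."""
    match = _NOTATION.fullmatch(text)
    if match is None:
        raise ValueError("not in U+hhhh notation: %r" % text)
    cp = int(match.group(1), 16)
    if cp > 0x10FFFF:
        raise ValueError("code point outside the Unicode code space")
    return cp
```
]]></eg> 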
		</div2> 
	 </div1> 
	 <div1 id="sec-Conformance"><head>Conformance</head> 
		<p>In this document, requirements are expressed using the key words
		  "<rfc2119>MUST</rfc2119>", "<rfc2119>MUST NOT</rfc2119>",
		  "<rfc2119>REQUIRED</rfc2119>", "<rfc2119>SHALL</rfc2119>" and "<rfc2119>SHALL
		  NOT</rfc2119>". Recommendations are expressed using the key words
		  "<rfc2119>SHOULD</rfc2119>", "<rfc2119>SHOULD NOT</rfc2119>" and
		  "<rfc2119>RECOMMENDED</rfc2119>". "<rfc2119>MAY</rfc2119>" and
		  "<rfc2119>OPTIONAL</rfc2119>" are used to indicate optional features or
		  behaviour. These keywords are used in accordance with RFC 2119
		  <bibref ref="rfc2119"/>.</p> 
		<p>This specification places conformance requirements on specifications,
		  on software and on Web content. To aid the reader, all requirements are
		  preceded by '[X]' where 'X' is one of 'S' for specifications, 'I' for software
		  implementations, and 'C' for Web content. These markers indicate the relevance
		  of the requirement and allow the reader to quickly locate relevant requirements
		  using the browser's search function.
		  <req><req-type>S</req-type><req-type>I</req-type><req-type>C</req-type><req-text>In
		  order to conform to this document, specifications <rfc2119>MUST NOT</rfc2119>
		  violate any requirements preceded by [S], software <rfc2119>MUST NOT</rfc2119>
		  violate any requirements preceded by [I], and content <rfc2119>MUST
		  NOT</rfc2119> violate any requirements preceded by [C].</req-text></req></p> 
		<p><req><req-type>S</req-type><req-text>Every W3C specification
		  <rfc2119>MUST</rfc2119>:</req-text> 
		  <olist> 
			 <item> 
				<p>conform to the requirements applicable to specifications,</p> 
			 </item> 
			 <item> 
				<p>specify that implementations <rfc2119>MUST</rfc2119> conform to
				  the requirements applicable to software, and</p> 
			 </item> 
			 <item> 
				<p>specify that content created according to that specification
				  <rfc2119>MUST</rfc2119> conform to the requirements applicable to content.</p> 
			 </item> 
		  </olist></req> </p> 
		<p><req><req-type>S</req-type><req-text>If an existing W3C specification
		  does not conform to the requirements in this document, then the next version of
		  that specification <rfc2119>MUST</rfc2119> be modified in order to
		  conform.</req-text></req></p> 
		<p><req><req-type>I</req-type><req-text>Where this specification contains
		  a procedural description, it <rfc2119>MUST</rfc2119> be understood as a way to
		  specify the desired external behavior. Implementations <rfc2119>MAY</rfc2119>
		  use other ways of achieving the same results, as long as observable behavior is
		  not affected.</req-text></req></p> 
	 </div1> 
	 <div1 id="sec-Characters"><head>Characters</head> 
		<div2 id="sec-Perceptions"><head>Perceptions of Characters</head> 
		  <div3 id="sec-PerceptionsIntro"><head>Introduction</head> 
			 <p>The glossary entry in <bibref ref="unicode30"/> gives:</p> 
			 <p><quote>Character. (1) The smallest component of written language
				that has semantic value; refers to the abstract meaning and/or shape
				...</quote></p> 
			 <p>The word <qterm>character</qterm> is used in many contexts, with
				different meanings. Human cultures have radically differing writing systems,
				leading to radically differing concepts of a character. Such wide variation in
				end user experience can, and often does, result in misunderstanding. This
				variation is sometimes mistakenly seen as the consequence of imperfect
				technology. Instead, it derives from the great flexibility and creativity of
				the human mind and the long tradition of writing as an important part of the
				human cultural heritage. The alphabetic approach used by scripts such as Latin,
				Cyrillic and Greek is only one of several possibilities.</p><example><p>Japanese
			 hiragana and katakana are syllabaries. A character in these scripts corresponds
			 to a syllable (usually a combination of consonant plus vowel).</p></example> <example><p>Korean Hangul is a featural syllabary that combines symbols for
			 individual sounds of the language into square syllabic blocks. Depending on the
			 user and the application, either the individual symbols or the syllabic
			 clusters can be considered to be characters.</p></example> <example><p>Indic scripts
			 are abugidas. Each consonant letter carries an inherent vowel that is
			 eliminated or replaced using semi-regular or irregular ways to combine
			 consonants and vowels into clusters. Depending on the user and the application,
			 either individual consonants or vowels, or the consonant or consonant-vowel
			 clusters can be perceived as characters.</p></example><example><p>Arabic script is
			 an example of an abjad. Short vowel sounds are typically not written at all.
			 When they are written they are indicated by the use of combining marks placed
			 above and below the consonantal letters.</p></example> 
			 <p>The developers of W3C specifications, and the developers of
				software based on those specifications, are likely to be more familiar with
				usages they have experienced and less familiar with the wide variety of usages
				in an international context. Furthermore, within a computing context,
				characters are often confused with related concepts, resulting in incomplete or
				inappropriate specifications and software.</p> 
			 <p>This section examines some of these contexts, meanings and
				confusions.</p> 
		  </div3> 
		  <div3 id="sec-WritingSystem"><head>Units of Aural Rendering</head> 
			 <p>In some scripts, characters have a close relationship to phonemes
				(a <term>phoneme</term> is a minimally distinct sound in the context of a
				particular spoken language), while in others they are closely related to
				meanings. Even when characters (loosely) correspond to phonemes, this
				relationship may not be simple, and there is rarely a one-to-one correspondence
				between character and phoneme.</p><example><p>In the English sentence,
			 <quote>They were too close to the door to close it.</quote> the same character
			 <qchar>s</qchar> is used to represent both /s/ and /z/ phonemes.</p></example> <example><p>In many scripts a single character may represent a sequence of
			 phonemes, such as the syllabic characters of Japanese hiragana.</p></example> <example><p>In many writing systems a sequence of characters may represent a
			 single phoneme, for example <qchar>wr</qchar> and <qchar>ng</qchar> in
			 <quote>writing</quote>.</p></example> 
			 <p><req><req-type>S</req-type><req-type>I</req-type><req-text>Specifications
				and software <rfc2119>MUST NOT</rfc2119> assume that there is a one-to-one
				correspondence between characters and the sounds of a
				language.</req-text></req></p> 
		  </div3> 
		  <div3 id="sec-VisualRenderingUnits"><head>Units of Visual
				Rendering</head> 
			 <p>Visual rendering introduces the notion of a <emph>glyph</emph>.
				A <term>glyph</term> is defined by ISO/IEC 9541-1 <bibref ref="iso9541"/> as
				<quote>a recognizable abstract graphic symbol which is independent of a
				specific design</quote>. There is <emph>not</emph> a one-to-one correspondence
				between characters and glyphs: 
				<ulist> 
				  <item> 
					 <p>A single character can be represented by multiple glyphs
						(each glyph is then part of the representation of that character). These glyphs
						may be physically separated from one another. </p> 
				  </item> 
				  <item> 
					 <p>A single glyph may represent a sequence of characters (this
						is the case with ligatures, among others).</p> 
				  </item> 
				  <item> 
					 <p>A character may be rendered with very different glyphs
						depending on the context.</p> 
				  </item> 
				  <item> 
					 <p>A single glyph may represent different characters (e.g.
						capital Latin A, capital Greek A and capital Cyrillic A).</p> 
				  </item> 
				</ulist></p> 
			 <p>Each glyph can be represented by a number of different glyph
				images; a set of glyph images makes up a <term>font</term>. Glyphs can be
				construed as the basic units of organization of the visual rendering of text,
				just as characters are the basic units of organization of encoded text.</p> 
			 <p><req><req-type>S</req-type><req-type>I</req-type><req-text>Specifications
				and software <rfc2119>MUST NOT</rfc2119> assume a one-to-one mapping between
				character codes and units of displayed text.</req-text></req></p> 
			 <p>See <specref ref="sec-CharExamples"/> for examples of the
				complexities of character to glyph mapping.</p> 
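			 <p>One facet of this mismatch, several code points forming one
				displayed unit, can be sketched in Python. The function
				<code>combining_clusters</code> is hypothetical and is only a rough
				approximation: it groups a base code point with the combining marks that
				follow it, and does not cover ligatures, contextual shaping or full
				text segmentation rules.</p> 
			 <eg><![CDATA[
```python
import unicodedata


def combining_clusters(text):
    """Group each code point with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # A non-zero canonical combining class marks a combining character.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```
]]></eg> 
			 <p>For the string consisting of U+0065, U+0301 and U+0061, the sketch
				yields two clusters although the string contains three code points.</p> 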
			 <p>Some scripts, in particular Arabic and Hebrew, are written from
				right to left. Text including characters from these scripts can run in both
				directions and is therefore called bidirectional text (see example 
				<loc href="#exampleA6">A.6</loc> in Appendix A). The Unicode
				Standard <bibref ref="unicode"/> requires that characters be stored and
				interchanged in logical order. <req><req-type>S</req-type><req-text>Protocols,
				data formats and APIs <rfc2119>MUST</rfc2119> store, interchange or process
				text data in logical order.</req-text></req></p> 
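			 <p>As an illustration of logical order, the following Python sketch
				stores a Latin word followed by three Hebrew letters in the order in which
				they are read, and inspects the directional class of each character. The
				sample string and the helper <code>bidi_classes</code> are assumptions
				chosen for this example; display reordering itself is not implemented
				here.</p> 
			 <eg><![CDATA[
```python
import unicodedata

# Logical order: "abc", a space, then ALEF, BET, GIMEL as they are read.
text = "abc \u05d0\u05d1\u05d2"


def bidi_classes(s):
    """Return the Unicode bidirectional class of each character in s."""
    return [unicodedata.bidirectional(ch) for ch in s]
```
]]></eg> 
			 <p>The stored sequence never changes with display direction; a renderer
				derives the visual order from properties such as the classes returned
				above.</p> 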
			 <p id="sec-CharExamplesA8">In the presence of bidi text, two possible
				selection modes must be considered. The first is <term>logical selection
				mode</term>, which selects all the characters <emph>logically</emph> located
				between the end-points of the user's mouse gesture. Here the user selects from
				between the first and second letters of the second word to the middle of the
				number. Logical selection looks like this:</p> <figure><table border="1" cellspacing="0" cellpadding="5"><tbody><tr><th>In memory</th><td align="center"><image><graphic source="images/logSelMemory.gif" width="323" height="27"/><alt>In the example used, logical selection is depicted as one highlighted range of characters in memory.</alt></image></td></tr><tr><th>On screen</th><td align="center"><image><graphic source="images/logSelScreen.gif" width="144" height="32"/><alt>The same example on screen, where logical selection appears as two highlighted ranges of characters.</alt></image></td></tr></tbody></table></figure> 
			 <p>It is a consequence of the bidirectionality of the text that a
				single, continuous logical selection in memory results in a <emph>discontinuous
				selection appearing on the screen</emph>. This discontinuity, as well as the
				somewhat unintuitive behavior of the cursor, makes some users prefer a
				<term>visual selection mode</term>, which selects all the characters
				<emph>visually</emph> located between the end-points of the user's mouse
				gesture. With the same mouse gesture as before, we now obtain:</p> 
			 <figure><table border="1" cellpadding="5" cellspacing="0"><tbody><tr><th>In
						memory</th><td align="center"><image><graphic source="images/visSelMemory.gif" width="343" height="27"/><alt>In the example used, visual selection is depicted as two highlighted ranges of characters in memory.</alt></image></td></tr><tr><th>On screen</th><td align="center"><image><graphic source="images/visSelScreen.gif" width="141" height="33"/><alt>The same example on screen, where visual selection highlights a single range of characters.</alt></image></td></tr></tbody></table></figure> 
			 <p>In this mode, a single visual selection range results in
				<emph>two</emph> logical ranges, which have to be accommodated by protocols,
				APIs and implementations.</p> 
			 <p><req><req-type>S</req-type><req-text>Specifications of protocols
				and APIs that involve selection of ranges <rfc2119>SHOULD</rfc2119> provide for
				discontiguous selections, at least to the extent necessary to support
				implementation of visual selection on screen on top of those protocols and
				APIs.</req-text></req></p> 
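			 <p>A discontiguous selection can be represented as a list of logical
				ranges. The Python sketch below is one possible way to derive such a list;
				it assumes that the renderer supplies a hypothetical
				<code>visual_to_logical</code> list giving, for each position on screen,
				the index of the corresponding character in memory. Computing that list
				(the bidirectional reordering itself) is outside the scope of the
				sketch.</p> 
			 <eg><![CDATA[
```python
def visual_selection_to_logical_ranges(visual_to_logical, start, end):
    """Map visual positions [start, end) to merged logical (start, end) ranges."""
    logical = sorted(visual_to_logical[start:end])
    ranges = []
    for index in logical:
        if ranges and ranges[-1][1] == index:
            # Extend the current range when the index is adjacent to it.
            ranges[-1] = (ranges[-1][0], index + 1)
        else:
            # Otherwise open a new half-open range.
            ranges.append((index, index + 1))
    return ranges
```
]]></eg> 
			 <p>With ten characters of which those at logical indices 4 to 7 form a
				right-to-left run, a single visual selection that straddles the boundary of
				that run produces two logical ranges, which is why an interface limited to
				one range cannot carry a visual selection.</p> 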
		  </div3> 
		  <div3 id="sec-InputUnits"><head>Units of Input</head> 
			 <p>In keyboard input, it is <emph>not</emph> always the case that
				keystrokes and input characters correspond one-to-one. A limited number of keys
				can fit on a keyboard. Some keyboards will generate multiple characters from a
				single keypress. In other cases (<qterm>dead keys</qterm>) a key will generate
				no characters, but affect the results of subsequent keypresses. Many writing
				systems have far too many characters to fit on a keyboard and must rely on more
				complex <term>input methods</term>, which transform keystroke sequences into
				character sequences. Other languages may make it necessary to input some
				characters with special modifier keys. See <specref ref="sec-CharExamples"/>
				for examples of non-trivial input.</p> 
			 <p><req><req-type>S</req-type><req-type>I</req-type><req-text>Specifications
				and software <rfc2119>MUST NOT</rfc2119> assume that a single keystroke results
				in a single character, nor that a single character can be input with a single
				keystroke (even with modifiers), nor that keyboards are the same all over the
				world.</req-text></req></p> 
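			 <p>The effect of dead keys can be sketched with a very small input
				method in Python. The table <code>DEAD_KEY_TABLE</code> and the function
				<code>keystrokes_to_text</code> are hypothetical, and the fallback used when
				no composition exists is an assumption; real keyboard layouts and input
				methods differ from this sketch and from one another.</p> 
			 <eg><![CDATA[
```python
ACUTE = "\u00b4"      # dead key: ACUTE ACCENT
DIAERESIS = "\u00a8"  # dead key: DIAERESIS

# Hypothetical table: (dead key, next key) -> character produced.
DEAD_KEY_TABLE = {
    (ACUTE, "e"): "\u00e9",
    (ACUTE, "a"): "\u00e1",
    (DIAERESIS, "o"): "\u00f6",
}
DEAD_KEYS = {ACUTE, DIAERESIS}


def keystrokes_to_text(keys):
    """Turn a sequence of keystrokes into text, honoring dead keys."""
    out = []
    pending = None  # dead key waiting for the next keystroke
    for key in keys:
        if pending is not None:
            composed = DEAD_KEY_TABLE.get((pending, key))
            # Assumed fallback: emit both keys when no composition exists.
            out.append(composed if composed is not None else pending + key)
            pending = None
        elif key in DEAD_KEYS:
            pending = key  # generates no character yet
        else:
            out.append(key)
    if pending is not None:
        out.append(pending)
    return "".join(out)
```
]]></eg> 
			 <p>Five keystrokes (c, a, f, the acute dead key, e) yield a
				four-character string here, so neither count can be inferred from the
				other.</p> 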
		  </div3> 
		  <div3 id="sec-CollationUnits"><head>Units of Collation</head> 
			 <p>String comparison as used in sorting and searching is based on
				units which do not in general have a one-to-one relationship to encoded
				characters. Such string comparison can aggregate a character sequence into a
				single <term>collation unit</term> with its own position in the sorting order,
				can separate a single character into multiple collation units, and can
				distinguish various aspects of a character (case, presence of diacritics, etc.)
				to be sorted separately (multi-level sorting).</p> 
			 <p>In addition, a certain amount of pre-processing may also be
				required, and in some languages (such as Japanese and Arabic) sort order is
				governed by higher order factors such as phonetics or word roots. Collation
				methods may also vary by application (e.g. dictionaries may be sorted
				differently than telephone books).</p><example><p>In traditional Spanish sorting, the letter sequences
			 <qchar>ch</qchar> and <qchar>ll</qchar> are treated as atomic collation units.
			 Although Spanish sorting, and to some extent Spanish everyday use, treat
			 <qchar>ch</qchar> as a single unit, current digital encodings treat it as two
			 letters, and keyboards do the same (the user types <qchar>c</qchar>, then
			 <qchar>h</qchar>).</p></example><example><p>In some languages, the letter
			 <qchar>æ</qchar> is sorted as two consecutive collation units: <qchar>a</qchar>
			 and <qchar>e</qchar>.</p></example><example><p>The sorting of text written in a
			 bicameral script (i.e. a script which has distinct upper and lower case
			 letters) is usually required to ignore case differences in a first pass; case
			 is then used to break ties in a later pass.</p></example><example><p>Treatment of
			 accented letters in sorting is dependent on the script or language in question.
			 The letter <qchar>ö</qchar> is treated as a modified <qchar>o</qchar> in
			 French, but as a letter completely independent from <qchar>o</qchar> (and
			 sorting after <qchar>z</qchar>) in Swedish. In German certain applications
			 treat the letter <qchar>ö</qchar> as if it were the sequence
			 <qchar>oe</qchar>.</p></example><example><p>In Thai the sequence U+0E44 U+0E01 must
			 be sorted as if it were written U+0E01 U+0E44. Reordering is typically done
			 during an initial pre-processing stage.</p></example> 
			 <p><req><req-type>S</req-type><req-type>I</req-type><req-text>Software
				that sorts or searches text for users <rfc2119>MUST</rfc2119> do so on the
				basis of appropriate collation units and ordering rules for the relevant
				language and/or application.</req-text></req></p> 
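			 <p>The traditional Spanish example above can be sketched in Python as a
				sort key built from collation units rather than from individual characters.
				The names <code>collation_units</code> and
				<code>traditional_spanish_key</code> are hypothetical, and the sketch is
				deliberately simplified: it handles a single level only and ignores accents
				and case distinctions, which a production collation library would
				address.</p> 
			 <eg><![CDATA[
```python
# Traditional Spanish alphabet in which "ch" and "ll" are single units.
ALPHABET = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "\u00f1", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
    "y", "z",
]
RANK = {unit: position for position, unit in enumerate(ALPHABET)}


def collation_units(word):
    """Split a word into collation units, treating "ch" and "ll" as atomic."""
    lowered = word.lower()
    units = []
    i = 0
    while i < len(lowered):
        pair = lowered[i:i + 2]
        if pair in ("ch", "ll"):
            units.append(pair)
            i += 2
        else:
            units.append(lowered[i])
            i += 1
    return units


def traditional_spanish_key(word):
    """Sort key: rank of each unit; unknown units sort after the alphabet."""
    return [RANK.get(unit, len(ALPHABET)) for unit in collation_units(word)]
```
]]></eg> 
			 <p>Sorting with this key places <quote>chico</quote> after
				<quote>cuna</quote> and <quote>llama</quote> after <quote>luz</quote>,
				whereas a comparison by code point places them before.</p> 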
		  </div3> 
		  <div3 id="sec-Storage"><head>Units of Storage</head> 
			 <p>Computer storage and communication rely on units of physical
				storage and information interchange, such as bits and bytes (also known as
				octets, as nowadays the word bytes is generally considered to mean 8-bit
				bytes). A frequent error in specifications and implementations is the equating
				of characters with units of physical storage. The mapping between characters
				and such units of storage is actually quite complex, and is discussed in the
				next section, <specref ref="sec-Digital"/>.</p> 
			 <p><req><req-type>S</req-type><req-type>I</req-type><req-text>Specifications
				and software <rfc2119>MUST NOT</rfc2119> assume a one-to-one relationship
				between characters and units of physical storage.</req-text></req></p> 
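			 <p>The following Python sketch counts the units needed to store the same
				four-character string in three encoding forms. The sample string and the
				function <code>code_unit_counts</code> are assumptions chosen for
				illustration.</p> 
			 <eg><![CDATA[
```python
# U+0061, U+00E9, U+20AC and U+10400: four code points.
text = "a\u00e9\u20ac\U00010400"


def code_unit_counts(s):
    """Count code points and the code units needed in three encoding forms."""
    return {
        "code points": len(s),
        "UTF-8 bytes": len(s.encode("utf-8")),
        # Explicit byte order avoids counting a byte order mark.
        "UTF-16 code units": len(s.encode("utf-16-le")) // 2,
        "UTF-32 code units": len(s.encode("utf-32-le")) // 4,
    }
```
]]></eg> 
			 <p>The four characters occupy ten bytes in UTF-8, five 16-bit units in
				UTF-16 and four 32-bit units in UTF-32, so a length expressed in storage
				units depends on the encoding form in use.</p> 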
		  </div3> 
		  <div3 id="sec-PerceptionsOutro"><head>Summary</head> 
			 <p>The term <term>character</term> is used differently in a variety
				of contexts and often leads to confusion when used outside of these contexts.
				In the context of the digital representations of text, a character can be
				defined informally as a small logical unit of text. <term>Text</term> is then
				defined as sequences of characters. While such an informal definition is
				sufficient to create or capture a common understanding in many cases, it is
				also sufficiently open to create misunderstandings as soon as details start to
				matter. In order to write effective specifications, protocol implementations,
				and software for end users, it is very important to understand that these
				misunderstandings can occur.</p> 
			 <p><req><req-type>S</req-type><req-text>When specifications use the
				term <qterm>character</qterm> it <rfc2119>MUST</rfc2119> be clear which of the
				possible meanings they intend.</req-text></req>
				<req><req-type>S</req-type><req-text>Specifications <rfc2119>SHOULD</rfc2119>
				avoid the use of the term <qterm>character</qterm> if a more specific term is
				available.</req-text></req></p> 
		  </div3> 
		</div2> 
		<div2 id="sec-Digital"><head>Digital Encoding of Characters</head> 
		  <p>To be of any use in computers, in computer communications and in
			 particular on the World Wide Web, characters must be encoded. In fact, much of
			 the information processed by computers over the last few decades has been
			 encoded text, exceptions being images, audio, video and numeric data. To
			 achieve text encoding, a large variety of encoding schemes have been devised,
			 which can loosely be defined as mappings between the character sequences that
			 users manipulate and the sequences of bits that computers manipulate.</p> 
		  <p>Given the complexity of text encoding and the large variety of
			 schemes for character encoding invented throughout the computer age, a more
			 formal description of the encoding process is useful. The process of defining a
			 text encoding can be described as follows (see <bibref ref="UTR17"/> for a more
			 detailed description): 
			 <olist> 
				<item id="def-repertoire"> 
				  <p>A set of characters to be encoded is identified. The
					 characters are pragmatically chosen to express text and to efficiently allow
					 various text processes in one or more target languages. They may not correspond
					 precisely to what users perceive as letters and other characters. The set of
					 characters is called a <term>repertoire</term>.</p> 
				</item> 
				<item id="def-CCS"> 
				  <p>Each character in the repertoire is then associated with a
					 (mathematical, abstract) non-negative integer, the <term>code point</term>
					 (also known as a <term>character number</term> or <term>code position</term>).
					 The result, a mapping from the repertoire to the set of non-negative integers,
					 is called a <term>coded character set (CCS)</term>.</p> 
				</item> 
				<item id="def-CEF"> 
				  <p>To enable use in computers, a suitable base datatype is
					 identified (such as a byte, a 16-bit unit of storage or other) and a
					 <term>character encoding form (CEF)</term> is used, which encodes the abstract
					 integers of a <acronym title="Coded Character Set">CCS</acronym> into sequences
					 of the <term>code units</term> of the base datatype. The encoding form can be
					 extremely simple (for instance, one which encodes the integers of the
					 <acronym title="Coded Character Set">CCS</acronym> into the natural
					 representation of integers of the chosen datatype of the computing platform) or
					 arbitrarily complex (a variable number of code units, where the value of each
					 unit is a non-trivial function of the encoded integer). </p> 
				</item> 
				<item id="def-CES"> 
				  <p>To enable transmission or storage using byte-oriented devices,
					 a <term>serialization scheme</term> or <term>character encoding scheme
					 (CES)</term> is next used. A <acronym title="Character Encoding Scheme">CES</acronym> is a mapping of the code units
					 of a <acronym title="Character Encoding Form">CEF</acronym> into well-defined
					 sequences of bytes, taking into account the necessary specification of
					 byte-order for multi-byte base datatypes and including in some cases switching
					 schemes between the code units of multiple
					 <acronym title="Character Encoding Scheme">CES</acronym>es (an example is ISO
					 2022). A <acronym title="Character Encoding Scheme">CES</acronym>, together
					 with the <acronym title="Coded Character Set">CCS</acronym>es it is used with,
					 is identified by an <acronym title="Internet Assigned Numbers Authority">IANA</acronym> charset identifier.
					 Given a sequence of bytes representing text and a <kw>charset</kw> identifier,
					 one can in principle unambiguously recover the sequence of characters of the
					 text.</p> 
				</item> 
			 </olist></p> 
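		  <example><p>The layers above can be sketched for a single character. This is
			 an illustrative sketch only, not a normative algorithm: the function names
			 <code>utf16_code_units</code> and <code>serialize_be</code> are assumptions
			 made for this example, UTF-16 is chosen as the encoding form, and surrogate
			 code points given as input are not rejected.</p>
		  <eg xml:space="preserve"><![CDATA[
```python
# Follow one character through the layers of the encoding process.
char = "\U00010333"  # GOTHIC LETTER DAGS

# Coded character set: the character is associated with a code point.
code_point = ord(char)


def utf16_code_units(cp):
    """Character encoding form: map one code point to UTF-16 code units."""
    if cp < 0x10000:
        return [cp]  # one 16-bit code unit (surrogate input is not checked)
    offset = cp - 0x10000
    # Two code units: a high surrogate followed by a low surrogate.
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def serialize_be(units):
    """Character encoding scheme: write 16-bit code units as big-endian bytes."""
    return b"".join(unit.to_bytes(2, "big") for unit in units)


units = utf16_code_units(code_point)
data = serialize_be(units)
print(hex(code_point), [hex(u) for u in units], data.hex())
```
]]></eg></example>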
		  <note> 
			 <p>The term <qterm>character encoding</qterm> is somewhat ambiguous,
				as it is sometimes used to describe the actual process of encoding characters
				and sometimes to denote a particular way to perform that process (as in
				<quote>this file is in the X character encoding</quote>). Context normally
				allows the distinction of those uses, once one is aware of the ambiguity.</p> 
		  </note> 
		  <note> 
			 <p>Unfortunately, there are some important cases of charset
				identifiers that denote a range of slight variants of an encoding scheme, where
				the differences may be crucial (e.g. the well-known yen/backslash case) and may
				vary over time. In those cases, recovery of the character sequence from a byte
				sequence is not totally unambiguous. See the
				<bibref ref="XML_Japanese_profile"/> for examples of such ambiguous
				charsets.</p> 
		  </note> 
		  <p>In very simple cases, the whole encoding process can be collapsed to
			 a single step, a trivial one-to-one mapping from characters to bytes; this is
			 the case, for instance, for US-ASCII <bibref ref="MIME"/> and ISO-8859-1.</p> 
		  <p id="Unicode_Encoding_Form">Text data is said to be in a
			 <term>Unicode encoding form</term> if it is encoded in UTF-8, UTF-16 or
			 UTF-32.</p> 
		</div2> 
		<div2 id="sec-Transcoding"><head>Transcoding</head> 
		  <p><term>Transcoding</term> is the process of converting text data from
			 one 
			 <loc href="#def-CEF">Character Encoding Form</loc> to another.
			 Transcoders work only at the level of character encoding and do not parse the
			 text; consequently, they do not deal with character escapes such as numeric
			 character references (see <specref ref="sec-Escaping"/>) and do not adjust
			 embedded character encoding information (for instance in an XML declaration or
			 in an HTML <el>meta</el> element).</p> 
		  <note> 
			 <p>Transcoding may involve one-to-one, many-to-one, one-to-many or
				many-to-many mappings. In addition, the storage order of characters varies
				between encodings: some, such as Unicode, prescribe logical ordering while
				others use visual ordering; among encodings that have separate diacritics, some
				prescribe that they be placed before the base character, some after. Because of
				these differences in sequencing characters, transcoding may involve reordering:
				thus XYZ may map to yxz.</p> 
		  </note> 
		  <p id="def-normalizing-transcoder">A <term>normalizing
			 transcoder</term> is a transcoder that converts from a legacy encoding to a
			 Unicode encoding form <emph>and</emph> ensures that the result is in Unicode
			 Normalization Form C (see <specref ref="sec-Unicode-normalized"/>). For most
			 legacy encodings, it is possible to construct a normalizing transcoder; it is
			 not possible to do so if the encoding's repertoire contains characters not in
			 Unicode.</p> 
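		  <example><p>A normalizing transcoder might be sketched as follows. The function
			 name <code>normalizing_transcode</code> is hypothetical, and the sketch
			 assumes that a decoder for the legacy encoding is available on the platform
			 (here the <code>cp1258</code> codec, which stores Vietnamese tone marks as
			 separate combining characters).</p>
		  <eg xml:space="preserve"><![CDATA[
```python
import unicodedata


def normalizing_transcode(data, legacy_encoding, target="utf-8"):
    """Decode legacy bytes, normalize to NFC, and re-encode in a Unicode form."""
    text = data.decode(legacy_encoding)  # legacy bytes -> Unicode characters
    normalized = unicodedata.normalize("NFC", text)
    return normalized.encode(target)


# "a" followed by a combining acute accent, as stored in the legacy encoding.
legacy = "a\u0301".encode("cp1258")

# After transcoding, the pair is the single precomposed character U+00E1.
print(normalizing_transcode(legacy, "cp1258"))
```
]]></eg>
		  <p>A transcoder that only decoded and re-encoded would pass the two-character
			 sequence through unchanged; the explicit normalization step is what makes the
			 result Normalization Form C.</p></example>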
		</div2> 
		<div2 id="sec-Strings"><head>Strings</head> 
		  <p>Various specifications use the notion of a <qterm>string</qterm>,
			 sometimes without defining precisely what is meant and sometimes defining it
			 differently from other specifications. The reason for this variability is that
			 there are in fact multiple reasonable definitions for a string, depending on
			 one's intended use of the notion; the term <qterm>string</qterm> is used for
			 all these different notions because these are actually just different views of
			 the same reality: a piece of text stored inside a computer. This section
			 provides specific definitions for different notions of <qterm>string</qterm>
			 which may be reused elsewhere.</p> 
		  <p id="def-byte-string"><term>Byte string</term>: A string viewed as a
			 sequence of bytes representing characters in a particular encoding. This
			 corresponds to a <termref def="def-CES">CES</termref>. This definition is
			 rarely useful, except when the textual nature is unimportant and the string is
			 treated only as a piece of opaque data with a length in bytes.
			 <req><req-type>S</req-type><req-text>Specifications in
			 general <rfc2119>SHOULD NOT</rfc2119> define a string as a <qterm>byte
			 string</qterm>.</req-text></req> </p> 
		  <p id="def-physical-string"><term>Code unit string</term>: A string
			 viewed as a sequence of code units representing characters in a particular
			 encoding. This corresponds to a <termref def="def-CEF">CEF</termref>. This
			 definition is useful in APIs that expose a physical representation of string
			 data. Example: For the DOM <bibref ref="dom1"/>, UTF-16 was chosen based on
			 widespread implementation practice.</p> 
		  <p id="def-character-string"><term>Character string</term>: A string
			 viewed as a sequence of characters, each represented by a code point in Unicode
			 <bibref ref="unicode"/>. This is usually what programmers consider to be a
			 string, although it may not match exactly what most users perceive as
			 characters. This is the highest layer of abstraction that ensures
			 interoperability with very low implementation effort.
			 <req><req-type>S</req-type><req-text>The <qterm>character string</qterm>
			 definition of a string is generally the most useful and
			 <rfc2119>SHOULD</rfc2119> be used by most specifications, following the
			 examples of Production [2] of XML 1.0 <bibref ref="xml10"/>, the SGML
			 declaration of HTML 4.0 <bibref ref="html401"/>, and the character model of RFC
			 2070 <bibref ref="rfc2070"/>. </req-text></req></p><example><p>Consider the
		  string <image><graphic source="images/gads_diff_zero.gif" width="32" height="18"/> <alt>GOTHIC LETTER DAGS different from DIGIT ZERO</alt></image>
		  comprising the characters U+10333 <uname>GOTHIC LETTER DAGS</uname>, U+2260
		  <uname>NOT EQUAL TO</uname> and U+0030 <uname>DIGIT ZERO</uname>, encoded in
		  UTF-16 in big-endian byte order. The rows of the following table show the
		  string viewed as a character string, code unit string and byte string,
		  respectively:</p><figure><table border="1" cellpadding="5" cellspacing="0"><tbody><tr><th>Character string</th><td colspan="4">U+10333
					 <image><graphic source="images/gads.gif" width="14" height="18"/><alt>GOTHIC LETTER DAGS</alt></image></td><td colspan="2">U+2260 <image><graphic source="images/not_equal.gif" width="8" height="18"/><alt>NOT EQUAL TO</alt></image></td><td colspan="2">U+0030 <image><graphic source="images/zero.gif" width="9" height="18"/> 
					 <alt>ZERO</alt></image></td></tr><tr><td>Code unit string</td><td colspan="2">D800</td><td colspan="2">DF33</td><td colspan="2">2260</td><td colspan="2">0030</td></tr><tr><td>Byte
					 string</td><td>D8</td><td>00</td><td>DF</td><td>33</td><td>22</td><td>60</td><td>00</td><td>30</td></tr></tbody></table></figure></example> 
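		  <example><p>The three views in the table can be reproduced with a short sketch.
			 It assumes a language whose native strings are sequences of Unicode code
			 points (Python 3 here); the variable names are illustrative only.</p>
		  <eg xml:space="preserve"><![CDATA[
```python
import struct

text = "\U00010333\u2260\u0030"

# Character string: one item per Unicode code point.
code_points = [ord(ch) for ch in text]

# Byte string: the UTF-16 big-endian serialization of the text.
byte_string = text.encode("utf-16-be")

# Code unit string: the same bytes regrouped into 16-bit units.
code_units = list(struct.unpack(">%dH" % (len(byte_string) // 2), byte_string))

print(len(code_points), len(code_units), len(byte_string))  # 3 4 8
```
]]></eg>
		  <p>The three lengths differ for the same text, which is why a specification
			 needs to say which view of a string it means when it refers to a length or an
			 index.</p></example>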
		  <note id="def-grapheme-string"> 
			 <p>It is also possible to view a string as a sequence of
				<term>graphemes</term>. In this case the string is divided into text units that
				correspond to the user's perception of where character boundaries occur in a
				visually rendered text. However, there is no standard rule for the segmentation
				of text in this way, and the segmentation will vary from language to language
				and even from user to user. Examples of possible approaches can be found in
				sections 5.12 and 5.15 of the Unicode Standard <bibref ref="unicode30"/>. </p> 
		  </note> 
		</div2> 
		<div2 id="sec-RefProcModel"><head>Reference Processing Model</head> 
		  <p>Many Internet protocols and data formats, most notably the very
			 important Web formats HTML, CSS and XML, are based on text. In those formats,
			 everything is text but the relevant specifications impose a structure on the
			 text, giving meaning to certain constructs so as to obtain functionality in
			 addition to that provided by plain text. HTML and XML are <term>markup
			 languages</term>, defining entities entirely composed of text but with
			 conventions allowing the separation of this text into <emph>markup</emph> and
			 <emph>character data</emph>. Citing from the XML 1.0 specification
			 <bibref ref="xml10"/>,
			 <xspecref href="http://www.w3.org/TR/2000/REC-xml-20001006#syntax">section
			 2.4</xspecref>:</p> 
		  <p><quote>Text consists of intermingled character data and markup.
			 [...] All text that is not markup constitutes the <term>character data</term>
			 of the document.</quote></p> 
		  <p>For the purposes of this section, the important aspect is that
			 everything is text, that is, a sequence of characters.</p> 
		  <p>Since its early days, the Web has seen the development of a
			 <term>Reference Processing Model</term>, first described for HTML in RFC 2070
			 <bibref ref="rfc2070"/>. This model was later embraced by XML and CSS. It is
			 applicable to any data format or protocol that is text-based as described
			 above. The essence of the Reference Processing Model is the use of Unicode as a
			 common reference. Use of the Reference Processing Model by a specification does
			 not, however, require that implementations actually use Unicode. The
			 requirement is only that the implementations behave as if the processing took
			 place as described by the Model.</p> 
		  <p>A specification conforms to the Reference Processing Model if all of
			 the following apply:</p> 
		  <ulist> 
			 <item> 
				<p><req><req-type>S</req-type><req-text>Specifications
				  <rfc2119>MUST</rfc2119> be defined in terms of Unicode characters, not bytes or
				  glyphs.</req-text></req></p> 
			 </item> 
			 <item> 
				<p><req><req-type>S</req-type><req-text>Specifications
				  <rfc2119>SHOULD</rfc2119> allow the use of the full range of Unicode code
				  points from 0 to 0x10FFFF inclusive; any exceptions <rfc2119>SHOULD</rfc2119>
				  be listed and justified; code points above 0x10FFFF <rfc2119>SHOULD
				  NOT</rfc2119> be used.</req-text></req></p> 
			 </item> 
			 <item> 
				<p><req><req-type>S</req-type><req-text>Specifications
				  <rfc2119>MAY</rfc2119> allow use of any character encoding which can be
				  transcoded to Unicode for its text entities.</req-text></req></p> 
			 </item> 
			 <item> 
				<p><req><req-type>S</req-type><req-text>Specifications
				  <rfc2119>MAY</rfc2119> choose to disallow or deprecate some encodings and to
				  make others mandatory. Independent of the actual encoding, the specified
				  behavior <rfc2119>MUST</rfc2119> be the same <emph>as if</emph> the processing
				  happened as follows:</req-text> 
				  <ulist> 
					 <item> 
						<p>The encoding of any text entity received by the
						  application implementing the specification <rfc2119>MUST</rfc2119> be
						  determined and the text entity <rfc2119>MUST</rfc2119> be interpreted as a
						  sequence of Unicode characters; this <rfc2119>MUST</rfc2119> be equivalent to
						  transcoding the entity to some Unicode encoding form, adjusting any character
						  encoding label if necessary, and receiving it in that Unicode encoding
						  form.</p> 
					 </item> 
					 <item> 
						<p>All processing <rfc2119>MUST</rfc2119> take place on this
						  sequence of Unicode characters.</p> 
					 </item> 
					 <item> 
						<p>If text is output by the application, the sequence of
						  Unicode characters <rfc2119>MUST</rfc2119> be encoded using an encoding chosen
						  among those allowed by the specification.</p> 
					 </item> 
				  </ulist></req></p> 
			 </item> 
			 <item> 
				<p><req><req-type>S</req-type><req-text>If a specification is such
				  that multiple text entities are involved (such as an XML document referring to
				  external parsed entities), it <rfc2119>MAY</rfc2119> choose to allow these
				  entities to be in different character encodings. In all cases, the
				  <termref def="sec-RefProcModel">Reference Processing Model</termref>
				  <rfc2119>MUST</rfc2119> be applied to all entities.</req-text></req></p> 
			 </item> 
		  </ulist> 
		  <p><req><req-type>S</req-type><req-text>All specifications that involve
			 text <rfc2119>MUST</rfc2119> specify processing according to the
			 <termref def="sec-RefProcModel">Reference Processing
			 Model</termref>.</req-text></req></p> 
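		  <example><p>The <quote>as if</quote> processing steps can be sketched as a
			 decode, process, encode pipeline. The function <code>process_entity</code>
			 and its parameters are assumptions made for illustration; the sketch does not
			 adjust embedded encoding labels, and a conforming implementation need only
			 behave as if it worked this way.</p>
		  <eg xml:space="preserve"><![CDATA[
```python
def process_entity(data, declared_encoding, transform, output_encoding="utf-8"):
    # 1. Interpret the received bytes as a sequence of Unicode characters.
    text = data.decode(declared_encoding)
    # 2. All processing operates on Unicode characters, never on raw bytes.
    result = transform(text)
    # 3. Encode the output in an encoding that the specification allows.
    return result.encode(output_encoding)


latin1 = "caf\u00e9".encode("iso-8859-1")
utf16 = "caf\u00e9".encode("utf-16")

# The same text received in two encodings yields the same output.
assert process_entity(latin1, "iso-8859-1", str.upper) == process_entity(
    utf16, "utf-16", str.upper
)
```
]]></eg></example>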
		  <note> 
			 <p>All specifications that derive from the XML 1.0 specification
				<bibref ref="xml10"/> automatically inherit this Reference Processing Model.
				XML is entirely defined in terms of Unicode characters and mandates the UTF-8
				and UTF-16 encodings while allowing any other encoding for parsed entities.</p>
			 
		  </note> 
		  <note> 
			 <p>When specifications choose to allow encodings other than Unicode
				encodings, implementers should be aware that the correspondence between the
				characters of a legacy encoding and Unicode characters may in practice depend
				on the software used for transcoding. See the Japanese XML Profile
				<bibref ref="XML_Japanese_profile"/> for examples of such inconsistencies.</p> 
		  </note> 
		</div2> 
		<div2 id="sec-Encodings"><head>Choice and Identification of Character
			 Encodings</head> 
		  <p>Because encoded text <emph>cannot</emph> be interpreted and
			 processed without knowing the encoding, it is vitally important that the
			 character encoding scheme (see <specref ref="sec-Digital"/>) is known at all
			 times and places where text is exchanged or processed.
			 <req><req-type>S</req-type><req-text>Specifications <rfc2119>MUST</rfc2119>
			 either specify a unique encoding, or provide character encoding identification
			 mechanisms such that the encoding of text can always be reliably
			 identified.</req-text></req> <req><req-type>S</req-type><req-text>When
			 designing a new protocol, format or API, specifications
			 <rfc2119>SHOULD</rfc2119> mandate a unique character
			 encoding.</req-text></req></p> 
		  <div3 id="sec-encoding-id"><head>Mandating a unique character
				encoding</head> 
			 <p>Mandating a unique character encoding is simple, efficient, and
				robust. There is no need for specifying, producing, transmitting, and
				interpreting encoding tags. At the receiver, the encoding will always be
				understood. There is also no ambiguity if data is transferred
				non-electronically and later has to be converted back to a digital
				representation. Even when there is a need for compatibility with existing data,
				systems, protocols and applications, multiple encodings can often be dealt with
				at the boundaries or outside a protocol, format, or API. The
				<acronym title="Document Object Model">DOM</acronym> <bibref ref="dom1"/> is an
				example of where this was done. The advantages of choosing a unique encoding
				grow as the pieces of text involved become smaller and as the specification
				moves closer to actual processing.</p> 
			 <p><req><req-type>S</req-type><req-text>When a unique encoding is
				mandated, the encoding <rfc2119>MUST</rfc2119> be UTF-8, UTF-16 or
				UTF-32.</req-text></req> <req><req-type>S</req-type><req-text>If a unique
				encoding is mandated and compatibility with US-ASCII is desired, UTF-8 (see
				<bibref ref="rfc2279"/>) is <rfc2119>RECOMMENDED</rfc2119>.</req-text></req> In
				other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate.
				Possible reasons for choosing one of these include efficiency of internal
				processing and interoperability with other processes.</p> 
			 <note> 
				<p>The IETF Charset Policy <bibref ref="rfc2277"/> specifies that
				  on the Internet <quote>Protocols MUST be able to use the UTF-8
				  charset</quote>.</p> 
			 </note> 
			 <note> 
				<p>The XML 1.0 specification <bibref ref="xml10"/> requires all
				  conforming XML processors to accept both UTF-16 and UTF-8.</p> 
			 </note> 
		  </div3> 
		  <div3 id="sec-encoding-id2"><head>Character Encoding
				Identification</head> 
			 <p>The MIME Internet specification <bibref ref="MIME"/> provides a
				good example of a mechanism for character encoding identification. The MIME
				<kw>charset</kw> parameter definition is intended to supply sufficient
				information to uniquely decode the sequence of bytes of the received data into
				a sequence of characters. The values are drawn from the IANA charset registry
				<bibref ref="iana"/>.</p> 
			 <note> 
				<p>In practice there is wide variation among implementations, so
				  uniqueness cannot be depended upon. See the end of
				  <specref ref="sec-RefProcModel"/> for more information.</p> 
			 </note> 
			 <note> 
				<p>The term <qterm>charset</qterm> derives from <qterm>character
				  set</qterm>, an expression with a long and tortured history (see
				  <bibref ref="connolly"/> for a discussion).</p> 
			 </note> 
			 <p><req><req-type>S</req-type><req-text>Specifications
				<rfc2119>SHOULD</rfc2119> avoid using the expression <qterm>character
				set</qterm>, as well as the term <qterm>charset</qterm> to refer to a character
				encoding scheme, except when the latter is used to refer to the MIME
				<kw>charset</kw> parameter or its IANA-registered values. The terms
				<qterm>character encoding</qterm> or <qterm>character encoding scheme</qterm>
				are <rfc2119>RECOMMENDED</rfc2119>.</req-text></req></p> 
			 <note> 
				<p>In XML, the XML declaration or the text declaration contains a
				  pseudo-attribute called <kw>encoding</kw> which identifies the character
				  encoding using an IANA charset name.</p> 
			 </note> 
			 <p>The IANA charset registry is the official list of names and
				aliases for character encodings on the Internet.</p> 
			 <p><req><req-type>S</req-type><req-text>If the unique encoding
				approach is not taken, specifications <rfc2119>SHOULD</rfc2119> mandate the use
				of the IANA charset registry names, and in particular the names identified in
				the registry as <qterm>MIME preferred names</qterm>, to designate character
				encodings in protocols, data formats and APIs.</req-text></req>
				<req><req-type>S</req-type><req-text>The <qterm>x-</qterm> convention for
				unregistered character encoding names <rfc2119>SHOULD NOT</rfc2119> be used,
				having led to abuse in the past.</req-text></req> (<qterm>x-</qterm> was used
				for character encodings that were widely used, even long after there was an
				official registration.)
				<req><req-type>I</req-type><req-type>C</req-type><req-text>Content and software
				that label textual data <rfc2119>MUST</rfc2119> use one of the names mandated
				by the appropriate specification (e.g. the XML specification when editing XML
				text) and <rfc2119>SHOULD</rfc2119> use the MIME preferred name of an encoding
				to label data in that encoding.</req-text></req>
				<req><req-type>I</req-type><req-type>C</req-type><req-text>An IANA-registered
				<kw>charset</kw> name <rfc2119>MUST NOT</rfc2119> be used to label textual data
				in an encoding other than the one identified in the IANA registration of that
				name.</req-text></req></p> 
			 <p><req><req-type>S</req-type><req-text>If the unique encoding
				approach is not chosen, specifications <rfc2119>MUST</rfc2119> designate at
				least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible
				encodings and <rfc2119>SHOULD</rfc2119> choose at least one of UTF-8 or UTF-16
				as mandated encoding forms (encoding forms that <rfc2119>MUST</rfc2119> be
				supported by implementations of the specification).</req-text></req>
				<req><req-type>S</req-type><req-text>Specifications <rfc2119>MAY</rfc2119>
				define either UTF-8 or UTF-16 as a default encoding form (or both if they
				define suitable means of distinguishing them), but they <rfc2119>MUST
				NOT</rfc2119> use any other character encoding as a default.</req-text></req>
				<req><req-type>S</req-type><req-text>Specifications <rfc2119>MUST NOT</rfc2119>
				use heuristics to determine the encoding of data.</req-text></req></p> 
			 <p><req><req-type>I</req-type><req-text><emph>Receiving</emph>
				software <rfc2119>MUST</rfc2119> determine the encoding of data from available
				information according to appropriate specifications.</req-text></req>
				<req><req-type>I</req-type><req-text>When an IANA-registered <kw>charset</kw>
				name is recognized, receiving software <rfc2119>MUST</rfc2119> interpret the
				received data according to the encoding associated with the name in the IANA
				registry.</req-text></req> <req><req-type>I</req-type><req-text>When no charset
				is provided, receiving software <rfc2119>MUST</rfc2119> adhere to the default
				encoding(s) specified in the specification.</req-text></req></p> 
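			 <example><p>The behavior required of receiving software might be sketched as
				follows. This is a sketch under stated assumptions: the default of UTF-8
				stands in for whatever the applicable specification defines, the function
				name <code>decode_received</code> is hypothetical, and the codec registry
				of the programming language is used as a stand-in for the set of
				IANA-registered names that the software recognizes.</p>
			 <eg xml:space="preserve"><![CDATA[
```python
import codecs

DEFAULT_ENCODING = "utf-8"  # assumption: the default of a hypothetical specification


def decode_received(data, charset_label=None):
    """Decode received bytes using the label if present, else the default."""
    if charset_label is None:
        # No label: fall back to the specified default, without heuristics.
        return data.decode(DEFAULT_ENCODING)
    try:
        # Python's codec registry stands in for the recognized charset names.
        codec = codecs.lookup(charset_label)
    except LookupError:
        # An unrecognized label is reported rather than guessed at.
        raise ValueError("unsupported charset label: %r" % charset_label)
    return codec.decode(data)[0]


print(decode_received(b"caf\xe9", "ISO-8859-1"))
print(decode_received("caf\u00e9".encode("utf-8")))
```
]]></eg></example>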
			 <p><req><req-type>I</req-type><req-text>Receiving software
				<rfc2119>MAY</rfc2119> recognize as many encodings (names and aliases) as
				appropriate.</req-text></req> A field-upgradeable mechanism may be appropriate
				for this purpose. Certain encodings are more or less associated with certain
				languages (e.g. Shift-JIS with Japanese); trying to support a given language or
				set of customers may mean that certain encodings have to be supported. The
				encodings that need to be supported may change over time. This document does
				not give any advice on which encoding may be appropriate or necessary for the
				support of any given language.</p> 
			 <p><req><req-type>I</req-type><req-text>Software
				<rfc2119>MUST</rfc2119> completely implement the mechanisms for character
				encoding identification and <rfc2119>SHOULD</rfc2119> implement them in such a
				way that they are easy to use (for instance in HTTP servers).</req-text></req>
				<req><req-type>I</req-type><req-text>On interfaces to other protocols, software
				<rfc2119>SHOULD</rfc2119> support conversion between Unicode encoding forms as
				well as any other necessary conversions.</req-text></req></p> 
			 <p><req><req-type>C</req-type><req-text>Content
				<rfc2119>MUST</rfc2119> make use of available facilities for character encoding
				identification by always indicating character encoding; where the facilities
				offered for character encoding identification include defaults (e.g. in XML 1.0
				<bibref ref="xml10"/>), relying on such defaults is sufficient to satisfy this
				identification requirement.</req-text></req></p> 
			 <p>Because of the layered Web architecture (e.g. formats used over
				protocols), there may be multiple, and at times conflicting, sources of information about
				character encoding. <req><req-type>S</req-type><req-text>Specifications
				<rfc2119>MUST</rfc2119> define conflict-resolution mechanisms (e.g. priorities)
				for cases where there is multiple or conflicting information about character
				encoding.</req-text></req>
				<req><req-type>I</req-type><req-type>C</req-type><req-text>Software and content
				<rfc2119>MUST</rfc2119> carefully follow conflict-resolution mechanisms where
				there is multiple or conflicting information about character
				encoding.</req-text></req></p> 
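			 <example><p>A priority-based conflict-resolution mechanism can be as simple as
				an ordered list of sources. The source names and their order below are
				purely hypothetical; each specification defines its own priorities.</p>
			 <eg xml:space="preserve"><![CDATA[
```python
# Hypothetical priority order, highest first; a real specification defines its own.
PRIORITY = ("transport-header", "in-document-declaration")


def resolve_encoding(sources, default="utf-8"):
    """Return the encoding from the highest-priority source that supplies one."""
    for name in PRIORITY:
        label = sources.get(name)
        if label:
            return label
    # No source supplied a label: use the default given by the specification.
    return default


conflicting = {
    "transport-header": "iso-8859-1",
    "in-document-declaration": "utf-8",
}
print(resolve_encoding(conflicting))  # iso-8859-1
```
]]></eg></example>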
		  </div3> 
		  <div3 id="sec-private-use"><head>Private Use Code Points</head> 
			 <p>Unicode designates certain ranges of code points for private use:
				the Private Use Area (U+E000-F8FF) and planes 15 and 16 (U+F0000-FFFFD and
				U+100000-10FFFD). These code points are guaranteed to never be allocated to
				standard characters, and are available for use by private agreement between a
				producer and a recipient. However, their use is strongly discouraged, since
				private agreements do not scale on the Web. Code points from different private
				agreements may collide, and a private agreement and therefore the meaning of
				the code points can quickly get lost.</p> 
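			 <example><p>Software that wishes to warn authors about private use code points
				could test for the ranges listed above. The following is a minimal sketch;
				the names <code>is_private_use</code> and <code>find_private_use</code> are
				assumptions made for this example.</p>
			 <eg xml:space="preserve"><![CDATA[
```python
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
)


def is_private_use(ch):
    """Return True if the character lies in one of the private use ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def find_private_use(text):
    """Return (index, code point) pairs for private use characters in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if is_private_use(ch)]


print(find_private_use("ab\ue000c"))  # [(2, 57344)]
```
]]></eg></example>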
			 <p><req><req-type>S</req-type><req-text>Specifications <rfc2119>MUST
				NOT</rfc2119> define any assignments of private use code
				points.</req-text></req> <req><req-type>S</req-type><req-text>Conformance to a
				specification <rfc2119>MUST NOT</rfc2119> require the use of private use area
				characters.</req-text></req>
				<req><req-type>S</req-type><req-text>Specifications <rfc2119>SHOULD
				NOT</rfc2119> provide mechanisms for agreement on private use code points
				between parties and <rfc2119>MUST NOT</rfc2119> require the use of such
				mechanisms.</req-text></req>
				<req><req-type>S</req-type><req-type>I</req-type><req-text>Specifications and
				implementations <rfc2119>SHOULD</rfc2119> be designed in such a way as to not
				disallow the use of private use code points by private
				arrangement.</req-text></req> As an example, XML does not disallow the use of
				private use code points.</p> 
			 <p><req><req-type>S</req-type><req-text>Specifications
				<rfc2119>MAY</rfc2119> define markup to allow the transmission of symbols not
				in Unicode or to identify specific variants of Unicode
				characters.</req-text></req></p><example><p>MathML (see <bibref ref="mathml2"/>
			 <xspecref href="http://www.w3.org/TR/2001/PR-MathML2-20010108/chapter3.html#presm_mglyph">section
			 3.2.9</xspecref>) defines an element <el>mglyph</el> for mathematical symbols
			 not in Unicode.</p></example><example><p>SVG (see <bibref ref="svg"/>
			 <xspecref href="http://www.w3.org/TR/2000/CR-SVG-20001102/text.html#AlternateGlyphs">section
			 10.14</xspecref>) defines an element <el>altGlyph</el> which allows the
			 identification of specific display variants of Unicode characters.</p></example> 
		  </div3> 
		</div2> 
		<div2 id="sec-Escaping"><head>Character Escaping</head> 
		  <p>In text-based protocols or formats where characters can be either
			 part of character data or of markup (see <specref ref="sec-RefProcModel"/>), it
			 is often the case that certain characters are designated as having certain
			 specific protocol/format functions in certain contexts (e.g.
			 <qchar>&lt;</qchar> and <qchar>&amp;</qchar> serve as markup delimiters in HTML
			 and XML). These syntax-significant characters cannot be used to represent
			 themselves in text data in the same way as all other characters do. In addition,
			 formats are often represented in an encoding that cannot represent all
			 characters directly. </p> 
		  <p>To express syntax-significant or unrepresentable characters, a
			 technique called <term>escaping</term> is used. This works by creating an
			 additional syntactic construct, defining additional characters or defining
			 character sequences that have special meaning. Escaping a character means
			 expressing it using such a construct, appropriate to the format or protocol in
			 which the character appears; <term>expanding an escape</term> (or
			 <term>unescaping</term>) means replacing it with the character that it
			 represents. </p> 
		  <p>Certain guidelines apply to the way specifications define character
			 escapes. <req><req-type>S</req-type><req-text>The guidelines in this document
			 relating to the 
			 <loc href="#sec-Escaping">definition of character escapes</loc>
			 <rfc2119>MUST</rfc2119> be followed when designing new W3C protocols and
			 formats and <rfc2119>SHOULD</rfc2119> be followed as much as possible when
			 revising existing protocols and formats.</req-text></req></p> 
		  <ulist> 
			 <item> 
				<p><req><req-type>S</req-type><req-text>Specifications
				  <rfc2119>MUST NOT</rfc2119> invent a new escaping mechanism if an appropriate
				  one already exists. </req-text></req></p> 
			 </item> 
			 <item> 
				<p><req><req-type>S</req-type><req-text>The number of different
				  ways to escape a character <rfc2119>SHOULD</rfc2119> be minimized (ideally to
				  one).</req-text></req> [A well-known counter-example is that for historical
				  reasons, both HTML and XML have redundant decimal (&amp;#ddddd;) and
				  hexadecimal (&amp;#xhhhh;) escapes.]</p> 
			 </item> 
			 <item> 
				<p><req><req-type>S</req-type><req-text>Explicit end delimiters
				  <rfc2119>MUST</rfc2119> be provided. Escapes such as \uABCD where the end
				  delimiter is a space or any character other than [0-9A-F]
				  <rfc2119>SHOULD</rfc2119> be avoided.</req-text></req> These escapes are not
				  clear visually, and can cause an editor to insert spurious line-breaks when
				  word-wrapping on spaces. Forms like SPREAD's &amp;UABCD; <bibref ref="spread"/> or XML's &amp;#xhhhh;, where the escape is explicitly terminated
				  by a semicolon, are much better. </p> 
			 </item> 
			 <item> 
				<p><req><req-type>S</req-type><req-text>Whenever specifications
				  define escapes that allow the representation of characters using a number, the
				  number <rfc2119>SHOULD</rfc2119> be in hexadecimal
				  notation.</req-text></req></p> 
			 </item> 
			 <item> 
				<p><req><req-type>S</req-type><req-text>Escaped characters
				  <rfc2119>SHOULD</rfc2119> be acceptable wherever unescaped characters are; this
				  does not preclude that a syntax-significant character, when escaped, loses its
				  significance in the syntax. In particular, escaped characters
				  <rfc2119>SHOULD</rfc2119> be acceptable in identifiers and
				  comments.</req-text></req></p> 
			 </item> 
		  </ulist> 
		  <p>Certain guidelines apply to content developers, as well as to
			 software that generates content:</p> 
		  <ulist> 
			 <item> 
				<p><req><req-type>I</req-type><req-type>C</req-type><req-text>Escapes
				  <rfc2119>SHOULD</rfc2119> be avoided when the characters to be expressed are
				  representable in the character encoding of the document.</req-text></req></p> 
			 </item> 
			 <item> 
				<p><req><req-type>I</req-type><req-type>C</req-type><req-text>Since
				  character set standards usually list character numbers as hexadecimal, content
				  <rfc2119>SHOULD</rfc2119> use the hexadecimal form of escapes when there is
				  one.</req-text></req></p> 
			 </item> 
		  </ulist> 
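		  <example><p>The following Python sketch illustrates the guidelines above
			 under stated assumptions: a character is escaped only when the document's
			 character encoding cannot represent it, and the escape uses the hexadecimal
			 form with an explicit end delimiter (XML's &amp;#xhhhh;). The function name
			 <code>escape_unrepresentable</code> is hypothetical and is not defined by
			 this or any other specification; the sketch also ignores the escaping of
			 syntax-significant characters.</p>
		  <eg xml:space="preserve"><![CDATA[
```python
def escape_unrepresentable(text, encoding):
    """Escape only the characters that `encoding` cannot represent.

    Hypothetical helper: uses hexadecimal numeric character references
    (&#xhhhh;), which are explicitly terminated by a semicolon.
    """
    pieces = []
    for ch in text:
        try:
            ch.encode(encoding)
            pieces.append(ch)  # representable: leave the character unescaped
        except UnicodeEncodeError:
            pieces.append("&#x{:X};".format(ord(ch)))  # hexadecimal escape
    return "".join(pieces)


word = "su\u00E7on"  # U+00E7 is LATIN SMALL LETTER C WITH CEDILLA

# ISO-8859-1 can represent U+00E7, so no escape is produced.
print(escape_unrepresentable(word, "iso-8859-1") == word)

# US-ASCII cannot, so the character is escaped in hexadecimal.
print(escape_unrepresentable(word, "us-ascii"))
```
]]></eg></example>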
		</div2> 
	 </div1> 
	 <div1 id="sec-Normalization"><head>Early Uniform Normalization</head> 
		<p>This chapter discusses character data normalization for the Web.
		  <specref ref="sec-NormalizationMotivation"/> discusses the need for
		  normalization, and in particular early uniform normalization.
		  <specref ref="sec-TextNormalization"/> defines full normalization and gives
		  examples. <specref ref="sec-NormalizationApplication"/> assigns responsibilities
		  to various components and situations.</p> 
		<div2 id="sec-NormalizationMotivation"><head>Motivation</head> 
		  <p>As explained at length in <titleref>Requirements for String Identity
			 Matching and String Indexing</titleref> <bibref ref="CharReq"/>, the existence,
			 in many character encoding schemes, of multiple representations for what users
			 perceive as the same string makes it necessary to define character data
			 normalization. Without a precise specification, it is not possible to determine
			 reliably whether or not two strings are identical. Such a specification must
			 take into account character encoding, the way normalization is to be performed
			 and where or when (by sender or recipient) to perform it.</p> 
		  <p>String identity is central to the correct functioning of much
			 software, and in particular of large parts of the Web infrastructure
			 (protocols, formats, etc.). Incorrect string matching can have far-reaching
			 consequences, including the creation of security holes. Consider a contract,
			 encoded in XML, for buying goods: each item sold is described in an
			 <el>artículo</el> element; unfortunately, <quote>artículo</quote> is subject to
			 different representations in the character encoding of the contract. Suppose
			 that the contract is viewed and signed by means of a user agent that looks for
			 <el>artículo</el> elements, extracts them (matching on the element name),
			 presents them to the user and adds up their prices. If different instances of
			 the <el>artículo</el> element happen to be represented differently in a
			 particular contract, then the buyer and seller may see (and sign) different
			 contracts if their respective user agents perform string identity matching
			 differently, which is fairly likely in the absence of a well-defined
			 specification. The absence of a well-defined specification also means that
			 there is no way to resolve the ensuing contractual dispute.</p> 
		  <p>The Unicode Consortium provides four standard normalization forms
			 (see <titleref>Unicode Normalization Forms</titleref> <bibref ref="UTR15"/>).
			 For use on the Web, this document defines W3C Text Normalization by picking the
			 most appropriate of these (NFC) and additionally addressing the issues of
			 legacy encodings and of character escapes (which can denormalize text when
			 unescaped).</p> 
		  <p>Roughly speaking, NFC is defined such that combining character
			 sequences (a base character followed by one or more combining characters) are
			 replaced, as far as possible, by canonically equivalent precomposed characters.
			 Text in a Unicode encoding form is said to be in NFC if it doesn't contain any
			 combining sequence that could be replaced and if any remaining combining
			 sequence is in canonical order.</p> 
		  <p>This document also specifies that normalization is to be performed
			 <emph>early</emph> (by the sender) as opposed to <emph>late</emph> (by the
			 recipient). The reasons for that choice are manifold: 
			 <ulist> 
				<item> 
				  <p>Almost all legacy data as well as data created by current
					 software is normalized (using NFC).</p> 
				</item> 
				<item> 
				  <p>The number of Web components that generate or transform text
					 is considerably smaller than the number of components that receive text and
					 need to perform matching or other processes requiring normalized text.</p> 
				</item> 
				<item> 
				  <p>Current receiving components (browsers, XML parsers, etc.)
					 implicitly assume early normalization by not performing normalization
					 themselves. This is a vast legacy.</p> 
				</item> 
				<item> 
				  <p>Web components that generate and process text are in a much
					 better position to do normalization than other components; in particular, they
					 may be aware that they deal with a restricted repertoire only.</p> 
				</item> 
				<item> 
				  <p>Not all components of the Web that implement functions such as
					 string matching can reasonably be expected to do normalization. This, in
					 particular, applies to very small components and components in the lower layers
					 of the architecture.</p> 
				</item> 
				<item> 
				  <p>Forward-compatibility issues can be dealt with more easily:
					 less software needs to be updated, namely only the software that generates
					 newly introduced characters.</p> 
				</item> 
				<item> 
				  <p>It improves matching in cases where the character encoding is
					 partly undefined, such as URIs <bibref ref="rfc2396"/> in which non-ASCII bytes
					 have no defined meaning.</p> 
				</item> 
				<item> 
				  <p>It is a prerequisite for comparison of encrypted strings (see
					 <bibref ref="CharReq"/>, 
					 <loc href="http://www.w3.org/TR/WD-charreq#2.7">section
						2.7</loc>).</p> 
				</item> 
			 </ulist></p> 
		</div2> 
		<div2 id="sec-TextNormalization"><head>Definitions for W3C Text
			 Normalization</head> 
		  <div3 id="sec-Unicode-normalized"><head>Unicode-normalized Text</head> 
			 <p>Text data is, for the purposes of this specification,
				<term>Unicode-normalized</term> if it is in a 
				<loc href="#Unicode_Encoding_Form">Unicode encoding form</loc>
				<emph>and</emph> is in Unicode Normalization Form C (according to version 3.1.0
				of <bibref ref="UTR15"/>).</p> 
		  </div3> 
		  <div3 id="sec-fully-normalized"><head>Fully Normalized Text</head> 
			 <p>Text data is <term>fully normalized</term> if: 
				<olist> 
				  <item> 
					 <p>the data is Unicode-normalized <emph>and</emph> does not
						contain any character escapes whose unescaping would cause the data to become
						no longer Unicode-normalized; or</p> 
				  </item> 
				  <item> 
					 <p>the data is in a legacy encoding <emph>and</emph>, if it
						were transcoded to a Unicode encoding form by a 
						<loc href="#def-normalizing-transcoder">normalizing
						  transcoder</loc>, the resulting data would satisfy clause 1 above.</p> 
				  </item> 
				</olist></p> 
			 <p>In the remainder of this specification, <term>normalized</term> is
				used to mean <qterm>fully normalized</qterm>, unless otherwise indicated.</p> 
			 <note> 
				<p>A consequence of this definition is that legacy text (i.e. text
				  in a legacy encoding) is always normalized unless i) a normalizing transcoder
				  cannot exist for that encoding (e.g. because the repertoire contains characters
				  not in Unicode) or ii) the text contains escapes which, once expanded, result
				  in un-normalized text.</p> 
			 </note> 
			 <note> 
				<p>Full normalization is specified against the context of a markup
				  language (or the absence thereof), which specifies the form of escapes. For
				  plain text (no escapes) in a Unicode encoding form, full normalization and
				  Unicode-normalization are equivalent.</p> 
			 </note> 
		  </div3> 
		  <div3 id="sec-normalization-examples"><head>Examples</head> 
			 <p>The string <qterm>suçon</qterm>, expressed as the sequence of five
				characters U+0073 U+0075 U+00E7 U+006F U+006E and encoded in a Unicode encoding
				form, is both Unicode-normalized and fully normalized. The same string encoded
				in a legacy encoding for which there exists a normalizing transcoder would be
				fully normalized but not Unicode-normalized (since not in a Unicode encoding
				form).</p> 
			 <p>In an XML or HTML context, the string <code>su&amp;#xE7;on</code> is also both fully normalized and, if encoded in a Unicode encoding
				form, Unicode-normalized. Expanding &amp;#xE7; yields <code>suçon</code> as above, which contains no replaceable combining sequence.</p> 
			 <p>The string <qterm>suçon</qterm>, expressed as the sequence of
				<emph>six</emph> characters U+0073 U+0075 <emph>U+0063 U+0327</emph> U+006F
				U+006E (U+0327 is the <uname>COMBINING CEDILLA</uname>) and encoded in a
				Unicode encoding form, is neither Unicode-normalized (since the combining
				sequence U+0063 U+0327 is replaceable by the precomposed U+00E7
				<qchar>ç</qchar>) nor fully normalized (since in a Unicode encoding form but
				not Unicode-normalized).</p> 
			 <p>In an XML or HTML context, the string <code>suc&amp;#x0327;on</code> is not fully normalized, regardless of encoding form, because expanding
				&amp;#x0327; yields the sequence <code>suc¸on</code> which is not Unicode-normalized (<qchar>c¸</qchar> is replaceable by
				<qchar>ç</qchar>). Unicode-normalization, however, is defined only for plain
				text: it does not recognize that &amp;#x0327; represents a character in XML or HTML and
				treats the escape as an ordinary sequence of characters. Therefore, the string <code>suc&amp;#x0327;on</code> in a Unicode encoding form <emph>is</emph> Unicode-normalized, since it
				contains no replaceable combining sequence. (This last example does not imply
				that Unicode-normalization is sufficient to meet the normalization requirements
				of the Web; it merely illustrates a case where Unicode-normalization and full
				normalization differ.)</p> 
			 <p>The string <code>&lt;elem&gt;/ foobar&lt;/elem&gt;</code>, where the <qchar>/</qchar> immediately after <code>&lt;elem&gt;</code> stands for the character U+0338 <uname>COMBINING LONG SOLIDUS
				OVERLAY</uname>, is neither Unicode-normalized nor fully normalized, since the
				U+0338 <qchar>/</qchar> combines with the <qchar>&gt;</qchar> (yielding U+226F
				<uname>NOT GREATER-THAN</uname>). </p> 
			 <note> 
				<p>From this example, it follows that it is impossible to produce a
				  normalized XML or HTML document containing the character U+0338
				  <uname>COMBINING LONG SOLIDUS OVERLAY</uname> immediately following an element
				  tag, comment, CDATA section or processing instruction. It is noteworthy that
				  U+0338 <uname>COMBINING LONG SOLIDUS OVERLAY</uname> also combines with
				  <qchar>&lt;</qchar>, yielding U+226E <uname>NOT LESS-THAN</uname>.
				  Consequently, U+0338 <uname>COMBINING LONG SOLIDUS OVERLAY</uname> should
				  remain excluded from XML identifiers.</p> 
			 </note> 
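			 <example><p>The examples in this section can be checked with a short Python
				sketch. It makes several assumptions: the text is already held in a Unicode
				encoding form (a Python string), the context is XML or HTML so that only
				numeric character references count as escapes, and NFC is taken from the
				version of Unicode bundled with the Python
				<code>unicodedata</code> module rather than from version 3.1.0. The function
				names are hypothetical.</p>
			 <eg xml:space="preserve"><![CDATA[
```python
import re
import unicodedata

# Numeric character references in the XML/HTML context (assumed escape syntax).
_NCR = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);")


def expand_escapes(text):
    """Replace each numeric character reference by the character it denotes."""
    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        return chr(int(match.group(2), 10))
    return _NCR.sub(replace, text)


def is_unicode_normalized(text):
    """True if the text, treated as plain text, is in NFC."""
    return unicodedata.normalize("NFC", text) == text


def is_fully_normalized(text):
    """True if the text is in NFC and stays in NFC once escapes are expanded."""
    return is_unicode_normalized(text) and is_unicode_normalized(expand_escapes(text))


samples = {
    "precomposed": "su\u00E7on",
    "decomposed": "suc\u0327on",
    "escaped precomposed": "su&#xE7;on",
    "escaped combining cedilla": "suc&#x0327;on",
    "U+0338 after a tag": "<elem>\u0338foobar</elem>",
}
for label, sample in samples.items():
    print(label, is_unicode_normalized(sample), is_fully_normalized(sample))
```
]]></eg></example>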
		  </div3> 
		</div2> 
		<div2 id="sec-NormalizationApplication"><head>Responsibility for
			 Normalization</head> 
		  <p>This section defines the responsibilities for normalization for
			 various components and situations, based on the goal of early uniform
			 normalization.</p> 
		  <p><req><req-type>C</req-type><req-text>All content on the Web
			 <rfc2119>MUST</rfc2119> be fully normalized.</req-text></req></p> 
		  <p><req><req-type>I</req-type><req-text>Producers
			 <rfc2119>MUST</rfc2119> produce text data in normalized form, unless they are
			 willing to accept the consequences (loss of integrity and security, high
			 probability of rejection by recipients) of un-normalized data.</req-text></req>
			 For the purpose of W3C specifications and their implementations, the producer
			 of text data is the sender of the data in the case of protocols and the tool
			 that produces the data in the case of formats.</p> 
		  <note> 
			 <p>As an optimization, it is perfectly acceptable for a
				<emph>system</emph> to define the producer to be the actual producer (e.g. a
				small device) together with a remote component (e.g. a server serving as a kind
				of proxy) to which normalization is delegated. In such a case, the
				communications channel between the device and proxy server is considered to be
				<emph>internal</emph> to the system, not part of the Web. Only data normalized
				by the proxy server is to be exposed to the Web at large, as shown in the
				illustration below:</p> <figure><image><graphic source="images/producer_proxy.png" height="400" width="500"/><alt>Illustration of a text producer defined as including a proxy.</alt></image> 
			 <caption>Illustration of a text producer defined as including a
				proxy.</caption></figure> 
		  </note> 
		  <note> 
			 <p>Normalization is the responsibility of the producer as a whole.
				This specification does not assign responsibility for normalization to any
				particular component of the producer (for instance a DOM implementation).</p> 
		  </note> 
		  <note> 
			 <p>Implementers of producer software are encouraged to delegate
				normalization to their respective data sources wherever possible. Examples of
				data sources are operating systems, libraries, and keyboard drivers. One way of
				ensuring that user input results in normalized data is to not provide any way
				of creating denormalized data.</p> 
		  </note> 
		  <p><req><req-type>I</req-type><req-text>The recipients of text data
			 <rfc2119>MUST</rfc2119> verify the normalization of data they receive and
			 reject un-normalized data, unless they are willing to accept the consequences
			 (loss of integrity and security) of un-normalized data.</req-text></req>
			 <req><req-type>I</req-type><req-text>Recipients <rfc2119>MUST NOT</rfc2119>
			 normalize the data that they receive.</req-text></req>
			 <req><req-type>I</req-type><req-text>Recipients which transcode text data from
			 a legacy encoding to a Unicode encoding form <rfc2119>MUST</rfc2119> use a 
			 <loc href="#def-normalizing-transcoder">normalizing
				transcoder</loc>.</req-text></req></p> 
		  <note> 
			 <p>The prohibition of normalization by recipients is necessary to
				avoid the security issues mentioned in section
				<specref ref="sec-NormalizationMotivation"/>.</p> 
		  </note> 
		  <p><req><req-type>I</req-type><req-text>When a recipient returns
			 un-normalized text to a sender (e.g. to indicate an error or fault), that
			 recipient <rfc2119>MAY</rfc2119> return the text without normalizing
			 it.</req-text></req></p> 
		  <p><req><req-type>I</req-type><req-text>If a software module functions
			 as both a producer and a recipient of text data (e.g. a browser/editor),
			 normalization <rfc2119>MUST</rfc2119> be applied in the producer part but
			 <rfc2119>MUST NOT</rfc2119> be applied in the recipient
			 part.</req-text></req></p> 
		  <p><req><req-type>I</req-type><req-text>Intermediate
			 (recipient/producer) components whose role involves modification of text data
			 <rfc2119>MUST</rfc2119> ensure that their modifications do not result in
			 denormalization of any data exposed (sent on the network, saved to disk,
			 returned in an API call, etc.).</req-text></req></p> 
		  <note> 
			 <p>Consequently, an intermediate component, in a system that packages
				a payload in some control information, may modify the control information
				without having to renormalize the payload.</p> 
		  </note> 
		  <p><req><req-type>I</req-type><req-text>Software
			 <rfc2119>MUST</rfc2119> behave <emph>as if</emph> normalization took place
			 after each modification, so that any subsequent matching, indexing or other
			 normalization-sensitive operations always behave <emph>as if</emph> they were
			 dealing with normalized data.</req-text></req></p><example><p>If the
		  <qchar>z</qchar> is deleted from the (normalized) string <code>cz¸</code> (where <qchar>¸</qchar> represents a combining cedilla, U+0327),
		  normalization is necessary to turn the denormalized result <code>c¸</code> into the properly normalized <code>ç</code>. Analogous cases exist for insertion and concatenation. If the software
		  that deletes the <qchar>z</qchar> later uses the string in a
		  normalization-sensitive operation, it needs to normalize the string before this
		  operation to ensure correctness; otherwise, normalization may be deferred until
		  the data is exposed.</p></example> 
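		  <example><p>A minimal Python sketch of the deletion and concatenation
			 cases follows. It assumes that the editing software renormalizes immediately
			 after each modification, which is only one way of satisfying the
			 <emph>as if</emph> rule; the helper names <code>delete_char</code> and
			 <code>concatenate</code> are hypothetical.</p>
		  <eg xml:space="preserve"><![CDATA[
```python
import unicodedata

CEDILLA = "\u0327"  # COMBINING CEDILLA


def delete_char(text, index):
    """Delete the character at `index`, then renormalize the result to NFC."""
    edited = text[:index] + text[index + 1:]
    return unicodedata.normalize("NFC", edited)


def concatenate(left, right):
    """Concatenate two strings, then renormalize the result to NFC."""
    return unicodedata.normalize("NFC", left + right)


original = "cz" + CEDILLA  # normalized: no precomposed z with cedilla exists
assert unicodedata.normalize("NFC", original) == original

# Deleting the z leaves c followed by a combining cedilla, which NFC composes.
print(delete_char(original, 1) == "\u00E7")

# Concatenating two normalized strings can also denormalize the result.
print(concatenate("c", CEDILLA) == "\u00E7")
```
]]></eg></example>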
		  <p><req><req-type>I</req-type><req-text>Intermediate components whose
			 role does not involve modification of the data (e.g. caching proxies)
			 <rfc2119>MUST NOT</rfc2119> reject un-normalized data and <rfc2119>MUST
			 NOT</rfc2119> perform normalization.</req-text></req></p> 
		  <p><req><req-type>S</req-type><req-text>In specifications of markup
			 languages, syntax-significant characters <rfc2119>MUST</rfc2119> be chosen that
			 do not combine with any other characters in NFC.</req-text></req> This is to
			 avoid problems such as U+0338 <uname>COMBINING LONG SOLIDUS OVERLAY</uname>
			 combining with the <qchar>&lt;</qchar> and <qchar>&gt;</qchar> delimiters in
			 XML (see the last example in section <specref ref="sec-normalization-examples"/> above).</p> 
		</div2> 
	 </div1> 
	 <div1 id="sec-Compatibility"><head>Compatibility and Formatting
		  Characters</head> 
		<p>This specification does not address the suitability of particular
		  characters for use in markup languages, in particular formatting characters and
		  compatibility equivalents. For detailed recommendations about the use of
		  compatibility and formatting characters, see <titleref>Unicode in XML and other
		  Markup Languages</titleref> <bibref ref="UXML"/>.</p> 
		<p><req><req-type>S</req-type><req-text>Specifications
		  <rfc2119>SHOULD</rfc2119> exclude compatibility characters in the syntactic
		  elements (markup, delimiters, identifiers) of the formats they
		  define.</req-text></req></p> 
	 </div1> 
	 <div1 id="sec-IdentityMatching"><head>String Identity Matching</head> 
		<p>One important operation that depends on early normalization is
		  <term>string identity matching</term> <bibref ref="CharReq"/>, which is a
		  subset of the more general problem of string matching. There are various
		  degrees of specificity for string matching, from approximate matching such as
		  regular expressions or phonetic matching, to more specific matches such as
		  case-insensitive or accent-insensitive matching and finally to identity
		  matching. In the Web environment, where multiple encodings are used to
		  represent strings, including some encodings which allow multiple
		  representations for the same thing, <term>identity</term> is defined to occur
		  if and only if the compared strings contain no user-identifiable distinctions.
		  This definition is such that strings do not match when they differ in case or
		  accentuation, but do match when they differ only in non-semantically
		  significant ways such as encoding, use of escapes (of potentially different
		  kinds), or use of precomposed vs. decomposed character sequences.</p> 
		<p id="sid-steps">To avoid unnecessary conversions and, more importantly,
		  to ensure predictability and correctness, it is necessary for all components of
		  the Web to use the same identity testing mechanism. Conformance to the rule
		  that follows meets this requirement and supports the above definition of
		  identity. <req><req-type>S</req-type><req-type>I</req-type><req-text>String
		  identity matching <rfc2119>MUST</rfc2119> be performed as if the following
		  steps were followed:</req-text> 
		  <olist> 
			 <item> 
				<p>Early uniform normalization to fully normalized form, as defined
				  in <specref ref="sec-fully-normalized"/>. In accordance with section
				  <specref ref="sec-Normalization"/>, this step <rfc2119>MUST</rfc2119> be
				  performed by the <emph>producers</emph> of the strings to be compared.</p> 
			 </item> 
			 <item> 
				<p>Conversion to a common encoding of UCS, if necessary.</p> 
			 </item> 
			 <item> 
				<p>Expansion of all escapes.</p> 
			 </item> 
			 <item> 
				<p>Testing for bit-by-bit identity.</p> 
			 </item> 
		  </olist></req></p> 
		<p>Step 1 ensures 1) that the identity matching process can produce
		  correct results using the next three steps and 2) that a minimum of effort is
		  spent on solving the problem.</p> 
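		<example><p>Steps 2 to 4 can be sketched in Python as follows. The sketch
		  assumes an XML or HTML context in which numeric character references are the
		  only escapes, and it assumes that comparing two Python strings code point by
		  code point is an acceptable stand-in for bit-by-bit comparison in a common
		  encoding. Step 1 is deliberately absent, because it is the responsibility of
		  the producers. The function names are hypothetical.</p>
		<eg xml:space="preserve"><![CDATA[
```python
import re

# Numeric character references in the XML/HTML context (assumed escape syntax).
_NCR = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);")


def expand_escapes(text):
    """Replace each numeric character reference by the character it denotes."""
    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        return chr(int(match.group(2), 10))
    return _NCR.sub(replace, text)


def identity_match(data_a, encoding_a, data_b, encoding_b):
    """Compare two encoded strings for identity; no normalization is performed."""
    # Step 2: conversion to a common encoding of UCS.
    text_a = data_a.decode(encoding_a)
    text_b = data_b.decode(encoding_b)
    # Step 3: expansion of all escapes.
    text_a = expand_escapes(text_a)
    text_b = expand_escapes(text_b)
    # Step 4: identity test on the resulting code point sequences.
    return text_a == text_b


precomposed = "su\u00E7on"
print(identity_match(precomposed.encode("utf-8"), "utf-8",
                     b"su&#xE7;on", "us-ascii"))
print(identity_match(precomposed.encode("iso-8859-1"), "iso-8859-1",
                     precomposed.encode("utf-16"), "utf-16"))
```
]]></eg></example>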
		<p><req><req-type>S</req-type><req-type>I</req-type><req-text>Forms of
		  string matching other than identity <rfc2119>SHOULD</rfc2119> be based on the 
		  <loc href="#sid-steps">steps</loc> specified in this document for
		  string identity matching.</req-text></req> Taking into account normalization
		  and escapes is necessary so that, for example, a case-insensitive match of <code>suçon</code> against <code>su&amp;#xE7;on</code> or against <code>SUC¸ON</code> returns <kw>TRUE</kw>.</p> 
		<note> 
		  <p>The expansion of escapes (step 3 above) is dependent on context,
			 i.e. on which markup or programming language is considered to apply when the
			 string matching operation is performed. Consider a search for the string
			 <qterm>suçon</qterm> in an XML document containing <code>su&amp;#xE7;on</code> but not <code>suçon</code>. If the search is performed in a plain text editor, the context is
			 <term>plain text</term> (no markup or programming language applies), the
			 &amp;#xE7; escape is not recognized, hence not expanded and the search fails.
			 If the search is performed in an XML browser, the context is <term>XML</term>,
			 the escape (defined by XML) is expanded and the search succeeds. </p> 
		  <p>An intermediate case would be an XML editor that
			 <emph>purposefully</emph> provides a view of an XML document with entity
			 references left unexpanded. In that case, a search over that pseudo-XML view
			 will deliberately <emph>not</emph> expand entities: in that particular context,
			 entity references are not considered escapes and need not be expanded.</p> 
		</note> 
	 </div1> 
	 <div1 id="sec-Indexing"><head>String Indexing</head> 
		<p>There are many situations where a software process needs to access a
		  substring or to point within a string and does so by the use of
		  <term>indices</term>, i.e. numeric <quote>positions</quote> within a string.
		  Where such indices are exchanged between components of the Web, there is a need
		  for an agreed-upon definition of string indexing in order to ensure consistent
		  behavior. The requirements for string indexing are discussed in
		  <titleref>Requirements for String Identity Matching and String Indexing</titleref>
		  <bibref ref="CharReq"/>, 
		  <loc href="http://www.w3.org/TR/WD-charreq#4">section 4</loc>. The two
		  main questions that arise are: <quote>What is the unit of counting?</quote> and
		  <quote>Do we start counting at 0 or 1?</quote>.</p> 
		<p>Depending on the particular requirements of a process, the unit of
		  counting may correspond to any of the definitions of a string provided in
		  section <specref ref="sec-Strings"/>. In particular: 
		  <ulist> 
			 <item> 
				<p><req><req-type>S</req-type><req-type>I</req-type><req-text>The
				  <termref def="def-character-string">character string</termref> is
				  <rfc2119>RECOMMENDED</rfc2119> as a basis for string indexing.</req-text></req>
				  (Example: the XML Path Language <bibref ref="xpath"/>).</p> 
			 </item> 
			 <item> 
				<p><req><req-type>S</req-type><req-type>I</req-type><req-text>A
				  <termref def="def-physical-string">code unit string</termref>
				  <rfc2119>MAY</rfc2119> be used as a basis for string indexing if this results
				  in a significant improvement in the efficiency of internal operations when
				  compared to the use of character string.</req-text></req> (Example: the use of
				  UTF-16 in <bibref ref="dom1"/>).</p> 
			 </item> 
		  </ulist></p> 
		<p>Counting <termref def="def-grapheme-string">graphemes</termref> will
		  become a good option where user interaction is the primary concern, once a
		  suitable definition is widely accepted.</p> 
		<p>It is noteworthy that there exist other, non-numeric ways of
		  identifying substrings which have favorable properties. For instance,
		  substrings based on string matching are quite robust against small edits;
		  substrings based on document structure (in structured formats such as XML) are
		  even more robust against edits and even against translation of a document from
		  one human language to another.
		  <req><req-type>S</req-type><req-text>Specifications that need a way to identify
		  substrings or point within a string <rfc2119>SHOULD</rfc2119> provide ways
		  other than string indexing to perform this operation.</req-text></req>
		  <req><req-type>I</req-type><req-type>C</req-type><req-text>Users of
		  specifications (software developers, content developers)
		  <rfc2119>SHOULD</rfc2119> whenever possible prefer ways other than string
		  indexing to identify substrings or point within a string.</req-text></req></p> 
		<p>Experience shows that more general, flexible and robust specifications
		  result when individual characters are understood and processed as substrings,
		  identified by a position before and a position after the substring.
		  Understanding indices as boundary positions <emph>between</emph> the counting
		  units also makes it easier to relate the indices resulting from the different
		  string definitions. <req><req-type>S</req-type><req-text>Specifications
		  <rfc2119>SHOULD</rfc2119> understand and process single characters as
		  substrings, and treat indices as boundary positions <emph>between</emph>
		  counting units, regardless of the choice of counting
		  units.</req-text></req></p> 
		<p><req><req-type>S</req-type><req-text>Specifications of APIs
		  <rfc2119>SHOULD NOT</rfc2119> specify single character or single encoding-unit
		  arguments.</req-text></req></p><example><p><code>uppercase('ß')</code> cannot return the proper result (the two-character string
		<qchar>SS</qchar>) if the return type of the <function>uppercase</function>
		function is defined to be a single character.</p></example> 
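		<example><p>As a small illustration, the Python sketch below defines a
		  hypothetical <code>uppercase</code> function that takes and returns strings
		  rather than single characters, so that the result can be longer than the
		  argument. It relies on Python's built-in <code>str.upper</code>, which is an
		  assumption about the implementation language and not part of this
		  specification.</p>
		<eg xml:space="preserve"><![CDATA[
```python
def uppercase(text):
    """String in, string out: the result may be longer than the input."""
    return text.upper()


result = uppercase("\u00DF")  # U+00DF is LATIN SMALL LETTER SHARP S
print(result, len(result))
```
]]></eg></example>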
		<p>The issue of index origin, i.e. whether we count from 0 or 1, actually
		  arises only after a decision has been made on whether it is the units
		  themselves that are counted or the positions between the units.
		  <req><req-type>S</req-type><req-text>When the positions between the units are
		  counted for string indexing, starting with an index of 0 for the position at
		  the start of the string is the <rfc2119>RECOMMENDED</rfc2119> solution, with
		  the last index then being equal to the number of counting units in the
		  string.</req-text></req></p> 
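		<example><p>The Python sketch below illustrates how the choice of counting
		  unit changes the indices, and how treating indices as boundary positions
		  starting at 0 relates the two counts. It assumes that a Python string models a
		  character string and that its UTF-16 serialization models a code unit string;
		  all function names are hypothetical.</p>
		<eg xml:space="preserve"><![CDATA[
```python
def count_characters(text):
    """Number of counting units when the unit is the character."""
    return len(text)


def count_utf16_code_units(text):
    """Number of counting units when the unit is the UTF-16 code unit."""
    return len(text.encode("utf-16-le")) // 2


def substring(text, start, end):
    """Characters between two boundary positions; position 0 is the string start."""
    return text[start:end]


def to_utf16_position(text, position):
    """Convert a boundary position counted in characters to UTF-16 code units."""
    return count_utf16_code_units(text[:position])


sample = "a\U00010000b"  # U+10000 needs two UTF-16 code units

print(count_characters(sample))        # boundary positions run from 0 to this value
print(count_utf16_code_units(sample))
# A single character is the substring between two adjacent boundary positions.
print(substring(sample, 1, 2) == "\U00010000")
print(to_utf16_position(sample, 2))
```
]]></eg></example>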
	 </div1> 
	 <div1 id="sec-URIs"><head>Character Encoding in URI References</head> 
		<p>According to the definition in RFC 2396 <bibref ref="rfc2396"/>, URI
		  references are restricted to a subset of US-ASCII, with an escaping mechanism
		  to encode arbitrary byte values, using the %HH convention. However, the %HH
		  convention by itself is of limited use because there is no definitive mapping
		  from characters to bytes. Also, non-ASCII characters cannot be used directly.
		  <titleref>Internationalized Resource Identifiers (IRI)</titleref>
		  <bibref ref="uri-i18n"/> solves both problems with a uniform approach that
		  conforms to the 
		  <loc href="#sec-RefProcModel">Reference Processing Model</loc>. </p> 
		<p><req><req-type>S</req-type><req-text>W3C specifications that define
		  protocol or format elements (e.g. HTTP headers, XML attributes, etc.) which are
		  to be interpreted as URI references (or specific subsets of URI references,
		  such as absolute URI references, URIs, etc.) <rfc2119>MUST</rfc2119> use
		  <titleref>Internationalized Resource Identifiers (IRI)</titleref>
		  <bibref ref="uri-i18n"/> (or an appropriate subset thereof).</req-text></req>
		  <req><req-type>S</req-type><req-text>W3C specifications <rfc2119>MUST</rfc2119>
		  define when the conversion from IRI references to URI references (or subsets
		  thereof) takes place, in accordance with <titleref>Internationalized Resource
		  Identifiers (IRI)</titleref> <bibref ref="uri-i18n"/>.</req-text></req></p> 
		<note> 
		  <p>Many current W3C specifications already contain provisions in
			 accordance with <titleref>Internationalized Resource Identifiers
			 (IRI)</titleref> <bibref ref="uri-i18n"/>. For XML 1.0 <bibref ref="xml10"/>,
			 see <xspecref href="http://www.w3.org/TR/REC-xml#sec-external-ent">Section
			 4.2.2, External Entities</xspecref>, and
			 <xspecref href="http://www.w3.org/XML/xml-V10-2e-errata#E26">Erratum
			 E26</xspecref>. XML Schema Part 2: Datatypes <bibref ref="xmlschema-2"/>
			 provides the <kw>anyURI</kw> datatype (see
			 <xspecref href="http://www.w3.org/TR/xmlschema-2/#anyURI">Section
			 3.2.17</xspecref>). The XML Linking Language (XLink) <bibref ref="xlink"/>
			 provides the href attribute (see
			 <xspecref href="http://www.w3.org/TR/xlink/#link-locators">Section 5.4, Locator
			 Attribute</xspecref>). Further information and links can be found at
			 <titleref>Internationalization: URIs and other identifiers</titleref>
			 <bibref ref="i18nuri"/>.</p> 
		</note> 
		<p><req><req-type>S</req-type><req-text>W3C specifications that define
		  new syntax for URIs, such as a new URI scheme or a new kind of fragment
		  identifier, <rfc2119>MUST</rfc2119> specify that characters outside the
		  US-ASCII repertoire are encoded using UTF-8 and %HH-escaping, in accordance
		  with <titleref>Guidelines for new URL Schemes</titleref> <bibref ref="rfc2718"/>, Section 2.2.5.</req-text></req> This will make sure that these
		  schemes or fragment identifiers can be used in IRIs in the natural way.</p> 
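		<example><p>The encoding of non-ASCII characters using UTF-8 and
		  %HH-escaping can be sketched in Python with the standard
		  <code>urllib.parse</code> module. The wrapper names are hypothetical, and the
		  set of characters that <code>quote</code> leaves unescaped follows that
		  library's own rules, which are assumed here to be adequate for illustration and
		  may differ in detail from RFC 2396.</p>
		<eg xml:space="preserve"><![CDATA[
```python
from urllib.parse import quote, unquote


def percent_encode(text):
    """Encode text as UTF-8 and %HH-escape everything outside the unreserved set."""
    return quote(text, safe="", encoding="utf-8")


def percent_decode(escaped):
    """Undo %HH-escaping, interpreting the resulting bytes as UTF-8."""
    return unquote(escaped, encoding="utf-8")


escaped = percent_encode("su\u00E7on")
print(escaped)
print(percent_decode(escaped) == "su\u00E7on")
```
]]></eg></example>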
	 </div1> 
	 <div1 id="sec-RefUnicode"><head>Referencing the Unicode Standard and
		  ISO/IEC 10646</head> 
		<p>Specifications often need to make references to the Unicode standard
		  or International Standard ISO/IEC 10646. Such references must be made with
		  care, especially when normative. The questions to be considered are: 
		  <ulist> 
			 <item> 
				<p>Which standard should be referenced?</p> 
			 </item> 
			 <item> 
				<p>How to reference a particular version?</p> 
			 </item> 
			 <item> 
				<p>When to use versioned vs. unversioned references?</p> 
			 </item> 
		  </ulist></p> 
		<p>ISO/IEC 10646 is developed and published jointly by 
		  <loc href="http://www.iso.ch/">ISO</loc> (the International
		  Organisation for Standardisation) and 
		  <loc href="http://www.iec.ch/">IEC</loc> (the International
		  Electrotechnical Commission). The Unicode Standard is developed and published
		  by the 
		  <loc href="http://www.unicode.org/">Unicode Consortium</loc>, an
		  organization of major computer corporations, software producers, database
		  vendors, national governments, research institutions, international agencies,
		  various user groups, and interested individuals. The Unicode Standard is
		  comparable in standing to W3C Recommendations.</p> 
		<p>ISO/IEC 10646 and Unicode define exactly the same CCS (same
		  repertoire, same code points) and encoding forms. They are actively maintained
		  in synchrony by liaisons and overlapping membership between the respective
		  technical committees. In addition to the jointly defined CCS and encoding
		  forms, the Unicode Standard adds normative and informative lists of character
		  properties, normative character equivalence and normalization specifications, a
		  normative algorithm for bidirectional text and a large amount of useful
		  implementation information. In short, Unicode adds semantics to the characters
		  that ISO/IEC 10646 merely enumerates. Conformance to Unicode implies
		  conformance to ISO/IEC 10646; see <bibref ref="unicode30"/> Appendix C.</p> 
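		<p>The character properties that the Unicode Standard adds can be inspected programmatically. The sketch below uses Python's standard <code>unicodedata</code> module to look up the name, general category and bidirectional class of three characters. The helper name <code>describe</code> is an assumption of this example, and the output depends on the Unicode Character Database shipped with the Python interpreter.</p>
<eg xml:space="preserve">
```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, name, general category, bidi class) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


for ch in ("F", "\u00e7", "\u0644"):
    print(describe(ch))
# ('U+0046', 'LATIN CAPITAL LETTER F', 'Lu', 'L')
# ('U+00E7', 'LATIN SMALL LETTER C WITH CEDILLA', 'Ll', 'L')
# ('U+0644', 'ARABIC LETTER LAM', 'Lo', 'AL')
```
</eg>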
		<p><req><req-type>S</req-type><req-text>Since specifications in general
		  need both a definition for their characters and the semantics associated with
		  these characters, specifications <rfc2119>SHOULD</rfc2119> include a reference
		  to the Unicode Standard, whether or not they include a reference to ISO/IEC
		  10646.</req-text></req> A reference to the Unicode Standard gives
		  implementers access to the wealth of information provided in the
		  standard and on the Unicode Consortium Web site.</p> 
		<p>The fact that both ISO/IEC 10646 and Unicode are evolving (in
		  synchrony) raises the issue of versioning: should a specification refer to a
		  specific version of the standard, or should it make a generic reference, so
		  that the normative reference is to the version current at the time of
		  <emph>reading</emph> the specification? In general the answer is
		  <emph>both</emph>. <req><req-type>S</req-type><req-text>A generic reference to
		  the Unicode Standard <rfc2119>MUST</rfc2119> be made if it is desired that
		  characters allocated after a specification is published are usable with that
		  specification. A specific reference to the Unicode Standard
		  <rfc2119>MAY</rfc2119> be included to ensure that functionality depending on a
		  particular version is available and will not change over time (an example would
		  be the set of characters acceptable as Name characters in XML 1.0
		  <bibref ref="xml10"/>, which is an enumerated list that parsers must implement
		  to validate names).</req-text></req></p> 
		<note> 
		  <p>See <loc href="http://www.unicode.org/unicode/standard/versions/#Citations">http://www.unicode.org/unicode/standard/versions/#Citations</loc> for guidance
			 on referring to specific versions of Unicode.</p> 
		</note> 
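		<p>The practical difference between a specific and a generic reference can be sketched with Python's <code>unicodedata</code> module, which exposes both its current database and a frozen Unicode 3.2.0 database (<code>unicodedata.ucd_3_2_0</code>). The helper <code>is_assigned</code> and the choice of U+20B9 INDIAN RUPEE SIGN (allocated in Unicode 6.0) are assumptions made for this illustration. The helper tests only whether a character has a name, which is a simplification.</p>
<eg xml:space="preserve">
```python
import unicodedata

RUPEE = "\u20b9"  # INDIAN RUPEE SIGN, allocated in Unicode 6.0


def is_assigned(ch: str, database=unicodedata) -> bool:
    """Return True if the given database version knows a name for ch."""
    return database.name(ch, None) is not None


# Version of the database behind the "generic" lookup; varies by interpreter.
print(unicodedata.unidata_version)

# A lookup pinned to version 3.2.0 does not know the later character.
print(is_assigned(RUPEE, unicodedata.ucd_3_2_0))  # False

# The current database (Unicode 6.0 or later in Python 3.2 and up) does.
print(is_assigned(RUPEE))  # True
```
</eg>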
		<p>A generic reference can be formulated in two ways: 
		  <olist> 
			 <item> 
				<p>By explicitly including a <emph>generic</emph> entry in the
				  bibliography section of a specification and simply referring to that entry in
				  the body of the specification. Such a generic entry contains text such as
				  <quote>... as it may from time to time be revised or amended</quote>.</p> 
			 </item> 
			 <item> 
				<p>By including a <emph>specific</emph> entry in the bibliography
				  and adding text such as <quote>... as it may from time to time be revised or
				  amended</quote> at the point of reference in the body of the specification.</p>
				
			 </item> 
		  </olist></p> 
		<p>It is an editorial matter, best left to each specification, which of
		  these two formulations is used. Examples of the first formulation can be found
		  in the bibliography of this specification (see the entries for
		  <bibref ref="iso10646"/> and <bibref ref="unicode"/>). Examples of the second formulation,
		  as well as a discussion of the versioning issue with respect to MIME
		  <kw>charset</kw> parameters for UCS encodings, can be found in
		  <bibref ref="rfc2279"/> and <bibref ref="rfc2781"/>.</p> 
		<p><req><req-type>S</req-type><req-text>All <emph>generic</emph>
		  references to Unicode <rfc2119>MUST</rfc2119> refer to Unicode 3.0 <bibref ref="unicode30"/> or later.</req-text></req>
		  <req><req-type>S</req-type><req-text>Generic references to ISO/IEC 10646
		  <rfc2119>MUST</rfc2119> be written such that they make allowance for the future
		  publication of additional <emph>parts</emph> of the standard. They
		  <rfc2119>MUST</rfc2119> refer to ISO/IEC 10646-1:2000
		  <bibref ref="iso10646-2000"/> or later, including any
		  amendments.</req-text></req></p> 
	 </div1> 
  </body> 
  <back> 
	 <div1 id="sec-CharExamples"><head>Examples of Characters, Keystrokes and
		  Glyphs</head> 
		<p id="exampleA6">A few examples will help make sense of all this complexity
		  of text in computers (which is mostly a reflection of the complexity of human
		  writing systems). Let us start with a very simple example: a user, equipped
		  with a US-English keyboard, types <quote>Foo</quote>, which the computer
		  encodes as 16-bit values (the UTF-16 encoding of Unicode) and displays on the
		  screen.</p> <figure><table border="1" cellpadding="5" cellspacing="0"><tbody><tr><th align="right">Keystrokes</th><td align="center">Shift-f</td><td align="center">o</td><td align="center">o</td></tr><tr><th align="right">Input characters</th><td align="center">F</td><td align="center">o</td><td align="center">o</td></tr><tr><th align="right">Encoded characters (byte values
				  in hex)</th><td align="center">0046</td><td align="center">006F</td><td align="center">006F</td></tr><tr><th align="right">Display</th><td colspan="3" align="center">Foo</td></tr></tbody></table> 
		<caption>Example A.1: Basic Latin</caption></figure> 
		<p>The only complexity here is the use of a modifier (Shift) to input the
		  capital <qchar>F</qchar>.</p> 
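		<p>The encoded values in Example A.1 can be reproduced with a few lines of Python. This is a minimal sketch; the helper name <code>utf16_code_units</code> is an assumption of this example, and big-endian byte order is chosen only so that the hexadecimal values read naturally.</p>
<eg xml:space="preserve">
```python
def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of text as four-digit hex strings."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [
        f"{int.from_bytes(data[i:i + 2], 'big'):04X}"
        for i in range(0, len(data), 2)
    ]


print(utf16_code_units("Foo"))  # ['0046', '006F', '006F']
```
</eg>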
		<p>A slightly more complex example is a user typing <qchar>çé</qchar> on
		  a traditional French-Canadian keyboard, which the computer again encodes in
		  UTF-16 and displays. We assume that this particular computer stores text in
		  fully composed (precomposed) form.</p><figure><table border="1" cellpadding="5" cellspacing="0"><tbody><tr><th align="right">Keystrokes</th><td align="center">
				  ¸ </td><td align="center">c</td><td align="center">é</td></tr><tr><th align="right">Input characters</th><td colspan="2" align="center">ç</td><td align="center">é</td></tr><tr><th align="right">Encoded characters (byte values
				  in hex)</th><td colspan="2" align="center">00E7</td><td align="center">00E9</td></tr><tr><th align="right">Display</th><td colspan="3" align="center">çé</td></tr></tbody></table> 
		<caption>Example A.2: Latin with diacritics</caption></figure> 
		<p>A few interesting things are happening here: when the user types the
		  cedilla (<qchar>¸</qchar>), nothing happens except for a change of state of the
		  keyboard driver; the cedilla is a <term>dead key</term>. When the driver gets
		  the c keystroke, it provides a complete <qchar>ç</qchar> character to the
		  system, which represents it as a single 16-bit code unit and displays a
		  <qchar>ç</qchar> glyph. The user then presses the dedicated <qchar>é</qchar>
		  key, which results in, again, a character represented by two bytes. Most
		  systems will display this as one glyph, but it is also possible to combine two
		  glyphs (the base letter and the accent) to obtain the same rendering.</p> 
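		<p>The relationship between the fully composed characters of Example A.2 and their decomposed equivalents can be checked with the normalization functions in Python's <code>unicodedata</code> module. The variable names are assumptions of this sketch.</p>
<eg xml:space="preserve">
```python
import unicodedata

composed = "\u00e7\u00e9"  # precomposed c-cedilla and e-acute
decomposed = unicodedata.normalize("NFD", composed)

print([f"{ord(c):04X}" for c in composed])    # ['00E7', '00E9']
print([f"{ord(c):04X}" for c in decomposed])  # ['0063', '0327', '0065', '0301']

# Recomposing yields the original two characters again.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```
</eg>
		<p>Both sequences are canonically equivalent: the composed form uses two characters, the decomposed form uses four (each base letter followed by a combining mark).</p>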
		<p>On to a Japanese example: our user employs a <term>romaji input
		  method</term> to type "<image><graphic source="images/nihongo.gif" width="47" height="16"/><alt>nihongo in kanji characters</alt></image>", which the
		  computer encodes in UTF-16 and displays.</p><figure><table border="1" cellpadding="5" cellspacing="0"><tbody><tr><th align="right">Keystrokes</th><td align="center" colspan="4"> n i h o n g o &lt;space&gt;
				  &lt;return&gt;</td></tr><tr><th align="right">Input characters</th><td colspan="2" align="center"><image><graphic source="images/ni.gif" width="14" height="16"/> <alt>kanji character ni</alt></image></td><td align="center"><image><graphic source="images/hon.gif" width="15" height="16"/>
				  <alt>kanji character hon</alt></image></td><td align="center"><image><graphic source="images/go.gif" width="16" height="16"/><alt>kanji character go</alt></image></td></tr><tr><th align="right">Encoded characters (byte values
				  in hex)</th><td colspan="2" align="center">65E5</td><td align="center">672C</td><td align="center">8A9E</td></tr><tr><th align="right">Display</th><td colspan="4" align="center"><image><graphic source="images/nihongo.gif" width="47" height="16"/><alt>nihongo in kanji characters</alt></image></td></tr></tbody></table> 
		<caption>Example A.3: Japanese</caption></figure> 
		<p>The interesting aspect here is input: the user types Latin characters,
		  which are converted on the fly to kana (not shown here), and then to kanji when
		  the user requests conversion by pressing &lt;space&gt;; the kanji characters
		  are finally sent to the application when the user presses &lt;return&gt;. The
		  user has to type a total of nine keystrokes before the three characters are
		  produced, which are then encoded and displayed rather trivially.</p> 
		<p>An Arabic example will show different phenomena:</p> <figure><table border="1" cellpadding="5" cellspacing="0"><tbody><tr><th align="right">Keystrokes</th><td align="center" colspan="2"><image><graphic source="images/lam.gif" width="15" height="25"/><alt>Arabic lam</alt></image>
				  </td><td align="center"><image><graphic source="images/alif.gif" width="11" height="25"/><alt>Arabic alef</alt></image></td><td align="center" colspan="2"><image><graphic source="images/lamalif.gif" width="15" height="25"/> <alt>Arabic lam-alef</alt></image></td><td align="center"><image><graphic source="images/ghayn.gif" width="18" height="25"/><alt>Arabic ghayn</alt></image></td><td align="center"><image><graphic source="images/ghayn.gif" width="18" height="25"/> <alt>Arabic ghayn</alt></image></td></tr><tr><th align="right">Input characters</th><td colspan="2" align="center"><image><graphic source="images/lam.gif" width="15" height="25"/>
				  <alt>Arabic lam</alt></image></td><td align="center"><image><graphic source="images/alif.gif" width="11" height="25"/> <alt>Arabic alef</alt></image></td><td align="center"><image><graphic source="images/lam.gif" width="15" height="25"/> <alt>Arabic lam</alt></image></td><td align="center"><image><graphic source="images/alif.gif" width="11" height="25"/><alt>Arabic alef</alt></image></td><td align="center"><image><graphic source="images/ghayn.gif" width="18" height="25"/> <alt>Arabic ghayn</alt></image></td><td align="center"><image><graphic source="images/ghayn.gif" width="18" height="25"/> <alt>Arabic ghayn</alt></image></td></tr><tr><th align="right">Encoded characters (byte
				  values in hex)</th><td colspan="2" align="center">0644</td><td align="center">0627</td><td align="center">0644</td><td align="center">0627</td><td align="center">0639</td><td align="center">0639</td></tr><tr><th align="right">Display</th><td colspan="7" align="center"><image><graphic source="images/arabe.gif" width="42" height="26"/> <alt>A few Arabic letters.</alt></image></td></tr></tbody></table> 
		<caption>Example A.4: Arabic</caption></figure> 
		<p>Here the first two keystrokes each produce an input character and an
		  encoded character, but the pair is displayed as a single glyph
		  ('<image><graphic source="images/lamalif.gif" width="15" height="25"/> 
		  <alt>Arabic lam-alef</alt></image>', a lam-alef ligature). The next keystroke
		  is a lam-alef, which some Arabic keyboards have; it produces the same two
		  characters which are displayed similarly, but this second lam-alef is placed to
		  the <emph>left</emph> of the first one when displayed. The last two keystrokes
		  produce two identical characters which are rendered by two different glyphs (a
		  medial form followed to its left by a final form). We thus have 5 keystrokes
		  producing 6 characters and 4 glyphs laid out right-to-left.</p> 
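		<p>The keystroke and character counts of Example A.4 can be verified with a short Python sketch. The list <code>keystrokes</code> models a hypothetical keyboard layout in which the lam-alef key emits two characters; the code points are those given in the table. Glyph selection is the job of the rendering engine and is not modelled here.</p>
<eg xml:space="preserve">
```python
# Hypothetical layout: each entry is the character sequence one keystroke emits.
keystrokes = ["\u0644", "\u0627", "\u0644\u0627", "\u0639", "\u0639"]

# The stored text is the concatenation, in logical (typing) order.
text = "".join(keystrokes)
code_points = [f"{ord(c):04X}" for c in text]

print(len(keystrokes), len(text))  # 5 6
print(code_points)  # ['0644', '0627', '0644', '0627', '0639', '0639']
```
</eg>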
		<p id="sec-CharExamplesA5">A final example in Tamil, typed with an ISCII
		  keyboard, will illustrate some additional phenomena:</p><figure><table border="1" cellpadding="5" cellspacing="0"><tbody><tr><th align="right">Keystrokes</th><td align="center" colspan="2"><image><graphic source="images/ta-tm.gif" width="17" height="18"/><alt>Tamil ta</alt></image>
				  </td><td align="center"><image><graphic source="images/a-tm.gif" width="17" height="17"/> <alt>Tamil aa</alt></image></td><td align="center"><image><graphic source="images/na-tm.gif" width="18" height="18"/><alt>Tamil na</alt></image></td><td align="center"><image><graphic source="images/virama-tm.gif" width="10" height="19"/><alt>Tamil virama</alt></image></td><td align="center"><image><graphic source="images/ka-tm.gif" width="15" height="19"/><alt>Tamil ka</alt></image></td><td align="center"><image><graphic source="images/o-tm.gif" width="24" height="19"/><alt>Tamil o</alt></image></td></tr><tr><th align="right">Input characters</th><td colspan="2" align="center"><image><graphic source="images/ta-tm.gif" width="17" height="18"/> <alt>Tamil ta</alt></image></td><td align="center"><image><graphic source="images/a-tm.gif" width="17" height="17"/> <alt>Tamil aa</alt></image></td><td align="center"><image><graphic source="images/na-tm.gif" width="18" height="18"/><alt>Tamil na</alt></image></td><td align="center"><image><graphic source="images/virama-tm.gif" width="10" height="19"/><alt>Tamil virama</alt></image></td><td align="center"><image><graphic source="images/ka-tm.gif" width="15" height="19"/> <alt>Tamil ka</alt></image></td><td align="center"><image><graphic source="images/o-tm.gif" width="24" height="19"/><alt>Tamil o</alt></image></td></tr><tr><th align="right">Encoded characters (byte values
				  in hex)</th><td colspan="2" align="center">0B9F</td><td align="center">0BBE</td><td align="center">0B99</td><td align="center">0BCD</td><td align="center">0B95</td><td align="center">0BCB</td></tr><tr><th align="right">Display</th><td colspan="7" align="center"><image><graphic source="images/tango.gif" width="77" height="23"/> <alt>Tango in Tamil letters.</alt></image></td></tr></tbody></table> 
		<caption>Example A.5: Tamil</caption></figure> 
		<p>Here input is straightforward, but note that contrary to the preceding
		  accented Latin example, the diacritic '<image><graphic source="images/virama-tm.gif" width="10" height="19"/> <alt>Tamil virama</alt></image>' (<term>virama</term>, vowel killer) is entered
		  <emph>after</emph> the '<image><graphic source="images/na-tm.gif" width="18" height="18"/><alt>Tamil na</alt></image>' to which it applies. Rendering is
		  interesting for the last two characters. The last one ('<image><graphic source="images/o-tm.gif" width="24" height="19"/><alt>Tamil o</alt></image>')
		  clearly consists of two glyphs which <emph>surround</emph> the glyph of the
		  next to last character ('<image><graphic source="images/ka-tm.gif" width="15" height="19"/> <alt>Tamil ka</alt></image>').</p> 
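		<p>The two-part nature of the last character in Example A.5 is also reflected in the Unicode character data: U+0BCB has a canonical decomposition into two vowel signs. The following sketch, which assumes Python's <code>unicodedata</code> module and uses variable names invented for this example, shows that decomposition. In both forms the characters are stored after the consonant, in logical order, regardless of where the glyphs appear.</p>
<eg xml:space="preserve">
```python
import unicodedata

# Code points taken from the table in Example A.5.
tango = "\u0b9f\u0bbe\u0b99\u0bcd\u0b95\u0bcb"

vowel_sign = tango[-1]  # U+0BCB, the last character
parts = unicodedata.normalize("NFD", vowel_sign)

print([f"{ord(c):04X}" for c in parts])  # ['0BC7', '0BBE']
print(unicodedata.normalize("NFC", parts) == vowel_sign)  # True
```
</eg>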
		<p id="sec-CharExamplesA6">A number of operations routinely performed on
		  text can be impacted by the complexities of the world's writing systems. An
		  example is the operation of selecting text on screen by a pointing device in a
		  bidirectional (bidi) context (see <specref ref="sec-VisualRenderingUnits"/>).
		  Let's have a look at some bidi text, in this case Arabic letters (written
		  right-to-left) mixed with Arabic-Indic digits (left-to-right):</p> 
		<figure><table border="1" cellpadding="5" cellspacing="0"><tbody><tr><th align="right">In memory</th><td colspan="2" align="center"><image><graphic source="images/bidiInMemory.gif" width="310" height="24"/><alt>A sequence of Arabic glyphs running left to right in the illustration. This includes four numeric glyphs at the end (on the right).</alt></image></td></tr><tr><th align="right">On screen</th><td colspan="2" align="center"><image><graphic source="images/sel_arabe1.gif" width="116" height="26"/><alt>The same sequence of glyphs but the character on the far left above is now on the far right, and all characters except the last four (numeric digits) follow in a right to left direction. The four numeric digits now run left to right from the far left hand side of the line.</alt></image></td></tr></tbody></table> 
		<caption>Example A.6: Bidirectional text</caption></figure> 
	 </div1> 
	 <div1 id="sec-Acknowledgements"><head>Acknowledgements</head> 
		<p>Special thanks go to Ian Jacobs for ample help with editing. Tim
		  Berners-Lee and James Clark provided important details in the section on URIs.
		  The W3C I18N WG and IG, as well as others, provided many comments and
		  suggestions.</p> 
	 </div1> 
	 <div1 id="sec-References"><head>References</head> 
		<div2 id="sec-NormativeReferences"><head>Normative
			 References</head><blist> 
			 <bibl id="iana" key="IANA">Internet Assigned Numbers Authority,
				<titleref href="ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets">Official
				Names for Character Sets</titleref>. (See 
				<loc href="ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets">ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets</loc>.)
				</bibl> 
			 <bibl id="iso10646" key="ISO/IEC 10646">ISO/IEC 10646-1:2000,
				<titleref href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819">Information
				technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
				Architecture and Basic Multilingual Plane</titleref>, as, from time to time,
				amended, replaced by a new edition or expanded by the addition of new parts.
				(See 
				<loc href="http://www.iso.ch/">http://www.iso.ch/</loc> for the
				latest version.)</bibl> 
			 <bibl key="ISO/IEC 10646-1:2000" id="iso10646-2000">ISO/IEC
				10646-1:2000,
				<titleref href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819">Information
				technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
				Architecture and Basic Multilingual Plane</titleref>. (See 
				<loc href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819">http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819</loc>.)
				</bibl> 
			 <bibl id="MIME" key="MIME"><titleref href="http://www.ietf.org/rfc/rfc2045.txt">Multipurpose Internet Mail
				Extensions (MIME). Part One: Format of Internet Message Bodies</titleref>, N.
				Freed, N. Borenstein, RFC 2045, November 1996, 
				<loc href="http://www.ietf.org/rfc/rfc2045.txt">http://www.ietf.org/rfc/rfc2045.txt</loc>.
				<titleref>Part Two: Media Types</titleref>, N. Freed, N. Borenstein, RFC 2046,
				November 1996. <titleref>Part Three: Message Header Extensions for Non-ASCII
				Text</titleref>, K. Moore, RFC 2047, November 1996. <titleref>Part Four:
				Registration Procedures</titleref>, N. Freed, J. Klensin, J. Postel, RFC 2048,
				November 1996. <titleref>Part Five: Conformance Criteria and
				Examples</titleref>, N. Freed, N. Borenstein, RFC 2049, November 1996. </bibl> 
			 <bibl id="RFC2070" key="RFC 2070">F. Yergeau, G. Nicol, G. Adams, M.
				Dürst, <titleref href="http://www.ietf.org/rfc/rfc2070.txt">Internationalization of the
				Hypertext Markup Language</titleref>, IETF RFC 2070, January 1997. (See 
				<loc href="http://www.ietf.org/rfc/rfc2070.txt">http://www.ietf.org/rfc/rfc2070.txt</loc>.)
				</bibl> 
			 <bibl id="rfc2119" key="RFC 2119">S. Bradner,
				<titleref href="http://www.ietf.org/rfc/rfc2119.txt">Key words for use in RFCs
				to Indicate Requirement Levels</titleref>, IETF RFC 2119, March 1997. (See 
				<loc href="http://www.ietf.org/rfc/rfc2119.txt">http://www.ietf.org/rfc/rfc2119.txt</loc>.)
				</bibl> 
			 <bibl id="rfc2396" key="RFC 2396">T. Berners-Lee, R. Fielding, L.
				Masinter, <titleref href="http://www.ietf.org/rfc/rfc2396.txt">Uniform Resource
				Identifiers (URI): Generic Syntax</titleref>, IETF RFC 2396, August 1998. (See 
				<loc href="http://www.ietf.org/rfc/rfc2396.txt">http://www.ietf.org/rfc/rfc2396.txt</loc>.)
				</bibl> 
			 <bibl id="rfc2732" key="RFC 2732">R. Hinden, B. Carpenter, L.
				Masinter, <titleref href="http://www.ietf.org/rfc/rfc2732.txt">Format for
				Literal IPv6 Addresses in URL's</titleref>, IETF RFC 2732, December 1999. (See 
				<loc href="http://www.ietf.org/rfc/rfc2732.txt">http://www.ietf.org/rfc/rfc2732.txt</loc>.)
				</bibl> 
			 <bibl id="unicode" key="Unicode">The Unicode Consortium,
				<titleref>The Unicode Standard -- Version 3.0</titleref>, ISBN 0-201-61633-5,
				as updated from time to time by the publication of new versions. (See 
				<loc href="http://www.unicode.org/unicode/standard/versions/">http://www.unicode.org/unicode/standard/versions/</loc>
				for the latest version and additional information on versions of the standard
				and of the Unicode Character Database).</bibl> 
			 <bibl id="unicode30" key="Unicode 3.0">The Unicode Consortium,
				<titleref>The Unicode Standard -- Version 3.0</titleref>, ISBN 0-201-61633-5.
				(See 
				<loc href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html">http://www.unicode.org/unicode/standard/versions/Unicode3.0.html</loc>.)
				</bibl> 
			 <bibl id="UTR15" key="UTR #15">Mark Davis, Martin Dürst,
				<titleref href="http://www.unicode.org/unicode/reports/tr15/">Unicode
				Normalization Forms</titleref>, Unicode Standard Annex #15. (See 
				<loc href="http://www.unicode.org/unicode/reports/tr15/">http://www.unicode.org/unicode/reports/tr15</loc>
				for the latest version).
				<titleref href="http://www.unicode.org/unicode/reports/tr15/tr15-21.html">Version
				3.1.0</titleref> (March 2001) is at 
				<loc href="http://www.unicode.org/unicode/reports/tr15/tr15-21.html">http://www.unicode.org/unicode/reports/tr15/tr15-21.html</loc>.
				</bibl></blist> 
		</div2> 
		<div2 id="sec-OtherReferences"><head>Other References</head><blist> 
			 <bibl id="CharReq" key="CharReq">Martin J. Dürst,
				<titleref href="http://www.w3.org/TR/WD-charreq">Requirements for String
				Identity and Character Indexing Definitions for the WWW</titleref>, W3C Working
				Draft. (See 
				<loc href="http://www.w3.org/TR/WD-charreq">http://www.w3.org/TR/WD-charreq</loc>.)
				</bibl> 
			 <bibl id="connolly" key="Connolly">D. Connolly,
				<titleref href="http://www.w3.org/MarkUp/html-spec/charset-harmful">Character
				Set Considered Harmful</titleref>, W3C Note. (See 
				<loc href="http://www.w3.org/MarkUp/html-spec/charset-harmful">http://www.w3.org/MarkUp/html-spec/charset-harmful</loc>.)</bibl>
			 
			 <bibl id="css2" key="CSS2">Bert Bos, Håkon Wium Lie, Chris Lilley,
				Ian Jacobs, Eds., <titleref href="http://www.w3.org/TR/REC-CSS2/">Cascading
				Style Sheets, level 2</titleref> (CSS2 Specification), W3C Recommendation. (See
				<xspecref href="http://www.w3.org/TR/REC-CSS2/">http://www.w3.org/TR/REC-CSS2</xspecref>.)
				</bibl> 
			 <bibl id="dom1" key="DOM Level 1">Vidur Apparao et al.,
				<titleref href="http://www.w3.org/TR/REC-DOM-Level-1/">Document Object Model
				(DOM) Level 1 Specification</titleref>, W3C Recommendation. (See
				<xspecref href="http://www.w3.org/TR/REC-DOM-Level-1/">http://www.w3.org/TR/REC-DOM-Level-1/</xspecref>.)
				</bibl> 
			 <bibl id="html40" key="HTML 4.0">Dave Raggett, Arnaud Le Hors, Ian
				Jacobs, Eds., <titleref href="http://www.w3.org/TR/REC-html40-971218/">HTML 4.0
				Specification</titleref>, W3C Recommendation, 18-Dec-1997. (See
				<xspecref href="http://www.w3.org/TR/REC-html40-971218/">http://www.w3.org/TR/REC-html40-971218/</xspecref>.)</bibl>
			 
			 <bibl id="html401" key="HTML 4.01">Dave Raggett, Arnaud Le Hors, Ian
				Jacobs, Eds., <titleref href="http://www.w3.org/TR/html401/">HTML 4.01
				Specification</titleref>, W3C Recommendation, 24-Dec-1999. (See
				<xspecref href="http://www.w3.org/TR/html401/">http://www.w3.org/TR/html401/</xspecref>.)
				</bibl> 
			 <bibl id="uri-i18n" key="I-D URI-I18N">Larry Masinter, Martin Dürst,
				<titleref href="http://www.w3.org/International/2001/draft-masinter-url-i18n-08.txt">Internationalized
				Resource Identifiers (IRI)</titleref>, Internet-Draft, November 2001. (See 
				<loc href="http://www.w3.org/International/2001/draft-masinter-url-i18n-08.txt">http://www.w3.org/International/2001/draft-masinter-url-i18n-08.txt</loc>.)</bibl>
			 
			 <bibl id="i18nuri" key="Info URI-I18N"><titleref href="http://www.w3.org/International/O-URL-and-ident">Internationalization:
				URIs and other identifiers</titleref>. (See 
				<loc href="http://www.w3.org/International/O-URL-and-ident">http://www.w3.org/International/O-URL-and-ident</loc>.)
				</bibl> 
			 <bibl id="iso9541" key="ISO/IEC 9541-1">ISO/IEC 9541-1:1991, 
				<loc href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=17277">Information
				  technology -- Font information interchange -- Part 1: Architecture</loc>. (See 
				<loc href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=17277">http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=17277</loc>
				for the latest version.) </bibl> 
			 <bibl id="mathml2" key="MathML2">David Carlisle, Patrick Ion, Robert
				Miner, Nico Poppelier, Eds., <titleref href="http://www.w3.org/TR/MathML2/">Mathematical Markup Language (MathML)
				Version 2.0</titleref>, W3C Recommendation, 21 February 2001. (See 
				<loc href="http://www.w3.org/TR/MathML2/">http://www.w3.org/TR/MathML2/</loc>.)
				</bibl> 
			 <bibl id="Nicol" key="Nicol">Gavin Nicol,
				<titleref href="http://www.mind-to-mind.com/i18n/articles/multilingual/multilingual-www.html">The
				Multilingual World Wide Web</titleref>, Chapter 2: The WWW As A Multilingual
				Application. (See 
				<loc href="http://www.mind-to-mind.com/i18n/articles/multilingual/multilingual-www.html">http://www.mind-to-mind.com/i18n/articles/multilingual/multilingual-www.html</loc>.)
				</bibl> 
			 <bibl id="rfc2070" key="RFC 2070">F. Yergeau, G. Nicol, G. Adams, M.
				Dürst, <titleref href="http://www.ietf.org/rfc/rfc2070.txt">Internationalization of the
				Hypertext Markup Language</titleref>, IETF RFC 2070, January 1997. (See 
				<loc href="http://www.ietf.org/rfc/rfc2070.txt">http://www.ietf.org/rfc/rfc2070.txt</loc>.)</bibl>
			 
			 <bibl id="rfc2277" key="RFC 2277">H. Alvestrand,
				<titleref href="http://www.ietf.org/rfc/rfc2277.txt">IETF Policy on Character
				Sets and Languages</titleref>, IETF RFC 2277, BCP 18, January 1998. (See 
				<loc href="http://www.ietf.org/rfc/rfc2277.txt">http://www.ietf.org/rfc/rfc2277.txt</loc>.)
				</bibl> 
			 <bibl id="rfc2279" key="RFC 2279">F. Yergeau,
				<titleref href="http://www.ietf.org/rfc/rfc2279.txt">UTF-8, a transformation
				format of ISO 10646</titleref>, IETF RFC 2279, January 1998. (See 
				<loc href="http://www.ietf.org/rfc/rfc2279.txt">http://www.ietf.org/rfc/rfc2279.txt</loc>.)
				</bibl> 
			 <bibl key="RFC 2718" id="rfc2718">L. Masinter, H. Alvestrand, D.
				Zigmond, R. Petke, <titleref href="http://www.ietf.org/rfc/rfc2718.txt">Guidelines for new URL
				Schemes</titleref>, IETF RFC 2718, November 1999. (See 
				<loc href="http://www.ietf.org/rfc/rfc2718.txt">http://www.ietf.org/rfc/rfc2718.txt</loc>.)</bibl>
			 
			 <bibl id="rfc2781" key="RFC 2781">P. Hoffman, F. Yergeau,
				<titleref href="http://www.ietf.org/rfc/rfc2781.txt">UTF-16, an encoding of ISO
				10646</titleref>, IETF RFC 2781, February 2000. (See 
				<loc href="http://www.ietf.org/rfc/rfc2781.txt">http://www.ietf.org/rfc/rfc2781.txt</loc>.)</bibl>
			 
			 <bibl id="spread" key="SPREAD"><titleref href="http://www.ascc.net/xml/resource/entities/index.html">SPREAD -
				Standardization Project for East Asian Documents Universal Public Entity
				Set</titleref>. (See 
				<loc href="http://www.ascc.net/xml/resource/entities/index.html">http://www.ascc.net/xml/resource/entities/index.html</loc>.)
				</bibl> 
			 <bibl id="svg" key="SVG">Jon Ferraiolo, Ed.,
				<titleref href="http://www.w3.org/TR/SVG/">Scalable Vector Graphics (SVG) 1.0
				Specification</titleref>, W3C Recommendation, 4 September 2001. (See 
				<loc href="http://www.w3.org/TR/SVG/">http://www.w3.org/TR/SVG/</loc>.) </bibl> 
			 <bibl id="UTR17" key="UTR #17">Ken Whistler, Mark Davis,
				<titleref href="http://www.unicode.org/unicode/reports/tr17/">Character
				Encoding Model</titleref>, Unicode Technical Report #17. (See 
				<loc href="http://www.unicode.org/unicode/reports/tr17/">http://www.unicode.org/unicode/reports/tr17/</loc>.)
				</bibl> 
			 <bibl id="UXML" key="UXML">Martin Dürst and Asmus Freytag,
				<titleref href="http://www.w3.org/TR/unicode-xml/">Unicode in XML and other
				Markup Languages</titleref>, Unicode Technical Report #20 and W3C Note. (See
				<xspecref href="http://www.w3.org/TR/unicode-xml/">http://www.w3.org/TR/unicode-xml</xspecref>.)</bibl>
			 
			 <bibl id="xlink" key="XLink">Steve DeRose, Eve Maler, David Orchard,
				Eds., <titleref href="http://www.w3.org/TR/xlink/">XML Linking Language (XLink)
				Version 1.0</titleref>, W3C Recommendation, 27 June 2001. (See 
				<loc href="http://www.w3.org/TR/xlink/">http://www.w3.org/TR/xlink/</loc>.) </bibl> 
			 <bibl id="xml10" key="XML 1.0">Tim Bray, Jean Paoli, C. M.
				Sperberg-McQueen, Eve Maler, Eds.,
				<titleref href="http://www.w3.org/TR/REC-xml">Extensible Markup Language (XML)
				1.0</titleref>, W3C Recommendation. (See
				<xspecref href="http://www.w3.org/TR/REC-xml">http://www.w3.org/TR/REC-xml</xspecref>.)
				</bibl> 
			 <bibl key="XML Schema-2" id="xmlschema-2">Paul V. Biron, Ashok
				Malhotra, Eds., <titleref href="http://www.w3.org/TR/xmlschema-2/">XML Schema
				Part 2: Datatypes</titleref>, W3C Recommendation. (See
				<xspecref href="http://www.w3.org/TR/xmlschema-2/">http://www.w3.org/TR/xmlschema-2</xspecref>.)</bibl>
			 
			 <bibl id="XML_Japanese_profile" key="XML Japanese Profile">MURATA
				Makoto, Ed., <titleref href="http://www.w3.org/TR/japanese-xml/">XML Japanese
				Profile</titleref>, W3C Note. (See 
				<loc href="http://www.w3.org/TR/japanese-xml/">http://www.w3.org/TR/japanese-xml/</loc>.)
				</bibl> 
			 <bibl id="xpath" key="XPath">James Clark, Steve DeRose, Eds.,
				<titleref href="http://www.w3.org/TR/xpath">XML Path Language (XPath) Version
				1.0</titleref>, W3C Recommendation, 16 November 1999. (See
				<xspecref href="http://www.w3.org/TR/xpath">http://www.w3.org/TR/xpath</xspecref>.)</bibl>
			 
			 <bibl id="xpointer" key="XPointer">Steve DeRose, Eve Maler, Ron
				Daniel Jr., Eds., <titleref href="http://www.w3.org/TR/2001/CR-xptr-20010911/">XML Pointer Language
				(XPointer) Version 1.0</titleref>, W3C Candidate Recommendation, 11 September
				2001. (See <xspecref href="http://www.w3.org/TR/2001/CR-xptr-20010911/">http://www.w3.org/TR/2001/CR-xptr-20010911/</xspecref>.)
				</bibl></blist> 
		</div2> 
	 </div1> 
	 <div1 id="sec-Changes"><head>Change Log (Non-Normative)</head> 
		<div2 id="sec-Changes19991129"><head>Changes since 
			 <loc href="http://www.w3.org/TR/2001/WD-charmod-20010928/">http://www.w3.org/TR/2001/WD-charmod-20010928</loc></head>
		  
		  <p>Replaced much of chapter 8 content with references to
			 <bibref ref="uri-i18n"/>.</p> 
		  <p>Made numerous further changes listed in
			 <titleref href="http://www.w3.org/International/Group/charmod-lc/">Character
			 Model for the World Wide Web 1.0 Last Call Comments</titleref> (Members
			 only).</p> 
		  <p>Converted to XHTML with UTF-8 encoding.</p> 
		</div2> 
		<div2 id="sec-Changes20010126"><head>Changes since 
			 <loc href="http://www.w3.org/TR/2001/WD-charmod-20010126/">http://www.w3.org/TR/2001/WD-charmod-20010126</loc></head>
		  
		  <p>Normalization: changed from <quote>recipients <rfc2119>MUST
			 NOT</rfc2119> normalize</quote> to <quote>recipients <rfc2119>MUST</rfc2119> check and
			 reject un-normalized data</quote>.</p> 
		  <p>Clarified conformance model, in particular introduced [S][I][C]
			 specifiers for requirements.</p> 
		  <p>Made numerous other changes listed in
			 <titleref href="http://www.w3.org/International/Group/charmod-lc/">Character
			 Model for the World Wide Web 1.0 Last Call Comments</titleref> (Members
			 only).</p> 
		  <p>Fixed countless typos and unclear/ambiguous sentences.</p> 
		  <p>Updated references.</p> 
		</div2> 
	 </div1> 
  </back></spec> 
