Character Model for the World Wide Web

Normalization

This document is in two sections. The first section is verbatim from CharMod-Norm and consists of the requirements sections at the end. The second section is the proposed replacement recommendations, plus any explanatory text.

Internationalization WG members are invited to insert comments during the review period. Please make comments between paragraphs (following the paragraph you are commenting on). Insert two spaces and then follow with your name/id enclosed in slash characters. It'll look like this when you're done:

 // addison: this is an example of how a comment might look when you're done with it.

This page is live and edits will happen more-or-less continuously as we work through it.

I have started to number the proposed recommendations. This is for reference.

Requirements in both sections can apply to specifications ([S]), implementations ([I]), or content ([C]), or any combination of these. Requirements have the bracketed letter shown next to the number to denote which type(s) of requirement they are.

CAPITALIZED words are, of course, RFC2119 keywords with their usual meaning. Lowercase words that happen to be listed in RFC2119 are not RFC2119 keywords, although an effort has been made to avoid using these words outside of a normative context.

Original Recommendations

C300 [C] Text content SHOULD be in fully-normalized form and if not SHOULD at least be in include-normalized form.

C301 [S] Specifications of text-based formats and protocols SHOULD, as part of their syntax definition, require that the text be in normalized form.

C302 [S] [I] A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.

C303 [I] A text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.

EXAMPLE: If the 'z' is deleted from the (normalized) string cz¸ (where '¸' represents a combining cedilla, U+0327), normalization is necessary to turn the denormalized result c¸ into the properly normalized ç. If the software that deletes the 'z' later uses the string in a normalization-sensitive operation, it needs to normalize the string before this operation to ensure correctness; otherwise, normalization may be deferred until the data is exposed. Analogous cases exist for insertion and concatenation (e.g. xf:concat(xf:substring('cz¸', 1, 1), xf:substring('cz¸', 3, 1)) in XQuery [XQuery Operators]).
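The deletion case can be reproduced with a short Python sketch. It assumes the standard-library unicodedata module and uses NFC as a stand-in for the fully-normalized form; the variable names are illustrative only.

```python
import unicodedata

CEDILLA = "\u0327"  # COMBINING CEDILLA

original = "cz" + CEDILLA           # c, z, combining cedilla
edited = original.replace("z", "")  # delete the 'z', leaving c + combining cedilla

# Re-normalizing the edited text composes c + cedilla into U+00E7.
renormalized = unicodedata.normalize("NFC", edited)

# True when a string is unchanged by NFC, i.e. already normalized.
original_was_normalized = original == unicodedata.normalize("NFC", original)
edited_is_normalized = edited == renormalized
```

Here the original string is already in NFC, the edited string is not, and re-normalizing it yields the single code point U+00E7.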

NOTE: Software that denormalizes a string such as in the deletion example above does not need to perform a potentially expensive re-normalization of the whole string to ensure that the string is normalized. It is sufficient to go back to the last non-composing character and re-normalize forward to the next non-composing character; if the string was normalized before the denormalizing operation, it will now be re-normalized.
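A minimal sketch of the local re-normalization described in the NOTE follows. It is an approximation, not a complete algorithm: it treats any character with canonical combining class 0 as "non-composing", which ignores characters such as Hangul jamo that have class 0 but still compose with a preceding character. The function name renormalize_around and its parameters are hypothetical.

```python
import unicodedata

def renormalize_around(text: str, index: int, form: str = "NFC") -> str:
    """Re-normalize only the neighbourhood of an edit at `index`."""
    # Walk back to the nearest preceding character with combining class 0.
    start = min(index, len(text))
    while start > 0:
        start -= 1
        if unicodedata.combining(text[start]) == 0:
            break
    # Walk forward past any combining marks that follow the edit point.
    end = min(index, len(text))
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    # Normalize just that window and splice it back in.
    window = unicodedata.normalize(form, text[start:end])
    return text[:start] + window + text[end:]

# After deleting 'z' from "abcz" + U+0327 + "d", the edit point is index 3.
repaired = renormalize_around("abc\u0327d", 3)
```

Only the window from the 'c' up to (but not including) the 'd' is passed to the normalizer; the rest of the string is copied unchanged.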

C304 [S] Specifications of text-based languages and protocols SHOULD define precisely the construct boundaries necessary to obtain a complete definition of full-normalization. These definitions SHOULD include at least the boundaries between markup and character data as well as entity boundaries (if the language has an include mechanism). These definitions SHOULD include any other boundary that may create denormalization when instances of the language are processed, but SHOULD NOT include character escapes designed to express arbitrary characters.

C305 [C] Even when authoring in a (formal) language that does not mandate full-normalization, content developers SHOULD avoid composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup.

C310 [S] [I] Specifications and implementations MUST document any deviation from the above requirements.

C311 [S] Specifications MUST document any known security issues related to normalization.

C312 [S] [I] String identity matching MUST be performed as if the following steps were followed:

Early uniform normalization to fully-normalized form, as defined in 3.2.4 Fully-normalized text. In accordance with section 3 Normalization, this step MUST be performed by the producers of the strings to be compared.

Conversion to a common Unicode encoding form, if necessary.

Expansion of all recognized character escapes and includes.

Testing for bit-by-bit identity.

Step 1 ensures 1) that the identity matching process can produce correct results using the next three steps and 2) that a minimum of effort is spent on solving the problem.

NOTE: The expansion of character escapes and includes (step 3 above) is dependent on context, i.e. on which markup or programming language is considered to apply when the string matching operation is performed. Consider a search for the string 'suçon' in an XML document containing su&#xE7;on but not suçon. If the search is performed in a plain text editor, the context is plain text (no markup or programming language applies), the &#xE7; character escape is not recognized, hence not expanded and the search fails. If the search is performed in an XML browser, the context is XML, the character escape (defined by XML) is expanded and the search succeeds.

An intermediate case would be an XML editor that purposefully provides a view of an XML document with entity references left unexpanded. In that case, a search over that pseudo-XML view will deliberately not expand entities: in that particular context, entity references are not considered includes and need not be expanded.
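The context dependence can be illustrated with a small Python sketch. It assumes html.unescape from the standard library as a stand-in for an XML processor's escape expansion; a real XML context would use an XML parser, and the variable names are illustrative.

```python
import html

document = "su&#xE7;on"  # source text containing a numeric character reference
query = "su\u00e7on"     # the search string 'suçon'

# Plain-text context: no escape syntax applies, so nothing is expanded.
found_as_plain_text = query in document

# Markup context: expand recognized escapes before searching.
found_as_markup = query in html.unescape(document)
```

The same document and query give a failed search in the plain-text context and a successful one once the escape is expanded.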

C313 [S] [I] Forms of string matching other than identity matching SHOULD be performed as if the following steps were followed:

Steps 1 to 3 for string identity matching.
Matching the strings in a way that is appropriate to the application.

Appropriate methods of matching text outside of string identity matching can include such things as case-insensitive matching, accent-insensitive matching, matching characters against Unicode compatibility forms, expansion of abbreviations, matching of stemmed words, phonetic matching, etc.

EXAMPLE: A user who specifies a search for the string suçon against a Unicode encoded XML document would expect to find string identity matches against the strings suçon, su&#xE7;on and su&ccedil;on (where the entity &ccedil; represents the precomposed character 'ç'). Identity matches should also be found whether the string was encoded as 73 75 C3 A7 6F 6E (in UTF-8) or 0073 0075 00E7 006F 006E (in UTF-16), or any other character encoding that can be transcoded into normalized Unicode characters.

It should never be the case that a match would be attempted against strings such as suçon or suc¸on since these are not fully-normalized and should cause the text to be rejected. If, however, matching is done against such strings they should also match since they are canonically equivalent.

Forms of matching other than identity, if supported by the application, would have to be used to produce a match against the following strings: SUÇON (case-insensitive matching), sucon (accent-insensitive matching), suçons (matched stems), suçant (phonetic matching), etc.
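The identity matches in the example, and two of the looser forms of matching, can be sketched in Python as follows. This is an illustration under assumptions: html.unescape stands in for the expansion of escapes and of the named entity (which a real XML document would have to declare), and the helper names decode_and_expand and strip_accents are hypothetical. Stemmed and phonetic matching are not shown.

```python
import html
import unicodedata

def decode_and_expand(data: bytes, encoding: str) -> str:
    """Convert to a common form (a Python str), then expand escapes."""
    return html.unescape(data.decode(encoding))

def strip_accents(text: str) -> str:
    """Decompose, then drop characters with a non-zero combining class."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

target = "su\u00e7on"
utf8_bytes = bytes.fromhex("73 75 C3 A7 6F 6E")
utf16_bytes = bytes.fromhex("0073 0075 00E7 006F 006E")

identity_matches = [
    decode_and_expand(utf8_bytes, "utf-8") == target,
    decode_and_expand(utf16_bytes, "utf-16-be") == target,
    decode_and_expand(b"su&#xE7;on", "ascii") == target,
    decode_and_expand(b"su&ccedil;on", "ascii") == target,
]

# Matching other than identity: case-insensitive and accent-insensitive.
case_insensitive = "SU\u00c7ON".casefold() == target.casefold()
accent_insensitive = strip_accents(target) == "sucon"
```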

DRAFT Recommendations

C001 [S] Specifications SHOULD NOT require documents to be stored in or transcoded to any particular normalization form.

C002 [S] Specifications MAY require specific fields or values within a document to be normalized. Such a requirement MUST include the normalization form to be used. NFC is RECOMMENDED as a default.

C003 [S] Specifications MAY require that specific operations (such as string comparison) be "normalizing" operations. Such a specification MUST define whether the normalization uses a canonical (NFC/NFD) form or a compatibility (NFKC/NFKD) form.
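The practical difference between canonical and compatibility forms can be seen in a short Python sketch using the standard-library unicodedata module. The sample characters are chosen for illustration only.

```python
import unicodedata

ligature = "\ufb01"     # LATIN SMALL LIGATURE FI
decomposed = "c\u0327"  # c followed by COMBINING CEDILLA

# Canonical normalization leaves the ligature alone.
canonical = unicodedata.normalize("NFC", ligature)
# Compatibility normalization folds it to the two letters "fi".
compatibility = unicodedata.normalize("NFKC", ligature)

# Both forms compose c + cedilla into U+00E7.
composed_nfc = unicodedata.normalize("NFC", decomposed)
composed_nfkc = unicodedata.normalize("NFKC", decomposed)
```

A comparison defined in terms of NFKC would therefore treat the ligature and "fi" as equal, while one defined in terms of NFC would not, which is why the form has to be stated.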

C004 [S] Specifications defining namespaces or identifiers SHOULD require matching and comparison of names and identifiers to be normalizing operations. Requirements set forth in UAX#31 are RECOMMENDED when defining acceptable values for namespaces and identifiers.

 UAX#31 is "Unicode Standard Annex #31: Unicode Identifier and Pattern Syntax"

A specification is considered to be in compliance with this requirement if it defines the full set of valid values such that only normalized name constructs are possible. For example, a specification that defines a set of identifiers, all of whose names are constructed from US-ASCII characters, is in compliance because all of the names are normalized.

Note that "namespaces and identifiers" may include element names, attribute names, and attribute values in various XML-based markup languages.
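As a rough illustration of both points, the Python sketch below checks that an ASCII-only name is unchanged by every normalization form and compares two identifiers with a normalizing operation. The function names are hypothetical, and NFC is assumed as the comparison form.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def is_stable_under_normalization(name: str) -> bool:
    """True if no normalization form changes the name."""
    return all(unicodedata.normalize(form, name) == name for form in FORMS)

def same_identifier(a: str, b: str, form: str = "NFC") -> bool:
    """Normalizing comparison: normalize both names, then compare."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

precomposed = "caf\u00e9"   # e with acute as one code point
combining = "cafe\u0301"    # e followed by COMBINING ACUTE ACCENT

ascii_is_stable = is_stable_under_normalization("xml:lang")
raw_equal = precomposed == combining
normalized_equal = same_identifier(precomposed, combining)
```

The ASCII name is stable, the two spellings of the non-ASCII name differ code point by code point, and the normalizing comparison treats them as the same identifier.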

C005 [C] Textual content SHOULD be created, stored, and interchanged in Unicode Normalization Form C (NFC), except where it interferes with the author's intentions. give examples

 // kojiishi: NFC/NFD can break CJK Compatibility Ideographs (U+F900-FAFF, U+2F800-2FAFF). One example is U+FA19 神 becomes U+795E 神.
 // kojiishi: Modified NFD in Apple HFS+ excludes U+2000-2FFF in addition to CJK Compatibility Ideographs. Still under investigation.

The above requirement is intended to promote the (continued) use of normalized content forms except when there is a good reason not to use one. Some languages (such as Vietnamese) often need to use denormalized formats due to input, encoding, or storage issues. In other cases, authors may choose specific characters with NFC mappings for presentational or demonstrative reasons. Most content in most languages--even content in legacy character encodings--is already in NFC.
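The CJK Compatibility Ideograph case raised in the comments above is one concrete way normalization can interfere with an author's choice of character. A small Python check, assuming the standard-library unicodedata module, shows the mapping:

```python
import unicodedata

compat_ideograph = "\ufa19"  # CJK COMPATIBILITY IDEOGRAPH-FA19
unified_ideograph = "\u795e"

# Both canonical forms replace the compatibility ideograph.
nfc = unicodedata.normalize("NFC", compat_ideograph)
nfd = unicodedata.normalize("NFD", compat_ideograph)

changed_by_nfc = nfc != compat_ideograph
```

Because the mapping is one-way, the original code point cannot be recovered from the normalized text.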

C006 [S] Specifications MUST define how to parse or process document formats such that non-normalized content is handled correctly.

define

A "normalization-sensitive operation" is one whose results may differ when normalization is applied to content.

define

A "normalizing operation" is one whose results are normalization-sensitive and which fully-normalizes the text on which it operates.
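Two everyday operations that are normalization-sensitive in this sense are length counting and code point comparison. The Python sketch below, using the standard-library unicodedata module, shows their results changing with the form of the input; turning the comparison into a normalizing operation removes the difference.

```python
import unicodedata

composed = "su\u00e7on"                              # 5 code points
decomposed = unicodedata.normalize("NFD", composed)  # s u c U+0327 o n

# Normalization-sensitive: results depend on the code points present.
lengths = (len(composed), len(decomposed))
equal_as_stored = composed == decomposed

# Normalizing: both operands are normalized before the comparison.
equal_when_normalizing = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
```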

C007 [I] Except when performing a normalizing operation, a text-processing component MUST NOT normalize text when parsing, storing, or processing content. The results of any such operation are dependent upon the code points encoded and, as a result, visually and semantically identical strings might be considered distinct.

C008 [S] Specifications SHOULD NOT require any normalizing operation on text displayed to the user; internally stored values MAY be normalized as long as they are not later displayed.

C009 [I] Any implementation of a normalizing operation SHOULD normalize the text internally rather than modifying the original content. The results of each step in the operation MUST behave as if the original text had been normalized from the outset. Private agreements MAY be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.
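One way to read "normalize the text internally" is a lookup structure that keys on the normalized form but hands back the stored text untouched. The class below is a hypothetical sketch of that idea in Python, with NFC assumed as the internal form; it is not taken from any specification.

```python
import unicodedata

class NormalizingIndex:
    """Looks text up by normalized key while preserving the original."""

    def __init__(self, form: str = "NFC"):
        self._form = form
        self._entries = {}

    def add(self, text: str) -> None:
        # Only the key is normalized; the stored value is the original text.
        self._entries[unicodedata.normalize(self._form, text)] = text

    def find(self, query: str):
        # The query is normalized on a copy; the caller's string is untouched.
        return self._entries.get(unicodedata.normalize(self._form, query))

index = NormalizingIndex()
index.add("cafe\u0301")            # stored in decomposed form
hit = index.find("caf\u00e9")      # queried in precomposed form
```

The lookup succeeds even though the query and the stored text use different code point sequences, and the value returned is the decomposed original.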

C010 [S] Specifications of text-based languages and protocols SHOULD define precisely the construct boundaries necessary to process the language, protocol, or document format in the absence of normalized textual content. These definitions SHOULD include at least the boundaries between markup and character data as well as entity boundaries (if the language has an include mechanism), SHOULD include any other boundary that may create denormalization when instances of the language are processed, but SHOULD NOT include character escapes designed to express arbitrary characters.

Even when authoring in a (formal) language that does not mandate full-normalization, content developers SHOULD avoid composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup.

Authoring tool implementations for a (formal) language that does not mandate full-normalization SHOULD either prevent users from creating content with composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup, or SHOULD warn users when they do so.
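An authoring tool could implement the warning with a check along the following lines. This is a sketch only: it approximates "composing character" by Unicode general category M (marks), which is narrower than the definition used elsewhere in this document, and the function name is hypothetical.

```python
import unicodedata

def starts_with_combining_mark(construct: str) -> bool:
    """True if the first character of the construct is a mark (Mn, Mc, Me)."""
    if not construct:
        return False
    return unicodedata.category(construct[0]).startswith("M")

# Content of an included entity that begins with a combining cedilla.
should_warn = starts_with_combining_mark("\u0327a va")
```

A tool would run such a check on entity replacement text and on character data immediately following markup.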

Implementations which transcode text from a legacy encoding to a Unicode encoding form SHOULD use a normalizing transcoder.

NOTE: Except when an encoding's repertoire contains characters not represented in Unicode, it is always possible to construct a normalizing transcoder by using any transcoder followed by a normalizer.
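The construction in the NOTE, a transcoder followed by a normalizer, is short enough to sketch directly in Python. The function name is hypothetical; windows-1258 (Python codec name cp1258) is used as the sample legacy encoding because it represents Vietnamese tone marks as separate combining characters, and NFC is assumed as the target form.

```python
import unicodedata

def normalizing_transcode(data: bytes, legacy_encoding: str, form: str = "NFC") -> str:
    """Decode with an ordinary transcoder, then normalize the result."""
    return unicodedata.normalize(form, data.decode(legacy_encoding))

# Sample legacy data: 'a' followed by the combining acute accent.
legacy_bytes = "a\u0301".encode("cp1258")

plain = legacy_bytes.decode("cp1258")                     # two code points
normalized = normalizing_transcode(legacy_bytes, "cp1258")  # one code point
```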

Specifications of API components (functions/methods) MAY optionally require that normalization is performed for normalization-sensitive operations; the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off.

EXAMPLE: The concatenation operation may either concatenate sequences of codepoints without normalization at the boundary, or may take normalization into account to avoid producing unnormalized output from normalized input. An API specification must define whether the operation normalizes at the boundary or leaves that responsibility to the application using the API.
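A hypothetical API following the pattern above, with normalization on by default and an explicit switch to turn it off, might look like this in Python. For simplicity the sketch normalizes the whole result rather than only the boundary, and NFC is assumed as the default form.

```python
import unicodedata

def concat(left: str, right: str, normalize: bool = True, form: str = "NFC") -> str:
    """Concatenate two strings, normalizing the result unless told not to."""
    joined = left + right
    return unicodedata.normalize(form, joined) if normalize else joined

with_normalization = concat("c", "\u0327")                      # composes to U+00E7
without_normalization = concat("c", "\u0327", normalize=False)  # left as two code points
```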

Specifications that define a mechanism (for example an API or a defining language) for producing textual data objects MAY require that the final output of this mechanism be normalized.

Specifications and implementations MUST document any deviation from the above requirements.

Specifications MUST document any known security issues related to normalization.

String Identity Matching

S001 [S] [I] String identity matching can be "normalizing" or "non-normalizing". For both forms, string identity matching MUST follow these steps:

Conversion to a common Unicode encoding form, if necessary.
Expansion of all recognized character escapes and includes.
Testing for bit-by-bit identity.

For normalizing comparison, such as recommended for namespaces and identifiers, the steps are slightly different:

Conversion to a common Unicode encoding form, if necessary.
Expansion of all recognized character escapes and includes.
Normalization of the resulting Unicode code point sequence to a Unicode normalization form.
Testing for bit-by-bit identity.

Note that the normalized strings are not required to be stored or written back. Fast normalization checking is permitted and is an appropriate performance improvement.
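The two step lists above can be sketched as a single Python function with a flag for the normalizing variant. Several things here are assumptions made for illustration: the function name and parameters are hypothetical, html.unescape stands in for whatever escape syntax the format recognizes, Python str equality stands in for bit-by-bit identity in a common encoding form, and unicodedata.is_normalized (Python 3.8 or later) plays the role of the fast normalization check.

```python
import html
import unicodedata

def identity_match(a: bytes, a_encoding: str, b: bytes, b_encoding: str,
                   normalizing: bool = False, form: str = "NFC") -> bool:
    # Conversion to a common encoding form.
    x, y = a.decode(a_encoding), b.decode(b_encoding)
    # Expansion of recognized character escapes.
    x, y = html.unescape(x), html.unescape(y)
    if normalizing:
        # Quick check first; normalize only when the text needs it.
        if not unicodedata.is_normalized(form, x):
            x = unicodedata.normalize(form, x)
        if not unicodedata.is_normalized(form, y):
            y = unicodedata.normalize(form, y)
    # Identity test on the resulting code point sequences.
    return x == y

composed = "su\u00e7on".encode("utf-8")
decomposed = "suc\u0327on".encode("utf-16-le")

non_normalizing_result = identity_match(composed, "utf-8", decomposed, "utf-16-le")
normalizing_result = identity_match(composed, "utf-8", decomposed, "utf-16-le",
                                    normalizing=True)
```

The normalized copies exist only inside the function; neither input is stored or written back in normalized form.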

S002 [S] [I] Forms of string matching other than identity matching SHOULD be performed as if the following steps were followed:

Steps 1 to 3 for string identity matching.
Matching the strings in a way that is appropriate to the application.