CharmodNormProposal2013


General Requirements

This document contains ONLY the requirements from Charmod-Norm. The actual text is located at:

Editor's Unofficial Copy: [ http://inter-locale.com/w3c/charmod-norm-1.1-draft.html ]

All content

[C] Text content SHOULD be stored and exchanged in Unicode Normalization Form C (NFC).

NOTE: To be processed correctly, content must use a consistent sequence of code points to represent text. Content can be in any normalization form, or can use a de-normalized (but valid) Unicode character sequence, but inconsistent representation will cause implementations to treat canonically equivalent sequences as "different". The best way to ensure consistent selection, access, extraction, processing, or display is to always use NFC.
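As an illustration of the note above, the following minimal Python sketch shows two canonically equivalent sequences that compare as different until both are converted to NFC. The helper name to_nfc is hypothetical; unicodedata.normalize is the standard library call.

```python
import unicodedata

composed = "\u00e9"       # "é" as a single precomposed code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

print(composed == decomposed)                  # False: different code point sequences
print(to_nfc(composed) == to_nfc(decomposed))  # True: identical once both are NFC
```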

[I] Implementations which transcode text from a legacy encoding to a Unicode encoding form SHOULD use a normalizing transcoder that produces Unicode Normalization Form C (NFC).
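A normalizing transcoder could be sketched as a decode step followed by NFC normalization. The function name transcode_to_nfc is hypothetical, and windows-1258 (Python codec "cp1258") is used only as an example of a legacy encoding that represents some accented letters with combining marks.

```python
import unicodedata

def transcode_to_nfc(data: bytes, legacy_encoding: str) -> str:
    """Decode legacy bytes, then normalize the result to NFC."""
    return unicodedata.normalize("NFC", data.decode(legacy_encoding))

# Build sample legacy bytes for "a" + COMBINING ACUTE ACCENT.
legacy_bytes = "a\u0301".encode("cp1258")

plain = legacy_bytes.decode("cp1258")               # two code points
normalized = transcode_to_nfc(legacy_bytes, "cp1258")  # one code point, U+00E1

print(len(plain), len(normalized))
```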

[C] Content developers SHOULD avoid composing characters at the beginning of constructs that might be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup.
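The hazard can be shown with a short Python sketch: when content that begins with a combining mark is placed immediately after markup and the result is later normalized, the mark can compose with the markup delimiter itself. The helper starts_with_combining_mark is hypothetical and only approximates "composing character" by testing for Unicode General Category M*.

```python
import unicodedata

def starts_with_combining_mark(text: str) -> bool:
    """Approximate check: True if the first code point is a mark (category M*)."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")

tag = "<b>"
content = "\u0338 text"   # begins with COMBINING LONG SOLIDUS OVERLAY

# Normalizing the combined string fuses ">" + U+0338 into U+226F (NOT GREATER-THAN).
merged = unicodedata.normalize("NFC", tag + content)

print(merged.startswith(tag))               # False: the closing ">" is gone
print(starts_with_combining_mark(content))  # True: content should be flagged
```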

Formal Languages (markup or programming languages)

[C] Identifiers in content SHOULD use consistent case (upper, lower, mixed case) to facilitate matching.

[S] Specifications of text-based formats and protocols MAY specify that the format or protocol requires content to be in Unicode Normalization Form C (NFC).

NOTE: Specifying NFC requires additional care on the part of the spec developer, as content on the Web generally is not in a known normalization state. Boundary and error conditions for denormalized content ought to be carefully considered and well specified in these cases.

[S] Specifications SHOULD NOT specify case-insensitive comparison of strings.

[S] Specifications that specify case-insensitive comparison SHOULD specify Unicode C+F case folding.

In some limited cases, locale- or language-specific tailoring might also be appropriate. However, such cases are generally linked to natural language processing operations. Because they produce potentially different results from the generic case folding rules, these should be avoided in formal languages, where predictability is at a premium.
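As a rough illustration of generic (untailored) case folding, Python's built-in str.casefold applies Unicode full case folding without any locale tailoring. The function name unicf_equal is hypothetical, and this sketch does not perform any normalization.

```python
def unicf_equal(a: str, b: str) -> bool:
    """Compare two strings after locale-independent Unicode case folding."""
    return a.casefold() == b.casefold()

# Full case folding maps "ß" to "ss"; simple lowercasing does not.
print(unicf_equal("Straße", "STRASSE"))        # True
print("Straße".lower() == "STRASSE".lower())   # False
```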

[S][I] Specifications and implementations that define string matching as part of the definition of a format, protocol, or formal language (which might include operations such as parsing, matching, tokenizing, etc.) MUST define the criteria and matching forms used. These MUST be one of:

  • ASCII-only case-sensitive (ACS)
  • ASCII-only case-insensitive (ACI)
  • Unicode case-sensitive (UniCS)
  • Unicode case-insensitive using Unicode C+F case folding (UniCF)
  • Unicode case-insensitive locale-specific case-folding (UniLoc)

[S][I] Specifications and implementations MUST NOT specify ASCII-only case-sensitive or case-insensitive (ACS or ACI) forms for values or constructs that permit non-ASCII characters.
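One way to picture the matching forms listed above is as an explicit parameter to a comparison function. The following is a sketch only: the names MatchForm and matches are hypothetical, the ASCII-only forms reject non-ASCII input as a simple stand-in for the restriction above, and UniLoc is left unimplemented because the Python standard library offers no locale-tailored case folding.

```python
from enum import Enum

class MatchForm(Enum):
    ACS = "ASCII-only case-sensitive"
    ACI = "ASCII-only case-insensitive"
    UNICS = "Unicode case-sensitive"
    UNICF = "Unicode case-insensitive (case folding)"
    UNILOC = "Unicode case-insensitive (locale-specific)"

# Maps A-Z to a-z and leaves every other code point untouched.
_ASCII_FOLD = {upper: upper + 32 for upper in range(ord("A"), ord("Z") + 1)}

def matches(a: str, b: str, form: MatchForm) -> bool:
    """Compare a and b using the given matching form."""
    if form in (MatchForm.ACS, MatchForm.ACI):
        if not (a.isascii() and b.isascii()):
            raise ValueError("ASCII-only forms must not be used with non-ASCII values")
        if form is MatchForm.ACI:
            return a.translate(_ASCII_FOLD) == b.translate(_ASCII_FOLD)
        return a == b
    if form is MatchForm.UNICS:
        return a == b
    if form is MatchForm.UNICF:
        return a.casefold() == b.casefold()
    # Assumption: locale tailoring would need an external library.
    raise NotImplementedError("UniLoc requires locale-specific tailoring")
```

For example, matches("Content-Type", "content-type", MatchForm.ACI) is true, while the same call with MatchForm.ACS is false.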

Natural Language Processing (textual content not defined by a formal language)

This section relates to how user-agents process natural language text. For example, the "find" command in a browser or the way a search feature might work:

...

[S][I] Specifications or implementations MAY use locale- or language-dependent case folding for matching.

... case fold matching... ... normalization forms for string search... ... promiscuous vs. non-promiscuous matching...

One "string search" algorithm might include:

  • For each input string and for each item in the search corpus:
  1. Transcode to a sequence of Unicode code points
  2. Remove all markup
  3. Unescape all entity and numeric character references
  4. Normalize input string to Unicode Normalization Form NFKD (compatibility decomposition). This removes differences that might be important; NFD may be substituted.
  5. Transliterate any Hiragana to Katakana
  6. Remove all characters: //:format://
  7. Coalesce all sequences of the following character classes to a single space: //:control://:separator://:surrogate://
  8. (optional) Coalesce symbol, punctuation to space
  9. Unicode case fold to form UniCF
  10. (optional) Remove combining marks (Unicode General Category Mn) whose script is Common or Latin
  11. Normalize to Unicode Normalization Form C (NFC)
  • Perform substring matching
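The steps above can be sketched in Python under several stated assumptions. Steps 1 to 3 (transcoding, markup removal, unescaping) are assumed to have been done already, optional step 8 is omitted, and optional step 10 is approximated by removing only the Combining Diacritical Marks block (U+0300 to U+036F), because the standard library does not expose the Unicode Script property. The names hiragana_to_katakana, search_key, and contains are hypothetical.

```python
import unicodedata

def hiragana_to_katakana(text: str) -> str:
    # Hiragana U+3041..U+3096 sit exactly 0x60 below their Katakana counterparts.
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch for ch in text
    )

def search_key(text: str, strip_marks: bool = False) -> str:
    """Reduce text to a form suitable for loose substring matching."""
    text = unicodedata.normalize("NFKD", text)       # step 4
    text = hiragana_to_katakana(text)                # step 5
    pieces = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cf":                         # step 6: drop format characters
            continue
        if category in ("Cc", "Cs") or category.startswith("Z"):
            # step 7: runs of control, separator, surrogate become one space
            if not pieces or pieces[-1] != " ":
                pieces.append(" ")
            continue
        pieces.append(ch)
    text = "".join(pieces).casefold()                # step 9
    if strip_marks:                                  # step 10 (approximation, see above)
        text = "".join(ch for ch in text if not "\u0300" <= ch <= "\u036f")
    return unicodedata.normalize("NFC", text)        # step 11

def contains(corpus_item: str, query: str) -> bool:
    """Substring match on the reduced forms of both strings."""
    return search_key(query) in search_key(corpus_item)
```

With this sketch, contains("ひらがな と カタカナ", "かたかな") is true because both strings are transliterated to Katakana before matching, and search_key("résumé", strip_marks=True) yields "resume".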

Non-Normalizing Specifications

Any specification that does not specify normalization explicitly (and all new specifications) is required to follow this set of requirements:

[S] Specifications that do not normalize MUST document or provide a health-warning if canonically equivalent but disjoint Unicode character sequences represent a security issue.

[S][I] Specifications and implementations MUST NOT assume that content is in any particular normalization form. The normalization form or lack of normalization for any given content has to be considered intentional.

[S][I] For namespaces and values that are restricted to the Basic Latin (ASCII) subset of Unicode, ACI and ACS matching MAY be specified.

[S][I] For namespaces and values that are not restricted to Basic Latin (ASCII), case-insensitive matching MUST specify either UniCF or locale-sensitive string comparison.

[I] Implementations MUST NOT alter the normalization form of content being exchanged, read, parsed, or processed as content might depend on the de-normalized representation.

[S] Specifications MUST specify that string matching takes the form of "code point-by-code point" comparison of the Unicode character sequence, or, if a specific Unicode character encoding is specified, code unit-by-code unit comparison of the sequences.
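A brief Python sketch of the two comparison styles follows. The function names are hypothetical, and the default encoding argument is only an example. Note that under either style, canonically equivalent but differently encoded sequences do not match, which is the intended behavior for non-normalizing specifications.

```python
def code_point_equal(a: str, b: str) -> bool:
    """Compare two strings code point by code point, with no normalization."""
    return [ord(ch) for ch in a] == [ord(ch) for ch in b]

def code_unit_equal(a: str, b: str, encoding: str = "utf-8") -> bool:
    """Compare the code unit sequences of two strings in a specific encoding."""
    return a.encode(encoding) == b.encode(encoding)

print(code_point_equal("\u00e9", "e\u0301"))   # False: equivalent, but not identical
print(code_unit_equal("abc", "abc", "utf-16-le"))  # True
```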

Unicode Normalizing Specifications

For specifications of text-based formats and protocols that have already defined Unicode Normalization as a requirement, the following requirements apply:

[S] Specifications of text-based formats and protocols MAY, as part of their syntax definition, require that the text be in normalized form. Any such specification MUST define string matching in terms of normalized string comparison and MUST define the normalized form to be NFC.

[S] [I] A text-processing component which receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.
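The "confirm through inspection or re-normalize" rule could look like the following in Python. This is a sketch, not a prescribed implementation: ensure_nfc is a hypothetical name, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata

def ensure_nfc(suspect: str) -> str:
    """Return text that is known to be NFC before normalization-sensitive use."""
    if unicodedata.is_normalized("NFC", suspect):
        return suspect                            # confirmed by inspection
    return unicodedata.normalize("NFC", suspect)  # re-normalized

received = "e\u0301cole"          # arrives in decomposed form
safe = ensure_nfc(received)
print(safe == "\u00e9cole")       # True
```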

[I] A text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.

[S] Specifications of text-based languages and protocols SHOULD define precisely the construct boundaries necessary to obtain a complete definition of full-normalization. These definitions SHOULD include at least the boundaries between markup and character data as well as entity boundaries (if the language has any include mechanism), SHOULD include any other boundary that may create denormalization when instances of the language are processed, but SHOULD NOT include character escapes designed to express arbitrary characters.

[I] Authoring tool implementations for a (formal) language that does not mandate full-normalization SHOULD either prevent users from creating content with composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup, or SHOULD warn users when they do so.

[S] Where operations may produce unnormalized output from normalized text input, specifications of API components (functions/methods) that implement these operations MUST define whether normalization is the responsibility of the caller or the callee. Specifications MAY state that performing normalization is optional for some API components; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off. Specifications SHOULD NOT make the implementation of normalization optional.
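Concatenation is one operation that can produce unnormalized output from normalized input. The sketch below shows a hypothetical API component, join_text, in which the callee normalizes by default and the caller must pass an explicit option to switch normalization off.

```python
import unicodedata
from typing import Iterable

def join_text(parts: Iterable[str], *, normalize: bool = True) -> str:
    """Concatenate parts; the callee normalizes to NFC unless told not to."""
    joined = "".join(parts)
    return unicodedata.normalize("NFC", joined) if normalize else joined

# Each part is NFC on its own, but their concatenation is not.
parts = ["e", "\u0301"]
print(join_text(parts) == "\u00e9")                      # True
print(join_text(parts, normalize=False) == "e\u0301")    # True
```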

[S] Specifications that define a mechanism (for example an API or a defining language) for producing textual data objects SHOULD require that the final output of this mechanism be normalized.

Obsolete?

[S] [I] Specifications and implementations MUST document any deviation from the above requirements.

[S] Specifications MUST document any known security issues related to normalization.