CharmodNormProposal2013

From Internationalization
Revision as of 21:10, 8 March 2013 by Aphillip

General Requirements

[C] Text content SHOULD be stored and exchanged in Unicode Normalization Form C (NFC).

NOTE: To be processed correctly, content must use a consistent sequence of code points to represent text. Content can be in any normalization form, or can use a de-normalized (but valid) Unicode character sequence, but implementations will treat canonically equivalent sequences that use different code points as different strings. The best way to ensure consistent selection, access, extraction, processing, or display is to always use NFC.
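As an illustration of the note above, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `to_nfc` is hypothetical and only serves this example.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings are canonically equivalent but use different code points,
# so a plain comparison treats them as different.
print(composed == decomposed)  # False


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# After normalizing both sides to NFC the representations agree.
print(to_nfc(composed) == to_nfc(decomposed))  # True
```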

[C] Identifiers in content SHOULD use consistent case (upper, lower, mixed case) to facilitate matching.

[I] Implementations which transcode text from a legacy encoding to a Unicode encoding form SHOULD use a normalizing transcoder that produces Unicode Normalization Form C (NFC).
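A normalizing transcoder could be sketched roughly as follows. The function name `transcode_to_nfc` is hypothetical, and windows-1258 (`cp1258`) is chosen here only as an assumed example of a legacy encoding that encodes Vietnamese tone marks as separate combining characters, so a plain decode yields decomposed text.

```python
import unicodedata


def transcode_to_nfc(raw: bytes, legacy_encoding: str) -> str:
    """Decode legacy bytes, then normalize the result to NFC."""
    text = raw.decode(legacy_encoding)
    return unicodedata.normalize("NFC", text)


# Build sample legacy bytes: "a" followed by a combining acute accent.
raw = "a\u0301".encode("cp1258")

plain = raw.decode("cp1258")            # two code points, not NFC
normalized = transcode_to_nfc(raw, "cp1258")

print(len(plain), len(normalized))
```

A transcoder that stops at the plain decode step hands decomposed text to every downstream consumer; normalizing once at the transcoding boundary avoids that.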

[C] Content developers SHOULD avoid composing characters at the beginning of constructs that might be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup.
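A content checker might flag such constructs along these lines. This is only a sketch: the function name is hypothetical, and it approximates "composing character" with the Unicode general categories Mn, Mc, and Me, which is an assumption rather than the exact definition.

```python
import unicodedata


def starts_with_composing_character(fragment: str) -> bool:
    """Return True if the first code point is a combining mark (Mn, Mc, Me).

    Approximation: general category "M*" stands in for "composing character".
    """
    if not fragment:
        return False
    return unicodedata.category(fragment[0]).startswith("M")


# Text placed immediately after markup that begins with a combining acute
# accent would visually attach to whatever precedes it once included.
print(starts_with_composing_character("\u0301tude"))
print(starts_with_composing_character("\u00e9tude"))
```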

[S] Specifications of text-based formats and protocols MAY specify that the format or protocol requires content to be in Unicode Normalization Form C (NFC).

NOTE: Specifying NFC requires additional care on the part of the specification developer, as content on the Web generally is not in a known normalization state. Boundary and error conditions for denormalized content ought to be carefully considered and well specified in these cases.

[S] Specifications SHOULD NOT specify case-insensitive comparison of strings.

[S] Specifications that specify case-insensitive comparison SHOULD specify either Unicode C+F case folding or locale-specific tailorings thereof.
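For illustration, Python's `str.casefold()` applies Unicode full case folding without locale tailoring, so a caseless comparison can be sketched as below. The function name is hypothetical.

```python
def unicode_caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after Unicode full case folding (no locale tailoring)."""
    return a.casefold() == b.casefold()


# Full case folding maps "ß" to "ss", which simple lowercasing does not.
print(unicode_caseless_equal("Straße", "STRASSE"))  # True
print("Straße".lower() == "STRASSE".lower())        # False
```

Locale-specific tailorings (for example the Turkish dotted and dotless i) are not covered by this sketch and would need a locale-aware library.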

[S][I] Specifications and implementations that define string matching as part of the definition of a format, protocol, or formal language (which might include operations such as parsing, matching, tokenizing, etc.) MUST define the criteria and matching forms used. These MUST be one of:

  • ASCII-only case-sensitive (ACS)
  • ASCII-only case-insensitive (ACI)
  • Unicode case-sensitive (UniCS)
  • Unicode case-insensitive using Unicode C+F case folding (UniCF)
  • Unicode case-insensitive locale-specific case-folding (UniLoc)

[S][I] Specifications and implementations MUST NOT specify ASCII-only case-sensitive or case-insensitive (ACS or ACI) forms for values or constructs that permit non-ASCII characters.
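The matching forms listed above could be modelled roughly as follows. This is a sketch under stated assumptions: the `MatchForm` enum and `matches` function are hypothetical names, UniLoc is left out because the Python standard library has no locale-tailored case folding, and rejecting non-ASCII input for ACS and ACI is one possible way to reflect the restriction on ASCII-only forms.

```python
from enum import Enum


class MatchForm(Enum):
    ACS = "ASCII-only case-sensitive"
    ACI = "ASCII-only case-insensitive"
    UNICS = "Unicode case-sensitive"
    UNICF = "Unicode case-insensitive using case folding"


# Translation table that lowercases only the ASCII letters A-Z.
_ASCII_FOLD = {ord(c): ord(c.lower()) for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}


def matches(a: str, b: str, form: MatchForm) -> bool:
    """Compare a and b under the given matching form (hypothetical helper)."""
    if form in (MatchForm.ACS, MatchForm.ACI):
        # ASCII-only forms are not meaningful for non-ASCII values.
        if not (a.isascii() and b.isascii()):
            raise ValueError("ASCII-only matching form applied to non-ASCII value")
    if form in (MatchForm.ACS, MatchForm.UNICS):
        return a == b
    if form is MatchForm.ACI:
        return a.translate(_ASCII_FOLD) == b.translate(_ASCII_FOLD)
    return a.casefold() == b.casefold()  # UNICF


print(matches("Content-Type", "content-type", MatchForm.ACI))
print(matches("\u00c9cole", "\u00e9cole", MatchForm.UNICF))
```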


Non-Normalizing Specifications

Any specification that does not specify normalization explicitly (and all new specifications) is required to follow this set of requirements:

[S] Specifications that do not normalize MUST document or provide a health-warning if canonically equivalent but disjoint Unicode character sequences represent a security issue.
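To see why this can matter for security, consider identifiers that render identically but differ in code points. The sketch below groups such identifiers for an audit; the function name is hypothetical, and normalization is used only as a grouping key, so the stored identifiers themselves are left untouched.

```python
import unicodedata
from collections import defaultdict


def canonically_equivalent_groups(identifiers):
    """Return groups of distinct identifiers that are canonically equivalent."""
    groups = defaultdict(set)
    for ident in identifiers:
        # NFC is used only as a grouping key, not written back anywhere.
        groups[unicodedata.normalize("NFC", ident)].add(ident)
    return [sorted(group) for group in groups.values() if len(group) > 1]


names = ["caf\u00e9", "cafe\u0301", "cafe"]
collisions = canonically_equivalent_groups(names)
print(collisions)
```

In a non-normalizing format, the first two names are different identifiers even though users cannot tell them apart visually.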

[S][I] Specifications and implementations MUST NOT assume that content is in any particular normalization form. The normalization form or lack of normalization for any given content has to be considered intentional.

[S][I] For namespaces and values that are restricted to the US-ASCII subset of Unicode, ACI and ACS matching MAY be specified.

[S][I] For namespaces and values that are not restricted to US-ASCII, case-insensitive matching MUST specify either UniCF or locale-sensitive string comparison.

[I] Implementations MUST NOT alter the normalization form of content being exchanged, read, parsed, or processed, as content might depend on the de-normalized representation.

[S] Specifications MUST specify that string matching takes the form of "codepoint-by-codepoint" comparison of the Unicode character sequence, or, if a specific Unicode character encoding is specified, "byte-by-byte" (or rather code unit-by-code unit) comparison of the sequences.
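The two comparison styles can be sketched as follows. Both function names are hypothetical; Python's built-in `==` on `str` already compares code points, so the first function simply makes that explicit.

```python
def codepoints_equal(a: str, b: str) -> bool:
    """Compare two strings code point by code point."""
    return [ord(ch) for ch in a] == [ord(ch) for ch in b]


def code_units_equal(a: str, b: str, encoding: str = "utf-8") -> bool:
    """Compare two strings code unit by code unit in a given encoding form."""
    return a.encode(encoding) == b.encode(encoding)


# No normalization is applied, so canonically equivalent but differently
# encoded sequences do not match under either comparison.
print(codepoints_equal("\u00e9", "e\u0301"))
print(code_units_equal("\u00e9", "\u00e9", "utf-16-le"))
```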

Unicode Normalizing Specifications

For specifications of text-based formats and protocols that have already defined Unicode Normalization as a requirement, the following requirements apply:

[S] Specifications of text-based formats and protocols MAY, as part of their syntax definition, require that the text be in normalized form. Any such specification MUST define string matching in terms of normalized string comparison and MUST define the normalized form to be NFC.

[S] [I] A text-processing component which receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.
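A component receiving suspect text might guard its normalization-sensitive operations along these lines. The function name `ensure_nfc` is hypothetical; `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata


def ensure_nfc(suspect: str) -> str:
    """Return the text unchanged if already NFC, otherwise re-normalize it."""
    if unicodedata.is_normalized("NFC", suspect):
        return suspect  # confirmed by inspection
    return unicodedata.normalize("NFC", suspect)


# Only after this step is a normalization-sensitive operation such as
# counting code points or comparing strings performed.
received = "re\u0301sume\u0301"
checked = ensure_nfc(received)
print(len(received), len(checked))
```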

[I] A text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.
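One straightforward, though not necessarily the most efficient, way to satisfy this is to re-normalize after every edit. The `NormalizedBuffer` class below is a hypothetical sketch; its indices refer to code points of the current normalized text.

```python
import unicodedata


class NormalizedBuffer:
    """Text buffer that is kept in NFC after every modification (sketch)."""

    def __init__(self, text: str = "") -> None:
        self._text = unicodedata.normalize("NFC", text)

    @property
    def text(self) -> str:
        return self._text

    def insert(self, index: int, fragment: str) -> None:
        edited = self._text[:index] + fragment + self._text[index:]
        self._text = unicodedata.normalize("NFC", edited)

    def delete(self, start: int, end: int) -> None:
        edited = self._text[:start] + self._text[end:]
        self._text = unicodedata.normalize("NFC", edited)

    def find(self, needle: str) -> int:
        # Normalization-sensitive operation: always sees normalized text.
        return self._text.find(unicodedata.normalize("NFC", needle))


buf = NormalizedBuffer("cafe")
buf.insert(4, "\u0301")        # appending a combining accent recomposes to "é"
print(buf.text, buf.find("\u00e9"))
```

Deletion can denormalize text as well: removing the "b" from "ab" followed by a combining acute accent leaves "a" plus the accent, which the buffer recomposes.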

[S] Specifications of text-based languages and protocols SHOULD define precisely the construct boundaries necessary to obtain a complete definition of full-normalization. These definitions SHOULD include at least the boundaries between markup and character data as well as entity boundaries (if the language has any include mechanism), SHOULD include any other boundary that may create denormalization when instances of the language are processed, but SHOULD NOT include character escapes designed to express arbitrary characters.

[I] Authoring tool implementations for a (formal) language that does not mandate full-normalization SHOULD either prevent users from creating content with composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup, or SHOULD warn users when they do so.

[S] Where operations may produce unnormalized output from normalized text input, specifications of API components (functions/methods) that implement these operations MUST define whether normalization is the responsibility of the caller or the callee. Specifications MAY state that performing normalization is optional for some API components; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off. Specifications SHOULD NOT make the implementation of normalization optional.
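As a sketch of such an API component, concatenation is an operation that can produce unnormalized output from normalized input. In the hypothetical function below the callee normalizes by default, and an explicit keyword option switches normalization off.

```python
import unicodedata


def concat(left: str, right: str, *, normalize: bool = True) -> str:
    """Concatenate two strings; the callee normalizes to NFC unless told not to."""
    joined = left + right
    return unicodedata.normalize("NFC", joined) if normalize else joined


# Both inputs are individually in NFC, but their concatenation is not.
print(len(concat("e", "\u0301")))                   # normalized by default
print(len(concat("e", "\u0301", normalize=False)))  # caller opted out
```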

[S] Specifications that define a mechanism (for example an API or a defining language) for producing textual data objects SHOULD require that the final output of this mechanism be normalized.

Obsolete?

[S] [I] Specifications and implementations MUST document any deviation from the above requirements.

[S] Specifications MUST document any known security issues related to normalization.