Charmod norm

Charmod Normalization

C300 [C] Text content SHOULD be in fully-normalized form and if not SHOULD at least be in include-normalized form.

C301 [S] Specifications of text-based formats and protocols SHOULD, as part of their syntax definition, require that the text be in normalized form.

C302 [S] [I] A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.
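
As an illustration of C302, the following minimal sketch inspects suspect text and re-normalizes it only when the inspection fails, before a normalization-sensitive operation (here, counting code points) is run. It assumes Unicode NFC as a stand-in for the fully-normalized form, which is a simplification; the function names are hypothetical, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata


def ensure_nfc(text: str) -> str:
    """Return text in NFC, re-normalizing only if inspection shows it is needed."""
    # Assumption: NFC stands in for "normalized form" in this sketch.
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)


def code_point_length(suspect: str) -> int:
    """A normalization-sensitive operation: counting code points."""
    return len(ensure_nfc(suspect))
```

With this sketch, "e" followed by U+0301 COMBINING ACUTE ACCENT and the precomposed U+00E9 both report a length of 1.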

C303 [I] A text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.

C304 [S] Specifications of text-based languages and protocols SHOULD define precisely the construct boundaries necessary to obtain a complete definition of full-normalization. These definitions SHOULD include at least the boundaries between markup and character data as well as entity boundaries (if the language has any include mechanism), SHOULD include any other boundary that may create denormalization when instances of the language are processed, but SHOULD NOT include character escapes designed to express arbitrary characters.

C305 [C] Even when authoring in a (formal) language that does not mandate full-normalization, content developers SHOULD avoid composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup.

C306 [I] Authoring tool implementations for a (formal) language that does not mandate full-normalization SHOULD either prevent users from creating content with composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup, or SHOULD warn users when they do so.
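
The check that C305 and C306 ask authors and authoring tools to make could be sketched as follows. The function name is hypothetical, and the test is an approximation: it treats only characters with a non-zero canonical combining class as composing characters, which is narrower than the Charmod definition.

```python
import unicodedata


def starts_with_composing(construct: str) -> bool:
    """Report whether a construct begins with a combining character."""
    # Assumption: non-zero canonical combining class approximates
    # "composing character"; the full definition covers more cases.
    return bool(construct) and unicodedata.combining(construct[0]) != 0
```

An authoring tool might run this on the content that follows each tag or inclusion point and warn the user when it returns True.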

C307 [I] Implementations which transcode text from a legacy encoding to a Unicode encoding form SHOULD use a normalizing transcoder.
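
A normalizing transcoder in the sense of C307 can be approximated by decoding and then normalizing in one step. This is only a sketch: the function name is hypothetical, and NFC is assumed as the target form. It matters mostly for legacy encodings that store combining marks as separate bytes (Windows-1258 is a commonly cited case).

```python
import unicodedata


def transcode_normalized(data: bytes, legacy_encoding: str) -> str:
    """Decode legacy-encoded bytes and return the text in NFC."""
    # Assumption: NFC is the normalization form the receiver expects.
    return unicodedata.normalize("NFC", data.decode(legacy_encoding))
```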

C308 [S] Where operations may produce unnormalized output from normalized text input, specifications of API components (functions/methods) that implement these operations MUST define whether normalization is the responsibility of the caller or the callee. Specifications MAY state that performing normalization is optional for some API components; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off. Specifications SHOULD NOT make the implementation of normalization optional.
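
One possible reading of C308, sketched for a concatenation function: normalization is the callee's responsibility by default, and the caller must pass an explicit option to switch it off. The function and parameter names are hypothetical, and NFC is again assumed as the normalization form.

```python
import unicodedata


def concat(left: str, right: str, *, normalize: bool = True) -> str:
    """Concatenate two strings; the result is NFC unless normalize=False."""
    joined = left + right
    # Concatenation can denormalize: "e" and U+0301 are each NFC on their
    # own, but their concatenation is not.
    return unicodedata.normalize("NFC", joined) if normalize else joined
```

Calling concat("e", "\u0301") yields the single character U+00E9, whereas concat("e", "\u0301", normalize=False) returns the two code points unchanged and leaves normalization to the caller.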

C309 [S] Specifications that define a mechanism (for example an API or a defining language) for producing textual data objects SHOULD require that the final output of this mechanism be normalized.

C310 [S] [I] Specifications and implementations MUST document any deviation from the above requirements.

C311 [S] Specifications MUST document any known security issues related to normalization.

New normalization

C300 [C] Language constructs SHOULD be in fully-normalized form and if not SHOULD at least be in include-normalized form.

This wording tries to capture the idea that markup should be normalized at all times, while people remain free to store unnormalized 'content' if they want. Unfortunately, it needs more careful thought. For example, are class attribute values content or markup (cf. title attribute values)? In practice, describing this distinction may turn out to be very difficult.

String identity matching

C312 [S] [I] String identity matching MUST be performed as if the following steps were followed:

  1. Early uniform normalization to fully-normalized form, as defined in Section 3.2.4: Fully-normalized text. In accordance with Section 3: Normalization, this step MUST be performed by the producers of the strings to be compared.
  2. Conversion to a common Unicode encoding form, if necessary.
  3. Expansion of all recognized character escapes and includes.
  4. Testing for bit-by-bit identity.
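
The four steps above can be sketched as follows. Several things here are assumptions rather than part of the requirement: the function name is hypothetical, the strings arrive as bytes with a declared encoding, and the only escapes recognized are HTML-style character references, expanded with html.unescape. Step 1 is deliberately not repeated, because C312 assigns it to the producers.

```python
import html


def identity_match(a: bytes, a_encoding: str, b: bytes, b_encoding: str) -> bool:
    """Compare two encoded strings following the C312 steps."""
    # Step 1 (precondition): producers have already normalized both strings.
    # Step 2: convert to a common form (Python str, a code point sequence).
    text_a = a.decode(a_encoding)
    text_b = b.decode(b_encoding)
    # Step 3: expand recognized escapes (assumed: HTML character references).
    text_a = html.unescape(text_a)
    text_b = html.unescape(text_b)
    # Step 4: exact comparison of the resulting code point sequences.
    return text_a == text_b
```

Because the sketch does not re-normalize, "cafe" followed by U+0301 does not match precomposed "caf\u00e9". A producer that skips step 1 therefore gets a mismatch, which is the behavior the requirement implies.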

C313 [S] [I] Forms of string matching other than identity matching SHOULD be performed as if the following steps were followed:

  1. Steps 1 to 3 for string identity matching.
  2. Matching the strings in a way that is appropriate to the application.
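
As one example of C313, a caseless match might be layered on the same preparation steps. The sketch below assumes the inputs are already decoded and normalized by their producers, recognizes only HTML-style character references as escapes, and uses str.casefold as the application-appropriate comparison; the function name is hypothetical.

```python
import html


def caseless_match(a: str, b: str) -> bool:
    """Match two strings without regard to case, after escape expansion."""
    # Steps 1 and 2 are assumed done: inputs are normalized Python strings.
    # Step 3: expand recognized escapes (assumed: HTML character references).
    a, b = html.unescape(a), html.unescape(b)
    # Application-specific matching: full Unicode case folding.
    return a.casefold() == b.casefold()
```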